On Thu, 16 May 2024 22:39:53 -0400 Jon Maloy <jmaloy(a)redhat.com> wrote:On 2024-05-16 18:21, Stefano Brivio wrote:...but how does the window *we* advertise matter here? You're checking conn->wnd_to_tap now, not conn->wnd_from_tap. Anyway, if you're now abandoning this approach, it doesn't matter.On Thu, 16 May 2024 17:33:28 -0400 Jon Maloy <jmaloy(a)redhat.com> wrote:Exactly. My version is too recent, I think.A bug in kernel TCP may lead to a deadlock where a zero window is sent from the peer, while it is unable to send out window updates even after reads have freed up enough buffer space to permit a larger window. In this situation, new window advertisemnts from the peer can only be triggered by packets arriving from this side. However, such packets are never sent, because the zero-window condition currently prevents this side from sending out any packets whatsoever to the peer. We notice that the above bug is triggered *only* after the peer has dropped an arriving packet because of severe memory squeeze, and that we hence always enter a retransmission situation when this occurs. This also means that it goes against the RFC 9293 recommendation that a previously advertised window never should shrink. RFC 9293 gives the solution to this situation. In chapter 3.6.1 we find the following statement: "A TCP receiver SHOULD NOT shrink the window, i.e., move the right window edge to the left (SHLD-14). However, a sending TCP peer MUST be robust against window shrinking, which may cause the "usable window" (see Section 3.8.6.2.1) to become negative (MUST-34). If this happens, the sender SHOULD NOT send new data (SHLD-15), but SHOULD retransmit normally the old unacknowledged data between SND.UNA and SND.UNA+SND.WND (SHLD-16). The sender MAY also retransmit old data beyond SND.UNA+SND.WND (MAY-7)" We never see the window become negative, but we interpret this as a recommendation to use the previously available window during retransmission even when the currently advertised window is zero. We use the above mechanism only for timer-induced retransmits, while the fast-retransmit mechanism won't trigger on this condition. It should be noted that although this solves the problem we have at hand, it is not a genuine solution to the kernel bug. There may well be TCP stacks around in other OS-es which don't do this, nor have keep-alive probing as an alternatve way to solve the situation. Signed-off-by: Jon Maloy <jmaloy(a)redhat.com> --- v2: - Using previously advertised window during retransmission, instead highest send sequencece number in the cycle. v3: - Rebased to newest code - Changes based on feedback from PASST team - Sending out empty probe message at timer expiration when we are not in retransmit situation. v4: - Some small changes based on feedback from PASST team. - Replaced fast retransmit with a one-time 'fast probe' when window is zero. v5: - Gave up on 'fast probing' for now. When I got the sequence numbers right in the flag message (after having emptied the tap queue), it turns out an empty message does *not* force a new peer window update as was my previous understanding of the code. - Added cppcheck suppression line (which I was unable to verify)...hmm, why? All it takes is a 'make cppcheck'. Or is your version of cppcheck not showing it?The change is that we really will do a retransmit from the timer handler, ignoring that current window is zero.as suggested by S. Brivio. - Removed sending of empty probe when window from tap side is zero. It looks pointless at the moment, at least for solving the above described situation.So, as far as I understand, this patch now doesn't introduce any functional change, except that if an existing timer happened to be pending (this patch never schedules it), and the window _we_ are advertising (wnd_to_tap) is zero (I'm not sure why that's relevant), then we'll send an ACK segment.This solves the deadlock problem I have been describing, but it is not a good solution. One problem is that new incoming data will cause an EPOLLIN event, which will cause a call to tcp_data_from_sock(), which will also ignore the zero-window.Right now it doesn't ignore it. Do you plan to use that function to check if the usable window is now negative (because the window value is zero but we have unacknowledged data), and re-send something in that case?This is easy to fix, I realize: if (events & EPOLLIN && conn->wnd_from_tap) tcp_data_from_sock(c, conn); The next is that I would like to start a short timer when I receive the zero-window and seq_to_tap > seq_ack_from_tap (same condition as in previous versions, but without sending the flag packet). As far as I can see there is no simple way to schedule the timer to the value I want. I am open to suggestions here.You could reuse/abuse ACK_TO_TAP_DUE (after all, you want to send a keep-alive in a while), and schedule a different timeout in tcp_timer_ctl() if the peer window is zero. Then, in the timer handler, depending on the peer window, you would check if you want to use ACK_IF_NEEDED (no keepalive), or ACK (force a keepalive in any case). -- Stefano
On 2024-05-17 08:15, Stefano Brivio wrote:On Thu, 16 May 2024 22:39:53 -0400 Jon Maloy <jmaloy(a)redhat.com> wrote:It doesn't matter at all. I stated in the the comment that I had removed this one, but forgot to do it. :-)On 2024-05-16 18:21, Stefano Brivio wrote:...but how does the window *we* advertise matter here? You're checking conn->wnd_to_tap now, not conn->wnd_from_tap. Anyway, if you're now abandoning this approach, it doesn't matter.On Thu, 16 May 2024 17:33:28 -0400 Jon Maloy <jmaloy(a)redhat.com> wrote:Exactly. My version is too recent, I think.A bug in kernel TCP may lead to a deadlock where a zero window is sent from the peer, while it is unable to send out window updates even after reads have freed up enough buffer space to permit a larger window. In this situation, new window advertisemnts from the peer can only be triggered by packets arriving from this side. However, such packets are never sent, because the zero-window condition currently prevents this side from sending out any packets whatsoever to the peer. We notice that the above bug is triggered *only* after the peer has dropped an arriving packet because of severe memory squeeze, and that we hence always enter a retransmission situation when this occurs. This also means that it goes against the RFC 9293 recommendation that a previously advertised window never should shrink. RFC 9293 gives the solution to this situation. In chapter 3.6.1 we find the following statement: "A TCP receiver SHOULD NOT shrink the window, i.e., move the right window edge to the left (SHLD-14). However, a sending TCP peer MUST be robust against window shrinking, which may cause the "usable window" (see Section 3.8.6.2.1) to become negative (MUST-34). If this happens, the sender SHOULD NOT send new data (SHLD-15), but SHOULD retransmit normally the old unacknowledged data between SND.UNA and SND.UNA+SND.WND (SHLD-16). The sender MAY also retransmit old data beyond SND.UNA+SND.WND (MAY-7)" We never see the window become negative, but we interpret this as a recommendation to use the previously available window during retransmission even when the currently advertised window is zero. We use the above mechanism only for timer-induced retransmits, while the fast-retransmit mechanism won't trigger on this condition. It should be noted that although this solves the problem we have at hand, it is not a genuine solution to the kernel bug. There may well be TCP stacks around in other OS-es which don't do this, nor have keep-alive probing as an alternatve way to solve the situation. Signed-off-by: Jon Maloy <jmaloy(a)redhat.com> --- v2: - Using previously advertised window during retransmission, instead highest send sequencece number in the cycle. v3: - Rebased to newest code - Changes based on feedback from PASST team - Sending out empty probe message at timer expiration when we are not in retransmit situation. v4: - Some small changes based on feedback from PASST team. - Replaced fast retransmit with a one-time 'fast probe' when window is zero. v5: - Gave up on 'fast probing' for now. When I got the sequence numbers right in the flag message (after having emptied the tap queue), it turns out an empty message does *not* force a new peer window update as was my previous understanding of the code. - Added cppcheck suppression line (which I was unable to verify)...hmm, why? All it takes is a 'make cppcheck'. Or is your version of cppcheck not showing it?The change is that we really will do a retransmit from the timer handler, ignoring that current window is zero.as suggested by S. Brivio. - Removed sending of empty probe when window from tap side is zero. It looks pointless at the moment, at least for solving the above described situation.So, as far as I understand, this patch now doesn't introduce any functional change, except that if an existing timer happened to be pending (this patch never schedules it), and the window _we_ are advertising (wnd_to_tap) is zero (I'm not sure why that's relevant), then we'll send an ACK segment.Yes, it is ignoring it, because max_send is based on seq_wnd_edge_from_tap, which wasn't moved when we received the zero-window adverisement.This solves the deadlock problem I have been describing, but it is not a good solution. One problem is that new incoming data will cause an EPOLLIN event, which will cause a call to tcp_data_from_sock(), which will also ignore the zero-window.Right now it doesn't ignore it. Do you plan to use that function to check if the usable window is now negative (because the window value is zero but we have unacknowledged data), and re-send something in that case?> This is easy to fix, I realize: > > if (events & EPOLLIN && conn->wnd_from_tap) > tcp_data_from_sock(c, conn);This is my plan.As it is now, ACK_FROM_TAP_DUE with associated timer is always set when we receive the zero-window, so it kind of works. The downside is that I see a glitch of 1-2 seconds now and then until traffic recovers. Trying with 10 ms as you suggest is much better. I'll give it a try. ///jonThe next is that I would like to start a short timer when I receive the zero-window and seq_to_tap > seq_ack_from_tap (same condition as in previous versions, but without sending the flag packet). As far as I can see there is no simple way to schedule the timer to the value I want. I am open to suggestions here.You could reuse/abuse ACK_TO_TAP_DUE (after all, you want to send a keep-alive in a while), and schedule a different timeout in tcp_timer_ctl() if the peer window is zero. Then, in the timer handler, depending on the peer window, you would check if you want to use ACK_IF_NEEDED (no keepalive), or ACK (force a keepalive in any case).