Re: [PATCH v5 3/3] tcp: allow retransmit when peer receive window is zero

17 May 2024


      ...
On Thu, 16 May 2024 22:39:53 -0400
Jon Maloy  wrote:
...
On 2024-05-16 18:21, Stefano Brivio wrote:
...
On Thu, 16 May 2024 17:33:28 -0400
Jon Maloy  wrote:
...
A bug in kernel TCP may lead to a deadlock where a zero window is sent
from the peer, while it is unable to send out window updates even after
reads have freed up enough buffer space to permit a larger window.
In this situation, new window advertisemnts from the peer can only be
triggered by packets arriving from this side.
However, such packets are never sent, because the zero-window condition
currently prevents this side from sending out any packets whatsoever
to the peer.
We notice that the above bug is triggered *only* after the peer has
dropped an arriving packet because of severe memory squeeze, and that we
hence always enter a retransmission situation when this occurs. This
also means that it goes against the RFC 9293 recommendation that a
previously advertised window never should shrink.
RFC 9293 gives the solution to this situation. In chapter 3.6.1 we find
the following statement:
"A TCP receiver SHOULD NOT shrink the window, i.e., move the right
window edge to the left (SHLD-14). However, a sending TCP peer MUST
be robust against window shrinking, which may cause the
"usable window" (see Section 3.8.6.2.1) to become negative (MUST-34).
If this happens, the sender SHOULD NOT send new data (SHLD-15), but
SHOULD retransmit normally the old unacknowledged data between SND.UNA
and SND.UNA+SND.WND (SHLD-16). The sender MAY also retransmit old data
beyond SND.UNA+SND.WND (MAY-7)"
We never see the window become negative, but we interpret this as a
recommendation to use the previously available window during
retransmission even when the currently advertised window is zero.
We use the above mechanism only for timer-induced retransmits, while
the fast-retransmit mechanism won't trigger on this condition.
It should be noted that although this solves the problem we have at
hand, it is not a genuine solution to the kernel bug. There may well
be TCP stacks around in other OS-es which don't do this, nor have
keep-alive probing as an alternatve way to solve the situation.
Signed-off-by: Jon Maloy 
---
v2: - Using previously advertised window during retransmission, instead
        highest send sequencece number in the cycle.
v3: - Rebased to newest code
      - Changes based on feedback from PASST team
      - Sending out empty probe message at timer expiration when
        we are not in retransmit situation.
v4: - Some small changes based on feedback from PASST team.
      - Replaced fast retransmit with a one-time 'fast probe' when
        window is zero.
v5: - Gave up on 'fast probing' for now. When I got the sequence
        numbers right in the flag message (after having emptied the tap
        queue), it turns out an empty message does *not* force a new peer
        window update as was my previous understanding of the code.
      - Added cppcheck suppression line (which I was unable to verify)
...hmm, why? All it takes is a 'make cppcheck'. Or is your version of
cppcheck not showing it?
Exactly. My version is too recent, I think.
...
as suggested by S. Brivio.
      - Removed sending of empty probe when window from tap side is zero.
        It looks pointless at the moment, at least for solving the above
        described situation.
So, as far as I understand, this patch now doesn't introduce any
functional change, except that if an existing timer happened to be
pending (this patch never schedules it),
   and the window _we_ are
advertising (wnd_to_tap) is zero (I'm not sure why that's relevant),
then we'll send an ACK segment.
The change is that we really will do a retransmit from the timer handler,
ignoring that current window is zero.
...but how does the window *we* advertise matter here? You're checking
conn->wnd_to_tap now, not conn->wnd_from_tap. Anyway, if you're now
abandoning this approach, it doesn't matter.
It doesn't matter at all. I stated in the the comment that I had removed
On 2024-05-17 08:15, Stefano Brivio wrote:
this one, but forgot to do  it. :-)
...
...
This solves the deadlock problem
I have been describing, but it is not a good solution.
One problem is that new incoming data will cause an EPOLLIN event,
which will cause a call to tcp_data_from_sock(), which will also ignore
the zero-window.
Right now it doesn't ignore it. Do you plan to use that function to
check if the usable window is now negative (because the window value is
zero but we have unacknowledged data), and re-send something in that
case?
Yes, it is ignoring it, because max_send is based on seq_wnd_edge_from_tap,
which wasn't moved when we received the zero-window adverisement.
...
...
This is easy to fix, I realize:
 if (events & EPOLLIN && conn->wnd_from_tap)
                 tcp_data_from_sock(c, conn);
This is my plan.
...
...
The next is that I would like to start a short timer when I receive
the zero-window and seq_to_tap > seq_ack_from_tap (same condition
as in previous versions, but without sending the flag packet).
As far as I can see there is no simple way to schedule the timer to
the value I want. I am open to suggestions here.
You could reuse/abuse ACK_TO_TAP_DUE (after all, you want to send a
keep-alive in a while), and schedule a different timeout in
tcp_timer_ctl() if the peer window is zero.
Then, in the timer handler, depending on the peer window, you would
check if you want to use ACK_IF_NEEDED (no keepalive), or ACK (force a
keepalive in any case).
As it is now, ACK_FROM_TAP_DUE with associated timer is always set when
we receive the zero-window, so it kind of works. The downside is that I
see a glitch of 1-2 seconds now and then until traffic recovers.
Trying with 10 ms as you suggest is much better. I'll give it a try.

///jon