On 2025-01-18 15:04, Neal Cardwell wrote:
On Fri, Jan 17, 2025 at 4:41 PM
<jmaloy(a)redhat.com> wrote:
From: Jon Maloy <jmaloy(a)redhat.com>
Testing with iperf3 using the "pasta" protocol splicer has revealed
a bug in the way tcp handles window advertising in extreme memory
squeeze situations.
Under memory pressure, a socket endpoint may temporarily advertise
a zero-sized window, but this is not stored as part of the socket data.
The reasoning behind this is that it is considered a temporary setting
which shouldn't influence any further calculations.
However, if we happen to stall at an unfortunate value of the current
window size, the algorithm selecting a new value will consistently fail
to advertise a non-zero window once we have freed up enough memory.
The "if we happen to stall at an unfortunate value of the current
window size" phrase is a little vague... :-) Do you have a sense of
what might count as "unfortunate" here? That might help in crafting a
packetdrill test to reproduce this and have an automated regression
test.
Obviously, it happens when the following code snippet in
__tcp_cleanup_rbuf() {
[....]
if (copied > 0 && !time_to_ack &&
!(sk->sk_shutdown & RCV_SHUTDOWN)) {
__u32 rcv_window_now = tcp_receive_window(tp);
/* Optimize, __tcp_select_window() is not cheap. */
if (2*rcv_window_now <= tp->window_clamp) {
__u32 new_window = __tcp_select_window(sk);
/* Send ACK now, if this read freed lots of space
* in our buffer. Certainly, new_window is new window.
* We can advertise it now, if it is not less than
* current one.
* "Lots" means "at least twice" here.
*/
if (new_window && new_window >= 2 * rcv_window_now)
time_to_ack = true;
}
}
[....]
}
yields time_to_ack = false, i.e. __tcp_select_window(sk) returns
a value new_window < (2 * tcp_receive_window(tp)).
In my log I have for brevity used the following names:
win_now: same as rcv_window_now
(= tcp_receive_window(tp),
= tp->rcv_wup + tp->rcv_wnd - tp->rcv_nxt,
= 265469200 + 262144 - 265600160,
= 131184)
new_win: same as new_window
(= __tcp_select_window(sk),
= 0 first time, later 262144 )
rcv_wnd: same as tp->rcv_wnd,
(=262144)
We see that although the last test actually is pretty close
(262144 >= 262368 ? => false) it is not close enough.
We also notice that
(tp->rcv_nxt - tp->rcv_wup) = (265600160 - 265469200) = 130960.
130960 < tp->rcv_wnd / 2, so the last test in __tcp_cleanup_rbuf():
(new_window >= 2 * rcv_window_now) will always be false.
Too me it looks like __tcp_select_window(sk) doesn't at all take the
freed-up memory into account when calculating a new window. I haven't
looked into why that is happening.
This means that this side's notion of the
current window size is
different from the one last advertised to the peer, causing the latter
to not send any data to resolve the sitution.
Since the peer last saw a zero receive window at the time of the
memory-pressure drop, shouldn't the peer be sending repeated zero
window probes, and shouldn't the local host respond to a ZWP with an
ACK with the correct non-zero window?
It should, but at the moment when I found this bug the peer stack was
not the Linux kernel stack, but one we develop for our own purpose. We
fixed that later, but it still means that traffic stops for a couple of
seconds now and then before the timer restarts the flow. This happens
too often for comfort in our usage scenarios.
We can of course blame the the peer stack, but I still feel this is a
bug, and that it could be handled better by the kernel stack.
Do you happen to have a tcpdump .pcap of one of these cases that you
can share?
I had one, although not for this particular run, and I cannot find it
right now. I will continue looking or make a new one. Is there some
shared space I can put it?
Here it is. Look at frame #1067.
The problem occurs on the iperf3 server side, and
the socket in question
is a completely regular socket with the default settings for the
fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket.
The following excerpt of a logging session, with own comments added,
shows more in detail what is happening:
// tcp_v4_rcv(->)
// tcp_rcv_established(->)
[5201<->39222]: ==== Activating log @ net/ipv4/tcp_input.c/
tcp_data_queue()/5257 ====
[5201<->39222]: tcp_data_queue(->)
[5201<->39222]: DROPPING skb [265600160..265665640], reason:
SKB_DROP_REASON_PROTO_MEM
[rcv_nxt 265600160, rcv_wnd 262144, snt_ack
265469200, win_now 131184]
What is "win_now"? That doesn't seem to correspond to any variable
name in the Linux source tree.
See above.
Can this be renamed to the
tcp_select_window() variable it is printing, like
"cur_win" or
"effective_win" or "new_win", etc?
Or perhaps you can attach your debugging patch in some email thread? I
agree with Eric that these debug dumps are a little hard to parse
without seeing the patch that allows us to understand what some of
these fields are...
I agree with Eric that probably tp->pred_flags should be cleared, and
a packetdrill test for this would be super-helpful.
I must admit I have never used packetdrill, but I can make an effort.
I hear from other sources that you cannot force a memory exhaustion with
packetdrill anyway, so this sounds like a pointless exercise.