[PATCH] tcp_splice: splice() all we have to the writing side, not what we just read
In tcp_splice_sock_handler(), we try to calculate how much we can move
from the pipe to the writing socket: if we just read some bytes, we'll
use that amount, but if we haven't, we just try to empty the pipe.
However, if we just read something, that doesn't mean that that's all
the data we have on the pipe, as it's obvious from this sequence, where:
pasta: epoll event on connected spliced TCP socket 54 (events: 0x00000001)
Flow 0 (TCP connection (spliced)): 98304 from read-side call
Flow 0 (TCP connection (spliced)): 33615 from write-side call (passed 98304)
Flow 0 (TCP connection (spliced)): -1 from read-side call
Flow 0 (TCP connection (spliced)): -1 from write-side call (passed 524288)
Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:580
Flow 0 (TCP connection (spliced)): OUT_WAIT_0
we first pile up 98304 - 33615 = 64689 pending bytes, that we read but
couldn't write, as the receiver buffer is full, and we set the
corresponding OUT_WAIT flag. Then:
pasta: epoll event on connected spliced TCP socket 54 (events: 0x00000001)
Flow 0 (TCP connection (spliced)): 32768 from read-side call
Flow 0 (TCP connection (spliced)): -1 from write-side call (passed 32768)
Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:580
we splice() 32768 more bytes from our receiving side to the pipe. At
some point:
pasta: epoll event on connected spliced TCP socket 49 (events: 0x00000004)
Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:489
Flow 0 (TCP connection (spliced)): ~OUT_WAIT_0
Flow 0 (TCP connection (spliced)): 1320 from read-side call
Flow 0 (TCP connection (spliced)): 1320 from write-side call (passed 1320)
the receiver is signalling to us that it's ready for more data
(EPOLLOUT). We reset the OUT_WAIT flag, read 1320 more bytes from
our receiving socket into the pipe, and that's what we write to the
receiver, forgetting about the pending 97457 bytes we had, which the
receiver might never get (not the same 97547 bytes: we'll actually
send 1320 of those).
This condition is rather hard to reproduce, and it was observed with
Podman pulling container images via HTTPS. In the traces above, the
client is side 0 (the initiating peer), and the server is sending the
whole data.
Instead of splicing from pipe to socket the amount of data we just
read, we need to splice all the pending data we piled up until that
point. We could do that using 'read' and 'written' counters, but
there's actually no need, as the kernel also keeps track of how much
data is available on the pipe.
So, to make this simple and more robust, just give the whole pipe size
as length to splice(). The kernel knows what to do with it.
Later in the function, we used 'to_write' for an optimisation meant
to reduce wakeups which retries right away to splice() in both
directions if we couldn't write to the receiver the whole amount of
pending data. Calculate a 'pending' value instead, only if we reach
that point.
Now that we check for the actual amount of pending data in that
optimisation, we need to make sure we don't compare a zero or negative
'written' value: if we met that, it means that the receiver signalled
end-of-file, an error, or to try again later. In those three cases,
the optimisation doesn't make any sense, so skip it.
Reported-by: Ed Santiago
On Thu, Oct 24, 2024 at 04:11:10PM +0200, Stefano Brivio wrote:
In tcp_splice_sock_handler(), we try to calculate how much we can move from the pipe to the writing socket: if we just read some bytes, we'll use that amount, but if we haven't, we just try to empty the pipe.
However, if we just read something, that doesn't mean that that's all the data we have on the pipe, as it's obvious from this sequence, where:
pasta: epoll event on connected spliced TCP socket 54 (events: 0x00000001) Flow 0 (TCP connection (spliced)): 98304 from read-side call Flow 0 (TCP connection (spliced)): 33615 from write-side call (passed 98304) Flow 0 (TCP connection (spliced)): -1 from read-side call Flow 0 (TCP connection (spliced)): -1 from write-side call (passed 524288) Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:580 Flow 0 (TCP connection (spliced)): OUT_WAIT_0
we first pile up 98304 - 33615 = 64689 pending bytes, that we read but couldn't write, as the receiver buffer is full, and we set the corresponding OUT_WAIT flag. Then:
pasta: epoll event on connected spliced TCP socket 54 (events: 0x00000001) Flow 0 (TCP connection (spliced)): 32768 from read-side call Flow 0 (TCP connection (spliced)): -1 from write-side call (passed 32768) Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:580
we splice() 32768 more bytes from our receiving side to the pipe. At some point:
pasta: epoll event on connected spliced TCP socket 49 (events: 0x00000004) Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:489 Flow 0 (TCP connection (spliced)): ~OUT_WAIT_0 Flow 0 (TCP connection (spliced)): 1320 from read-side call Flow 0 (TCP connection (spliced)): 1320 from write-side call (passed 1320)
the receiver is signalling to us that it's ready for more data (EPOLLOUT). We reset the OUT_WAIT flag, read 1320 more bytes from our receiving socket into the pipe, and that's what we write to the receiver, forgetting about the pending 97457 bytes we had, which the receiver might never get (not the same 97547 bytes: we'll actually send 1320 of those).
This condition is rather hard to reproduce, and it was observed with Podman pulling container images via HTTPS. In the traces above, the client is side 0 (the initiating peer), and the server is sending the whole data.
Instead of splicing from pipe to socket the amount of data we just read, we need to splice all the pending data we piled up until that point. We could do that using 'read' and 'written' counters, but there's actually no need, as the kernel also keeps track of how much data is available on the pipe.
So, to make this simple and more robust, just give the whole pipe size as length to splice(). The kernel knows what to do with it.
Later in the function, we used 'to_write' for an optimisation meant to reduce wakeups which retries right away to splice() in both directions if we couldn't write to the receiver the whole amount of pending data. Calculate a 'pending' value instead, only if we reach that point.
Now that we check for the actual amount of pending data in that optimisation, we need to make sure we don't compare a zero or negative 'written' value: if we met that, it means that the receiver signalled end-of-file, an error, or to try again later. In those three cases, the optimisation doesn't make any sense, so skip it.
Reported-by: Ed Santiago
Reported-by: Paul Holzinger Analysed-by: Paul Holzinger Link: https://github.com/containers/podman/issues/24219 Signed-off-by: Stefano Brivio
Reviewed-by: David Gibson
--- tcp_splice.c | 16 ++++++---------- 1 file changed, 6 insertions(+), 10 deletions(-)
diff --git a/tcp_splice.c b/tcp_splice.c index 9f5cc27..f112cfe 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -503,7 +503,7 @@ swap: lowat_act_flag = RCVLOWAT_ACT(fromsidei);
while (1) { - ssize_t readlen, to_write = 0, written; + ssize_t readlen, written, pending; int more = 0;
retry: @@ -518,14 +518,11 @@ retry:
if (errno != EAGAIN) goto close; - - to_write = c->tcp.pipe_size; } else if (!readlen) { eof = 1; - to_write = c->tcp.pipe_size; } else { never_read = 0; - to_write += readlen; + if (readlen >= (long)c->tcp.pipe_size * 90 / 100) more = SPLICE_F_MORE;
@@ -535,10 +532,10 @@ retry:
eintr: written = splice(conn->pipe[fromsidei][0], NULL, - conn->s[!fromsidei], NULL, to_write, + conn->s[!fromsidei], NULL, c->tcp.pipe_size, SPLICE_F_MOVE | more | SPLICE_F_NONBLOCK); flow_trace(conn, "%zi from write-side call (passed %zi)", - written, to_write); + written, c->tcp.pipe_size);
/* Most common case: skip updating counters. */ if (readlen > 0 && readlen == written) { @@ -584,10 +581,9 @@ eintr: if (never_read && written == (long)(c->tcp.pipe_size)) goto retry;
- if (!never_read && written < to_write) { - to_write -= written; + pending = conn->read[fromsidei] - conn->written[fromsidei]; + if (!never_read && written > 0 && written < pending) goto retry; - }
if (eof) break;
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
participants (2)
-
David Gibson
-
Stefano Brivio