[PATCH 0/8] splice() forwarding cleanups
As we discussed on an earlier call, while fixing bug 202 I noticed a number of warts in the surrounding splice() forwarding code. Most are just things that are longer or harder to follow than they need to be, but in some cases there may be real (if unlikely to trigger) bugs. Here's a collection of fixes. David Gibson (8): tcp_splice: Remove never-invoked SO_RCVLOWAT logic tcp_splice: Simplify EPOLLRDHUP / eof / FIN handling tcp_splice: Improve EOF exit condition for the loop tcp_splice: Remove goto from forwarding loop tcp_splice: Simplify shutdown(2) handling tcp_splice: Simplify / correct OUT_WAIT flag handling tcp_splice: Remove questionable "optimisation" of pending bytes tracking tcp_splice: Exit forwarding earlier when stalled read side tcp_splice.c | 100 ++++++++++++++++----------------------------------- 1 file changed, 31 insertions(+), 69 deletions(-) -- 2.54.0
The forwarding look in tcp_splice_forward() has a retry label that we goto
in some cases. However, the only difference between a 'goto retry' and
a 'continue' is that the 'continue' will reset the 'more' variable to 0.
The fist goto retry only occurs if never_read is set, which can only be
the case if we never changed 'more' in the first place, so is strictly
equivalent to a continue. In the second case, 'more' can be set though.
'more' is set by a heuristic that if we're able to read most of a pipe's
worth of data at once, there's probably more coming, so we should prepare
the write-side for that. However, on a goto retry we have a new read side
splice. If this time we *don't* get most of a pipe's worth of data, that
suggests that contrary to expectations from the previous loop we have now
temporarily run out of input data and so SPLICE_F_MORE is no longer
a good guess for the next write side splice(). In other words, the second
read-splice() gives us better data for the heuristic than keeping our guess
from the first one, so resetting 'more' is valuable.
So, we could replace both gotos with continues. But they're already at the
end the loop body, so a continue is a no-op. Just remove them. That, in
turn removes the need for the never_read variable.
Signed-off-by: David Gibson
In tcp_splice_forward() we exit the forwarding loop if we have an EOF on
the read side. However, this potentially leaves data in the pipe, even if
the write side hasn't yet blocked. It's not clear to me whether this could
leave data indefinitely in the pipe with no events to keep it moving,
but it's not clear to me that it couldn't either.
Stay in the loop until either the write side blocks or we've emptied
the pipe.
Secondly, this test is after several tests on how much we wrote which
might also cause a retry. However, if we've reached EOF and the pipe is
empty, there's nothing more to do, regardless of how much we wrote, so
we should exit, regardless of those conditions. So move this exit test
above the retry conditions.
Signed-off-by: David Gibson
There are two ways we can tell one of our sockets has received a FIN. We
can either see an EPOLLRDHUP epoll event, or we can get a zero-length read
(EOF) on the socket. We currently use both, in a mildly confusing way:
we only set the FIN_RCVD() flag based on the EPOLLRDHUP event, but then
some other close out logic is based on seeing an EOF.
Simplify this by setting the flag based on only the EOF. To make sure we
don't miss an event if we get an EPOLLRDHUP with no data, we trigger the
forwarding path for EPOLLRDHUP as well as EPOLLIN.
Signed-off-by: David Gibson
tcp_splice_forward() contains some logic to use the SO_RCVLOWAT
setsockopt(). This appears to be aimed at interrupt (epoll) mitigation, so
that we're not always waking for a socket that's getting frequent small
amounts of data.
However, the logic is never invoked, and hasn't been since at least
2022_07_14.b86cd00: it's conditional on
readlen > (long)c->tcp.pipe_size / 10
However, immediately before that we've invoked 'continue' if:
readlen >= (long)c->tcp_pipe_size * 10 / 100
which is a strictly weaker condition.
While it's possible we want to restore a working version of that interrupt
mitigation at some point, for the time being this logic just confuses the
picture and makes some other cleanups more awkward. We haven't had it
for over 3 years, so it's clearly not vital.
Signed-off-by: David Gibson
At the end of our loop we have a conditional 'break' that exits if we're
at EOF on the read side and have nothing left in the pipe. This doesn't
depend on anything write-side, so we can move it earlier, avoiding an
unnecessary write side splice in this case.
Furthermore, there's also nothing to be done write side if we've hit EAGAIN
on the read side and the pipe is empty, so exit early for that case as
well.
Signed-off-by: David Gibson
We set the OUT_WAIT flag if we stop forwarding due to EAGAIN, but there's
still data in the pipe. That ensures we wake up when the output socket has
room to drain the pipe into.
We clear the OUT_WAIT flag when we complete forwarding on an EPOLLOUT
event, but that's not quite right. Even though it's called on an EPOLLOUT,
tcp_splice_forward() could, in principle empty the pipe, but also read
enough new data from the other side to fill it again. That would set
OUT_WAIT internally, but it would be cleared after returning meaning
we could miss a necessary wakeup.
The condition on whether we need write side wakeups is actually fairly
simple: we need them if and only if we return to the main loop with data
in the pipe. Maintain that in a single place - right after we exit the
forwarding loop in tcp_splice_forward().
Signed-off-by: David Gibson
At the end of tcp_splice_forward(), we check for half-closed connections
in either direction and propagate the FIN to the other side with a
shutdown(2).
However, it's unnecessary to check both directions: a FIN from side X will
cause an EPOLLRDUP on side X's socket, which will trigger
tcp_splice_forward() from side X to side !X. Likewise for the other side.
So we only need to check for "forward" FIN propagation.
Signed-off-by: David Gibson
We have a special path that avoids updating conn->pending when the amounts
read and written are equal. This has a conceptual complexity cost, in
particular, it means that conn->pending[] is not accurate to its normal
meaning for a section of the loop body.
conn->pending[] shares a cacheline with conn->pipe[] and conn->s[], so it's
almost certainly cache-hot. It's questionable that avoiding the update
of pending even outweighs the extra conditional branch, let alone saves
anything of significance. Remove it.
This allows us to move the updates to conn->pending closer to the actual
splice() calls, making it easier to reason about its value. It also lets
us move the conn->pending updates so they can piggy back on existing tests
rather than needing a conditional expression to avoid clobbering it when
splice() returns -1 (EAGAIN).
Signed-off-by: David Gibson
participants (1)
-
David Gibson