[PATCH 0/8] splice() forwarding cleanups

David Gibson

28 May 2026

7:02 a.m.

As we discussed on an earlier call, while fixing bug 202 I noticed a number of warts in the surrounding splice() forwarding code. Most are just things that are longer or harder to follow than they need to be, but in some cases there may be real (if unlikely to trigger) bugs. Here's a collection of fixes.

David Gibson (8):
  tcp_splice: Remove never-invoked SO_RCVLOWAT logic
  tcp_splice: Simplify EPOLLRDHUP / eof / FIN handling
  tcp_splice: Improve EOF exit condition for the loop
  tcp_splice: Remove goto from forwarding loop
  tcp_splice: Simplify shutdown(2) handling
  tcp_splice: Simplify / correct OUT_WAIT flag handling
  tcp_splice: Remove questionable "optimisation" of pending bytes tracking
  tcp_splice: Exit forwarding earlier when stalled read side

 tcp_splice.c | 100 ++++++++++++++++-----------------------------------
 1 file changed, 31 insertions(+), 69 deletions(-)

-- 
2.54.0

David Gibson

28 May

7:02 a.m.

New subject: [PATCH 4/8] tcp_splice: Remove goto from forwarding loop

The forwarding loop in tcp_splice_forward() has a retry label that we goto in some cases. However, the only difference between a 'goto retry' and a 'continue' is that the 'continue' will reset the 'more' variable to 0.

The first goto retry only occurs if never_read is set, which can only be the case if we never changed 'more' in the first place, so it is strictly equivalent to a continue.

In the second case, 'more' can be set though. 'more' is set by a heuristic that if we're able to read most of a pipe's worth of data at once, there's probably more coming, so we should prepare the write-side for that. However, on a goto retry we have a new read side splice. If this time we *don't* get most of a pipe's worth of data, that suggests that contrary to expectations from the previous loop we have now temporarily run out of input data and so SPLICE_F_MORE is no longer a good guess for the next write side splice(). In other words, the second read-splice() gives us better data for the heuristic than keeping our guess from the first one, so resetting 'more' is valuable.

So, we could replace both gotos with continues. But they're already at the end of the loop body, so a continue is a no-op. Just remove them. That, in turn, removes the need for the never_read variable.
Signed-off-by: David Gibson --- tcp_splice.c | 11 ----------- 1 file changed, 11 deletions(-) diff --git a/tcp_splice.c b/tcp_splice.c index 943dc214..8c8e3bbb 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -476,13 +476,11 @@ static int tcp_splice_forward(struct ctx *c, { uint8_t lowat_set_flag = RCVLOWAT_SET(fromsidei); uint8_t lowat_act_flag = RCVLOWAT_ACT(fromsidei); - int never_read = 1; while (1) { ssize_t readlen, written; int more = 0; -retry: do readlen = splice(conn->s[fromsidei], NULL, conn->pipe[fromsidei][1], NULL, @@ -502,8 +500,6 @@ retry: if (!readlen) { conn_event(conn, FIN_RCVD(fromsidei)); } else if (readlen > 0) { - never_read = 0; - if (readlen >= (long)c->tcp.pipe_size * 90 / 100) more = SPLICE_F_MORE; @@ -546,13 +542,6 @@ retry: if (conn->events & FIN_RCVD(fromsidei) && !conn->pending[fromsidei]) break; - - if (never_read && written == (long)(c->tcp.pipe_size)) - goto retry; - - if (!never_read && written > 0 && - written < conn->pending[fromsidei]) - goto retry; } if (!conn->pending[fromsidei] && -- 2.54.0
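As a rough illustration of the heuristic this commit message describes, the sketch below recomputes the "more is coming" guess from each read-side result, so a short read drops the hint again. The helper name expect_more() is hypothetical and not part of tcp_splice.c; only the 90% threshold is taken from the diff above.

```c
#include <stdbool.h>
#include <sys/types.h>	/* ssize_t */

/* Hypothetical helper: decide whether the next write-side splice()
 * should be hinted that more data is likely to follow.
 *
 * The guess is recomputed from every read-side result, which is what
 * resetting 'more' at the top of each loop iteration achieves.
 */
static bool expect_more(ssize_t readlen, long pipe_size)
{
	/* Errors (-1) and EOF (0) never suggest more data */
	if (readlen <= 0)
		return false;

	/* "Most of a pipe's worth": at least 90% of the pipe size */
	return readlen >= pipe_size * 90 / 100;
}
```

A caller following the same pattern as the patch would pass SPLICE_F_MORE to the write-side splice() only when expect_more() returns true for the most recent read.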

David Gibson

7:02 a.m.

New subject: [PATCH 3/8] tcp_splice: Improve EOF exit condition for the loop

In tcp_splice_forward() we exit the forwarding loop if we have an EOF on the read side. However, this potentially leaves data in the pipe, even if the write side hasn't yet blocked. It's not clear to me whether this could leave data indefinitely in the pipe with no events to keep it moving, but it's not clear to me that it couldn't either. Stay in the loop until either the write side blocks or we've emptied the pipe. Secondly, this test is after several tests on how much we wrote which might also cause a retry. However, if we've reached EOF and the pipe is empty, there's nothing more to do, regardless of how much we wrote, so we should exit, regardless of those conditions. So move this exit test above the retry conditions. Signed-off-by: David Gibson --- tcp_splice.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/tcp_splice.c b/tcp_splice.c index 25e5d097..943dc214 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -543,15 +543,16 @@ retry: break; } + if (conn->events & FIN_RCVD(fromsidei) && + !conn->pending[fromsidei]) + break; + if (never_read && written == (long)(c->tcp.pipe_size)) goto retry; if (!never_read && written > 0 && written < conn->pending[fromsidei]) goto retry; - - if (conn->events & FIN_RCVD(fromsidei)) - break; } if (!conn->pending[fromsidei] && -- 2.54.0

David Gibson

7:02 a.m.

New subject: [PATCH 2/8] tcp_splice: Simplify EPOLLRDHUP / eof / FIN handling

There are two ways we can tell one of our sockets has received a FIN. We can either see an EPOLLRDHUP epoll event, or we can get a zero-length read (EOF) on the socket. We currently use both, in a mildly confusing way: we only set the FIN_RCVD() flag based on the EPOLLRDHUP event, but then some other close out logic is based on seeing an EOF. Simplify this by setting the flag based on only the EOF. To make sure we don't miss an event if we get an EPOLLRDHUP with no data, we trigger the forwarding path for EPOLLRDHUP as well as EPOLLIN. Signed-off-by: David Gibson --- tcp_splice.c | 14 +++++--------- 1 file changed, 5 insertions(+), 9 deletions(-) diff --git a/tcp_splice.c b/tcp_splice.c index c066d689..25e5d097 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -477,7 +477,6 @@ static int tcp_splice_forward(struct ctx *c, uint8_t lowat_set_flag = RCVLOWAT_SET(fromsidei); uint8_t lowat_act_flag = RCVLOWAT_ACT(fromsidei); int never_read = 1; - int eof = 0; while (1) { ssize_t readlen, written; @@ -501,7 +500,7 @@ retry: flow_trace(conn, "%zi from read-side call", readlen); if (!readlen) { - eof = 1; + conn_event(conn, FIN_RCVD(fromsidei)); } else if (readlen > 0) { never_read = 0; @@ -551,11 +550,12 @@ retry: written < conn->pending[fromsidei]) goto retry; - if (eof) + if (conn->events & FIN_RCVD(fromsidei)) break; } - if (!conn->pending[fromsidei] && eof) { + if (!conn->pending[fromsidei] && + conn->events & FIN_RCVD(fromsidei)) { unsigned sidei; flow_foreach_sidei(sidei) { @@ -620,17 +620,13 @@ void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref, goto reset; } - if (events & EPOLLRDHUP) - /* For side 0 this is fake, but implied */ - conn_event(conn, FIN_RCVD(evsidei)); - if (events & EPOLLOUT) { if (tcp_splice_forward(c, conn, !evsidei, now)) goto reset; conn_event(conn, ~OUT_WAIT(evsidei)); } - if (events & EPOLLIN) { + if (events & (EPOLLIN | EPOLLRDHUP)) { if (tcp_splice_forward(c, conn, evsidei, now)) goto reset; } -- 2.54.0
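The "zero-length read means FIN" signal that this patch relies on can be observed in isolation. The following is a minimal sketch, assuming a Linux or other POSIX system; it uses an AF_UNIX socketpair() as a stand-in for a TCP connection, and the function name read_after_peer_fin() is made up for this example.

```c
#include <sys/socket.h>
#include <unistd.h>

/* Half-close one end of a stream socket pair and read from the other.
 *
 * Returns the read() result on the receiving end: 0 means EOF, which
 * is the condition the patch uses to set FIN_RCVD(). Returns -1 on any
 * setup or read error.
 */
static ssize_t read_after_peer_fin(void)
{
	char buf[16];
	ssize_t n;
	int sv[2];

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;

	/* Peer closes its sending direction, analogous to sending a FIN */
	if (shutdown(sv[0], SHUT_WR) < 0) {
		close(sv[0]);
		close(sv[1]);
		return -1;
	}

	/* No data was queued, so this reports end-of-file right away */
	n = read(sv[1], buf, sizeof(buf));

	close(sv[0]);
	close(sv[1]);
	return n;
}
```

This only demonstrates the EOF half of the picture; it does not show the epoll side, where the patch additionally runs the forwarding path on EPOLLRDHUP so that a FIN arriving with no data is not missed.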

David Gibson

7:02 a.m.

New subject: [PATCH 1/8] tcp_splice: Remove never-invoked SO_RCVLOWAT logic

tcp_splice_forward() contains some logic to use the SO_RCVLOWAT setsockopt(). This appears to be aimed at interrupt (epoll) mitigation, so that we're not always waking for a socket that's getting frequent small amounts of data.

However, the logic is never invoked, and hasn't been since at least 2022_07_14.b86cd00: it's conditional on
	readlen > (long)c->tcp.pipe_size / 10
However, immediately before that we've invoked 'continue' if:
	readlen >= (long)c->tcp.pipe_size * 10 / 100
which is a strictly weaker condition.

While it's possible we want to restore a working version of that interrupt mitigation at some point, for the time being this logic just confuses the picture and makes some other cleanups more awkward. We haven't had it for over 3 years, so it's clearly not vital.

Signed-off-by: David Gibson --- tcp_splice.c | 22 +--------------------- 1 file changed, 1 insertion(+), 21 deletions(-) diff --git a/tcp_splice.c b/tcp_splice.c index 1f969a5c..c066d689 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -530,28 +530,8 @@ retry: written, c->tcp.pipe_size); /* Most common case: skip updating count of pending bytes */ - if (readlen > 0 && readlen == written) { - if (readlen >= (long)c->tcp.pipe_size * 10 / 100) - continue; - - if (!(conn->flags & lowat_set_flag) && - readlen > (long)c->tcp.pipe_size / 10) { - int lowat = c->tcp.pipe_size / 4; - - if (setsockopt(conn->s[fromsidei], SOL_SOCKET, - SO_RCVLOWAT, - &lowat, sizeof(lowat))) { - flow_trace(conn, - "Setting SO_RCVLOWAT %i: %s", - lowat, strerror_(errno)); - } else { - conn_flag(conn, lowat_set_flag); - conn_flag(conn, lowat_act_flag); - } - } - + if (readlen > 0 && readlen == written) continue; - } conn->pending[fromsidei] += readlen > 0 ? readlen : 0; conn->pending[fromsidei] -= written > 0 ? written : 0; -- 2.54.0
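The unreachability argument can be checked mechanically. The sketch below models only the two threshold tests from the removed block (the flag check and the outer readlen == written test are left out), and counts how many input pairs would get past the 'continue' and still satisfy the SO_RCVLOWAT guard. Both function names are hypothetical.

```c
#include <stdbool.h>

/* Mirror the removed control flow: the 'continue' test comes first,
 * the guard on the SO_RCVLOWAT block second.
 */
static bool lowat_branch_reached(long readlen, long pipe_size)
{
	if (readlen >= pipe_size * 10 / 100)
		return false;		/* 'continue' taken */

	return readlen > pipe_size / 10;
}

/* Count (readlen, pipe_size) pairs, with 1 <= readlen <= pipe_size and
 * pipe_size <= max_pipe, for which the SO_RCVLOWAT block would run.
 */
static long count_reached(long max_pipe)
{
	long pipe_size, readlen, n = 0;

	for (pipe_size = 1; pipe_size <= max_pipe; pipe_size++)
		for (readlen = 1; readlen <= pipe_size; readlen++)
			n += lowat_branch_reached(readlen, pipe_size);

	return n;
}
```

With integer division, pipe_size * 10 / 100 and pipe_size / 10 give the same value, so any readlen above the second threshold has already met the first and the count stays at zero.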

David Gibson

7:02 a.m.

New subject: [PATCH 8/8] tcp_splice: Exit forwarding earlier when stalled read side

At the end of our loop we have a conditional 'break' that exits if we're at EOF on the read side and have nothing left in the pipe. This doesn't depend on anything write-side, so we can move it earlier, avoiding an unnecessary write side splice in this case. Furthermore, there's also nothing to be done write side if we've hit EAGAIN on the read side and the pipe is empty, so exit early for that case as well. Signed-off-by: David Gibson --- tcp_splice.c | 18 +++++++++++------- 1 file changed, 11 insertions(+), 7 deletions(-) diff --git a/tcp_splice.c b/tcp_splice.c index 565596d3..623ca926 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -497,9 +497,17 @@ static int tcp_splice_forward(struct ctx *c, flow_trace(conn, "%zi from read-side call", readlen); - if (!readlen) { - conn_event(conn, FIN_RCVD(fromsidei)); - } else if (readlen > 0) { + if (readlen <= 0) { + if (!readlen) /* EOF */ + conn_event(conn, FIN_RCVD(fromsidei)); + + /* We're either blocked or at EOF on the read side, and + * there's nothing in the pipe so there's nothing to do + * write side either. + */ + if (!conn->pending[fromsidei]) + break; + } else { conn->pending[fromsidei] += readlen; if (readlen >= (long)c->tcp.pipe_size * 90 / 100) @@ -530,10 +538,6 @@ static int tcp_splice_forward(struct ctx *c, break; conn->pending[fromsidei] -= written; - - if (conn->events & FIN_RCVD(fromsidei) && - !conn->pending[fromsidei]) - break; } /* We need write-side wakeups if and only if we have data in the pipe to -- 2.54.0

David Gibson

7:02 a.m.

New subject: [PATCH 6/8] tcp_splice: Simplify / correct OUT_WAIT flag handling

We set the OUT_WAIT flag if we stop forwarding due to EAGAIN, but there's still data in the pipe. That ensures we wake up when the output socket has room to drain the pipe into. We clear the OUT_WAIT flag when we complete forwarding on an EPOLLOUT event, but that's not quite right. Even though it's called on an EPOLLOUT, tcp_splice_forward() could, in principle empty the pipe, but also read enough new data from the other side to fill it again. That would set OUT_WAIT internally, but it would be cleared after returning meaning we could miss a necessary wakeup. The condition on whether we need write side wakeups is actually fairly simple: we need them if and only if we return to the main loop with data in the pipe. Maintain that in a single place - right after we exit the forwarding loop in tcp_splice_forward(). Signed-off-by: David Gibson --- tcp_splice.c | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/tcp_splice.c b/tcp_splice.c index 42902684..5f412584 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -531,19 +531,22 @@ static int tcp_splice_forward(struct ctx *c, conn->pending[fromsidei] += readlen > 0 ? readlen : 0; conn->pending[fromsidei] -= written > 0 ? written : 0; - if (written < 0) { - if (!conn->pending[fromsidei]) - break; - - conn_event(conn, OUT_WAIT(!fromsidei)); + if (written < 0) break; - } if (conn->events & FIN_RCVD(fromsidei) && !conn->pending[fromsidei]) break; } + /* We need write-side wakeups if and only if we have data in the pipe to + * drain. 
+ */ + if (conn->pending[fromsidei]) + conn_event(conn, OUT_WAIT(!fromsidei)); + else + conn_event(conn, ~OUT_WAIT(!fromsidei)); + if ((conn->events & FIN_RCVD(fromsidei)) && !(conn->events & FIN_SENT(!fromsidei)) && !conn->pending[fromsidei]) { @@ -606,7 +609,6 @@ void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref, if (events & EPOLLOUT) { if (tcp_splice_forward(c, conn, !evsidei, now)) goto reset; - conn_event(conn, ~OUT_WAIT(evsidei)); } if (events & (EPOLLIN | EPOLLRDHUP)) { -- 2.54.0

David Gibson

7:02 a.m.

New subject: [PATCH 5/8] tcp_splice: Simplify shutdown(2) handling

At the end of tcp_splice_forward(), we check for half-closed connections in either direction and propagate the FIN to the other side with a shutdown(2). However, it's unnecessary to check both directions: a FIN from side X will cause an EPOLLRDHUP on side X's socket, which will trigger tcp_splice_forward() from side X to side !X. Likewise for the other side. So we only need to check for "forward" FIN propagation.

Signed-off-by: David Gibson --- tcp_splice.c | 23 ++++++++--------------- 1 file changed, 8 insertions(+), 15 deletions(-) diff --git a/tcp_splice.c b/tcp_splice.c index 8c8e3bbb..42902684 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -544,22 +544,15 @@ static int tcp_splice_forward(struct ctx *c, break; } - if (!conn->pending[fromsidei] && - conn->events & FIN_RCVD(fromsidei)) { - unsigned sidei; - - flow_foreach_sidei(sidei) { - if ((conn->events & FIN_RCVD(sidei)) && - !(conn->events & FIN_SENT(!sidei))) { - if (shutdown(conn->s[!sidei], SHUT_WR) < 0) { - flow_perror_ratelimit( - conn, now, "shutdown() on %s", - pif_name(conn->f.pif[!sidei])); - return -1; - } - conn_event(conn, FIN_SENT(!sidei)); - } + if ((conn->events & FIN_RCVD(fromsidei)) && + !(conn->events & FIN_SENT(!fromsidei)) && + !conn->pending[fromsidei]) { + if (shutdown(conn->s[!fromsidei], SHUT_WR) < 0) { + flow_perror_ratelimit(conn, now, "shutdown() on %s", + pif_name(conn->f.pif[!fromsidei])); + return -1; } + conn_event(conn, FIN_SENT(!fromsidei)); } return 0; -- 2.54.0
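The simplified "forward only" condition from this patch can be restated as a small predicate. This is a sketch under assumptions: the bit layout of FIN_RCVD() and FIN_SENT() below is invented for the example and does not match the real event flags, and should_shutdown_peer() is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag layout, for illustration only */
#define FIN_RCVD(sidei)	(1U << (sidei))
#define FIN_SENT(sidei)	(1U << (2 + (sidei)))

/* Should the FIN seen on side 'fromsidei' be propagated to the other
 * side now? Only once: the FIN was received, it has not been sent on
 * yet, and the pipe for this direction is fully drained.
 */
static bool should_shutdown_peer(uint32_t events, unsigned fromsidei,
				 uint32_t pending)
{
	return (events & FIN_RCVD(fromsidei)) &&
	       !(events & FIN_SENT(!fromsidei)) &&
	       !pending;
}
```

Only the direction currently being forwarded is examined; the opposite direction is covered when its own socket reports EPOLLRDHUP and triggers a forwarding pass of its own.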

David Gibson

7:02 a.m.

New subject: [PATCH 7/8] tcp_splice: Remove questionable "optimisation" of pending bytes tracking

We have a special path that avoids updating conn->pending when the amounts read and written are equal. This has a conceptual complexity cost, in particular, it means that conn->pending[] is not accurate to its normal meaning for a section of the loop body. conn->pending[] shares a cacheline with conn->pipe[] and conn->s[], so it's almost certainly cache-hot. It's questionable that avoiding the update of pending even outweighs the extra conditional branch, let alone saves anything of significance. Remove it. This allows us to move the updates to conn->pending closer to the actual splice() calls, making it easier to reason about its value. It also lets us move the conn->pending updates so they can piggy back on existing tests rather than needing a conditional expression to avoid clobbering it when splice() returns -1 (EAGAIN). Signed-off-by: David Gibson --- tcp_splice.c | 11 ++++------- 1 file changed, 4 insertions(+), 7 deletions(-) diff --git a/tcp_splice.c b/tcp_splice.c index 5f412584..565596d3 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -500,6 +500,8 @@ static int tcp_splice_forward(struct ctx *c, if (!readlen) { conn_event(conn, FIN_RCVD(fromsidei)); } else if (readlen > 0) { + conn->pending[fromsidei] += readlen; + if (readlen >= (long)c->tcp.pipe_size * 90 / 100) more = SPLICE_F_MORE; @@ -524,16 +526,11 @@ static int tcp_splice_forward(struct ctx *c, flow_trace(conn, "%zi from write-side call (passed %zi)", written, c->tcp.pipe_size); - /* Most common case: skip updating count of pending bytes */ - if (readlen > 0 && readlen == written) - continue; - - conn->pending[fromsidei] += readlen > 0 ? readlen : 0; - conn->pending[fromsidei] -= written > 0 ? written : 0; - if (written < 0) break; + conn->pending[fromsidei] -= written; + if (conn->events & FIN_RCVD(fromsidei) && !conn->pending[fromsidei]) break; -- 2.54.0
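For reference, the accounting that results from this patch can be modelled for a single loop iteration as below. This is a simplified sketch: pending_after() is a hypothetical name, and the uint32_t counter type is an assumption, not something stated in the patch.

```c
#include <stdint.h>
#include <sys/types.h>	/* ssize_t */

/* Model the pending-bytes accounting for one loop iteration.
 *
 * 'readlen' and 'written' are the results of the read-side and
 * write-side splice() calls; negative values stand for errors such as
 * EAGAIN and must leave the counter untouched.
 */
static uint32_t pending_after(uint32_t pending, ssize_t readlen,
			      ssize_t written)
{
	/* Updated right after the read side, on the existing > 0 test */
	if (readlen > 0)
		pending += readlen;

	/* Updated right after the write side; in the patch the loop
	 * breaks on written < 0 before reaching the subtraction
	 */
	if (written >= 0)
		pending -= written;

	return pending;
}
```

Each update sits next to the test that already exists for that splice() result, so no conditional expression is needed to guard against a -1 return.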

Stefano Brivio

4 Jun

6:41 a.m.

New subject: [PATCH 6/8] tcp_splice: Simplify / correct OUT_WAIT flag handling

On Thu, 28 May 2026 15:02:11 +1000 David Gibson wrote:

...

We set the OUT_WAIT flag if we stop forwarding due to EAGAIN, but there's still data in the pipe. That ensures we wake up when the output socket has room to drain the pipe into.

We clear the OUT_WAIT flag when we complete forwarding on an EPOLLOUT event, but that's not quite right. Even though it's called on an EPOLLOUT, tcp_splice_forward() could, in principle empty the pipe, but also read enough new data from the other side to fill it again. That would set OUT_WAIT internally, but it would be cleared after returning meaning we could miss a necessary wakeup.

The current logic in tcp_splice_sock_handler():

	if (events & EPOLLOUT) {
		if (tcp_splice_forward(c, conn, !evsidei, now))
			goto reset;
		conn_event(conn, ~OUT_WAIT(evsidei));
	}

	if (events & EPOLLIN) {
		if (tcp_splice_forward(c, conn, evsidei, now))
			goto reset;
	}

would prevent the case you described, because if we read new data from the other side filling the pipe, we'll hit (events & EPOLLIN) and set OUT_WAIT again if needed.

But there's a case this should actually fix, even though I've never seen it happening in practice: what if we *don't* read new data from the other side, and we can't empty the pipe in one EPOLLOUT shot anyway?

I hadn't considered that before but if the receiver is slow enough that's probably possible.

...

The condition on whether we need write side wakeups is actually fairly simple: we need them if and only if we return to the main loop with data in the pipe. Maintain that in a single place - right after we exit the forwarding loop in tcp_splice_forward().

Signed-off-by: David Gibson --- tcp_splice.c | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/tcp_splice.c b/tcp_splice.c index 42902684..5f412584 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -531,19 +531,22 @@ static int tcp_splice_forward(struct ctx *c, conn->pending[fromsidei] += readlen > 0 ? readlen : 0; conn->pending[fromsidei] -= written > 0 ? written : 0;

- if (written < 0) { - if (!conn->pending[fromsidei]) - break; - - conn_event(conn, OUT_WAIT(!fromsidei)); + if (written < 0) break; - }

if (conn->events & FIN_RCVD(fromsidei) && !conn->pending[fromsidei]) break; }

+ /* We need write-side wakeups if and only if we have data in the pipe to + * drain. + */ + if (conn->pending[fromsidei]) + conn_event(conn, OUT_WAIT(!fromsidei)); + else + conn_event(conn, ~OUT_WAIT(!fromsidei)); + if ((conn->events & FIN_RCVD(fromsidei)) && !(conn->events & FIN_SENT(!fromsidei)) && !conn->pending[fromsidei]) { @@ -606,7 +609,6 @@ void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref, if (events & EPOLLOUT) { if (tcp_splice_forward(c, conn, !evsidei, now)) goto reset; - conn_event(conn, ~OUT_WAIT(evsidei)); }

if (events & (EPOLLIN | EPOLLRDHUP)) {

-- Stefano

Stefano Brivio

6:41 a.m.

New subject: [PATCH 7/8] tcp_splice: Remove questionable "optimisation" of pending bytes tracking

On Thu, 28 May 2026 15:02:12 +1000 David Gibson wrote:

...

We have a special path that avoids updating conn->pending when the amounts read and written are equal. This has a conceptual complexity cost, in particular, it means that conn->pending[] is not accurate to its normal meaning for a section of the loop body.

conn->pending[] shares a cacheline with conn->pipe[] and conn->s[], so it's almost certainly cache-hot. It's questionable that avoiding the update of pending even outweighs the extra conditional branch, let alone saves anything of significance. Remove it.

I added this when we still had 64-bit counters, that is, two weeks before commit 37c228ada88b ("tap, tcp, udp, icmp: Cut down on some oversized buffers"), but now, as you point out, it doesn't make sense anymore.

-- Stefano

Stefano Brivio

6:41 a.m.

New subject: [PATCH 8/8] tcp_splice: Exit forwarding earlier when stalled read side

On Thu, 28 May 2026 15:02:13 +1000 David Gibson wrote:

...

At the end of our loop we have a conditional 'break' that exits if we're at EOF on the read side and have nothing left in the pipe. This doesn't depend on anything write-side, so we can move it earlier, avoiding an unnecessary write side splice in this case.

Furthermore, there's also nothing to be done write side if we've hit EAGAIN on the read side and the pipe is empty, so exit early for that case as well.

Signed-off-by: David Gibson --- tcp_splice.c | 18 +++++++++++------- 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/tcp_splice.c b/tcp_splice.c index 565596d3..623ca926 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -497,9 +497,17 @@ static int tcp_splice_forward(struct ctx *c,

flow_trace(conn, "%zi from read-side call", readlen);

- if (!readlen) { - conn_event(conn, FIN_RCVD(fromsidei)); - } else if (readlen > 0) { + if (readlen <= 0) { + if (!readlen) /* EOF */ + conn_event(conn, FIN_RCVD(fromsidei)); + + /* We're either blocked or at EOF on the read side, and + * there's nothing in the pipe so there's nothing to do + * write side either. + */ + if (!conn->pending[fromsidei]) + break; + } else { conn->pending[fromsidei] += readlen;

if (readlen >= (long)c->tcp.pipe_size * 90 / 100) @@ -530,10 +538,6 @@ static int tcp_splice_forward(struct ctx *c, break;

conn->pending[fromsidei] -= written; - - if (conn->events & FIN_RCVD(fromsidei) && - !conn->pending[fromsidei]) - break;

The rest of the series looks good to me and I'm running tests now before pushing, but I can't convince myself of the correctness of this change.

The first part makes sense as an additional condition to exit the loop and avoid an additional splice() call that would just return EAGAIN.

But this one is a different condition because it happens to check conn->pending[fromsidei] right after we subtracted 'written' from it, and we know we have no input data anymore, so it avoids a useless (although I think harmless) read-side splice() call in the next iteration of the loop, doesn't it?

...

}

/* We need write-side wakeups if and only if we have data in the pipe to

-- Stefano

David Gibson

7:14 a.m.

New subject: [PATCH 7/8] tcp_splice: Remove questionable "optimisation" of pending bytes tracking

On Thu, Jun 04, 2026 at 06:41:43AM +0200, Stefano Brivio wrote:

...

On Thu, 28 May 2026 15:02:12 +1000 David Gibson wrote:

...
We have a special path that avoids updating conn->pending when the amounts read and written are equal. This has a conceptual complexity cost, in particular, it means that conn->pending[] is not accurate to its normal meaning for a section of the loop body.

conn->pending[] shares a cacheline with conn->pipe[] and conn->s[], so it's almost certainly cache-hot. It's questionable that avoiding the update of pending even outweighs the extra conditional branch, let alone saves anything of significance. Remove it.

I added this when we still had 64-bit counters, that is, two weeks before commit 37c228ada88b ("tap, tcp, udp, icmp: Cut down on some oversized buffers"), but now, as you point out, it doesn't make sense anymore.

Ah, that makes some sense.

-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson

David Gibson

7:26 a.m.

New subject: [PATCH 8/8] tcp_splice: Exit forwarding earlier when stalled read side

On Thu, Jun 04, 2026 at 06:41:47AM +0200, Stefano Brivio wrote:

...

On Thu, 28 May 2026 15:02:13 +1000 David Gibson wrote:

...
At the end of our loop we have a conditional 'break' that exits if we're at EOF on the read side and have nothing left in the pipe. This doesn't depend on anything write-side, so we can move it earlier, avoiding an unnecessary write side splice in this case.

Furthermore, there's also nothing to be done write side if we've hit EAGAIN on the read side and the pipe is empty, so exit early for that case as well.

Signed-off-by: David Gibson --- tcp_splice.c | 18 +++++++++++------- 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/tcp_splice.c b/tcp_splice.c index 565596d3..623ca926 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -497,9 +497,17 @@ static int tcp_splice_forward(struct ctx *c,

flow_trace(conn, "%zi from read-side call", readlen);

- if (!readlen) { - conn_event(conn, FIN_RCVD(fromsidei)); - } else if (readlen > 0) { + if (readlen <= 0) { + if (!readlen) /* EOF */ + conn_event(conn, FIN_RCVD(fromsidei)); + + /* We're either blocked or at EOF on the read side, and + * there's nothing in the pipe so there's nothing to do + * write side either. + */ + if (!conn->pending[fromsidei]) + break; + } else { conn->pending[fromsidei] += readlen;

if (readlen >= (long)c->tcp.pipe_size * 90 / 100) @@ -530,10 +538,6 @@ static int tcp_splice_forward(struct ctx *c, break;

conn->pending[fromsidei] -= written; - - if (conn->events & FIN_RCVD(fromsidei) && - !conn->pending[fromsidei]) - break;

The rest of the series looks good to me and I'm running tests now before pushing, but I can't convince myself of the correctness of this change.

The first part makes sense as an additional condition to exit the loop and avoid an additional splice() call that would just return EAGAIN.

But this one is a different condition because it happens to check conn->pending[fromsidei] right after we subtracted 'written' from it, and we know we have no input data anymore, so it avoids a useless (although I think harmless) read-side splice() call in the next iteration of the loop, doesn't it?

Good point. If there was data in the pipe before the read-splice(), then we get EOF, then we empty the pipe with the write-splice() we know we're done and don't need to read-splice() again.

I'm also pretty sure it's harmless, as long as repeated read-splice()s after EOF return either EOF again or EAGAIN, and I can't really see it doing anything else.

Given that, and the fact you're already testing, I guess it makes more sense to do a fixup on top, rather than respinning?

-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
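The difference under discussion here can be made concrete with a toy model of the loop. It is a sketch only: it assumes the read side reports EOF on every call and that the write side always drains the whole pipe in one go, and read_calls_at_eof() is a hypothetical name. It counts read-side calls with and without the exit test that follows the write.

```c
#include <stdbool.h>

/* Count read-side splice() calls for a connection that is already at
 * EOF on the read side, starting with 'pending' bytes in the pipe.
 *
 * check_after_write selects whether the "EOF and pipe empty" test is
 * also performed right after the write side, as before patch 8/8.
 */
static int read_calls_at_eof(long pending, bool check_after_write)
{
	int calls = 0;

	for (;;) {
		calls++;	/* read side: returns 0 (EOF) each time */

		if (!pending)	/* early exit added by 8/8 */
			break;

		pending = 0;	/* write side: drains the pipe completely */

		if (check_after_write && !pending)
			break;	/* exit test that 8/8 removed */
	}

	return calls;
}
```

With data left in the pipe, the model makes one read-side call when the post-write test is kept and two when it is dropped, which matches the extra (harmless) call described in this exchange.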

David Gibson

7:14 a.m.

New subject: [PATCH 6/8] tcp_splice: Simplify / correct OUT_WAIT flag handling

On Thu, Jun 04, 2026 at 06:41:37AM +0200, Stefano Brivio wrote:

...

On Thu, 28 May 2026 15:02:11 +1000 David Gibson wrote:

...
We set the OUT_WAIT flag if we stop forwarding due to EAGAIN, but there's still data in the pipe. That ensures we wake up when the output socket has room to drain the pipe into.

We clear the OUT_WAIT flag when we complete forwarding on an EPOLLOUT event, but that's not quite right. Even though it's called on an EPOLLOUT, tcp_splice_forward() could, in principle empty the pipe, but also read enough new data from the other side to fill it again. That would set OUT_WAIT internally, but it would be cleared after returning meaning we could miss a necessary wakeup.

The current logic in tcp_splice_sock_handler():

if (events & EPOLLOUT) { if (tcp_splice_forward(c, conn, !evsidei, now)) goto reset; conn_event(conn, ~OUT_WAIT(evsidei)); }

if (events & EPOLLIN) { if (tcp_splice_forward(c, conn, evsidei, now)) goto reset; }

would prevent the case you described, because if we read new data from the other side filling the pipe, we'll hit (events & EPOLLIN) and set OUT_WAIT again if needed.

Nope. The (events & EPOLLIN) is an event on the same socket, forwarding in the opposite direction. The pipe would be refilled by data on the _other_ socket forwarding in the same direction.

Now, _usually_ you'd then get an EPOLLIN on that other socket and that would trigger the wake up. But, this is actually a rare case where we might "miss" an event because we're using level not edge trigger (rather than the other way around). Consider just one direction of flow from socket A to socket B:

1. epoll_wait() returns (just) an EPOLLOUT on socket B, nothing has arrived yet on socket A, so no EPOLLIN there.
2. Data arrives on socket A.
3. We reach tcp_splice_forward(), it empties the pipe, but refills it with the data that arrived in step (2). It happens that this also consumes all the data that arrived in (2) - we got exactly one pipe's worth of data.
4. We return from tcp_splice_forward() and clear OUT_WAIT.
5. We return to the epoll_wait(), but because we already read the data from socket A, and we're using level triggered events, we don't get an EPOLLIN.
6. Space becomes available on socket B, but we don't get an EPOLLOUT, because OUT_WAIT is clear.

...and we're stuck. Unlikely, but possible.
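The six steps above can be sketched as a small state model. This is illustrative only: it tracks byte counts rather than real sockets, ignores pipe capacity limits, and the names min_long() and out_wait_after_pass() are made up for the example. The 'fixed' argument selects between the old behaviour (the handler clears OUT_WAIT after an EPOLLOUT pass) and the behaviour from patch 6/8 (OUT_WAIT follows the pending count on exit).

```c
#include <stdbool.h>

static long min_long(long a, long b)
{
	return a < b ? a : b;
}

/* Model one EPOLLOUT-triggered forwarding pass from socket A to B.
 *
 * pending:  bytes in the pipe when the pass starts
 * room:     bytes socket B can accept during the pass
 * arriving: bytes that became readable on socket A (step 2)
 *
 * Returns whether OUT_WAIT is set when we go back to epoll_wait().
 */
static bool out_wait_after_pass(long pending, long room, long arriving,
				bool fixed)
{
	bool out_wait;
	long n;

	/* Drain the pipe into socket B as far as B has room */
	n = min_long(pending, room);
	pending -= n;
	room -= n;

	/* Refill the pipe from socket A, then try the write side again */
	pending += arriving;
	n = min_long(pending, room);
	pending -= n;

	/* Wakeups are needed if and only if data is left in the pipe */
	out_wait = pending > 0;

	/* Old handler: clear the flag unconditionally after the pass */
	if (!fixed)
		out_wait = false;

	return out_wait;
}
```

With pending, room and arriving all equal to one pipe's worth, the old variant returns with OUT_WAIT clear although the pipe is full, which is step 6. The same happens in the slow-receiver case where nothing new arrives and the room is smaller than the pending data.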

...

But there's a case this should actually fix, even though I've never seen it happening in practice: what if we *don't* read new data from the other side, and we can't empty the pipe in one EPOLLOUT shot anyway?

I hadn't considered that before but if the receiver is slow enough that's probably possible.

True, that's probably more likely than the scenario above, actually.

...

...
The condition on whether we need write side wakeups is actually fairly simple: we need them if and only if we return to the main loop with data in the pipe. Maintain that in a single place - right after we exit the forwarding loop in tcp_splice_forward().

Signed-off-by: David Gibson --- tcp_splice.c | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/tcp_splice.c b/tcp_splice.c index 42902684..5f412584 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -531,19 +531,22 @@ static int tcp_splice_forward(struct ctx *c, conn->pending[fromsidei] += readlen > 0 ? readlen : 0; conn->pending[fromsidei] -= written > 0 ? written : 0;

- if (written < 0) { - if (!conn->pending[fromsidei]) - break; - - conn_event(conn, OUT_WAIT(!fromsidei)); + if (written < 0) break; - }

if (conn->events & FIN_RCVD(fromsidei) && !conn->pending[fromsidei]) break; }

+ /* We need write-side wakeups if and only if we have data in the pipe to + * drain. + */ + if (conn->pending[fromsidei]) + conn_event(conn, OUT_WAIT(!fromsidei)); + else + conn_event(conn, ~OUT_WAIT(!fromsidei)); + if ((conn->events & FIN_RCVD(fromsidei)) && !(conn->events & FIN_SENT(!fromsidei)) && !conn->pending[fromsidei]) { @@ -606,7 +609,6 @@ void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref, if (events & EPOLLOUT) { if (tcp_splice_forward(c, conn, !evsidei, now)) goto reset; - conn_event(conn, ~OUT_WAIT(evsidei)); }

if (events & (EPOLLIN | EPOLLRDHUP)) {

-- Stefano

-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson

Stefano Brivio

7:44 a.m.

New subject: [PATCH 8/8] tcp_splice: Exit forwarding earlier when stalled read side

On Thu, 4 Jun 2026 15:26:36 +1000 David Gibson wrote:

...

On Thu, Jun 04, 2026 at 06:41:47AM +0200, Stefano Brivio wrote:

...
On Thu, 28 May 2026 15:02:13 +1000 David Gibson wrote:

...
At the end of our loop we have a conditional 'break' that exits if we're at EOF on the read side and have nothing left in the pipe. This doesn't depend on anything write-side, so we can move it earlier, avoiding an unnecessary write side splice in this case.

Furthermore, there's also nothing to be done write side if we've hit EAGAIN on the read side and the pipe is empty, so exit early for that case as well.

Signed-off-by: David Gibson
---
 tcp_splice.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/tcp_splice.c b/tcp_splice.c
index 565596d3..623ca926 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -497,9 +497,17 @@ static int tcp_splice_forward(struct ctx *c,

 		flow_trace(conn, "%zi from read-side call", readlen);

-		if (!readlen) {
-			conn_event(conn, FIN_RCVD(fromsidei));
-		} else if (readlen > 0) {
+		if (readlen <= 0) {
+			if (!readlen) /* EOF */
+				conn_event(conn, FIN_RCVD(fromsidei));
+
+			/* We're either blocked or at EOF on the read side, and
+			 * there's nothing in the pipe so there's nothing to do
+			 * write side either.
+			 */
+			if (!conn->pending[fromsidei])
+				break;
+		} else {
 			conn->pending[fromsidei] += readlen;

 			if (readlen >= (long)c->tcp.pipe_size * 90 / 100)
@@ -530,10 +538,6 @@ static int tcp_splice_forward(struct ctx *c,
 				break;

 		conn->pending[fromsidei] -= written;
-
-		if (conn->events & FIN_RCVD(fromsidei) &&
-		    !conn->pending[fromsidei])
-			break;

The rest of the series looks good to me and I'm running tests now before pushing, but I can't convince myself of the correctness of this change.

The first part makes sense as an additional condition to exit the loop and avoid an additional splice() call that would just return EAGAIN.

But this one is a different condition because it happens to check conn->pending[fromsidei] right after we subtracted 'written' from it, and we know we have no input data anymore, so it avoids a useless (although I think harmless) read-side splice() call in the next iteration of the loop, doesn't it?

Good point. If there was data in the pipe before the read-splice(), then we get EOF, and then we empty the pipe with the write-splice(), we know we're done and don't need to read-splice() again.

I'm also pretty sure it's harmless, as long as repeated read-splice()s after EOF return either EOF again or EAGAIN, and I can't really see it doing anything else.

Given that, and the fact you're already testing, I guess it makes more sense to do a fixup on top, rather than respinning?
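
The assumption that a read-side splice() repeated after EOF keeps reporting EOF (or EAGAIN) can be checked with a small standalone helper. This is only a sketch under stated assumptions: it uses an AF_UNIX socketpair in place of the TCP sockets tcp_splice.c handles, so it illustrates the behaviour rather than proving it for TCP, and the helper name splice_after_eof is made up for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Write 'len' bytes of 'msg' to one end of a socketpair, close that end, then
 * splice() from the other end into a pipe 'n' times. results[i] receives the
 * return value of call i, or -errno on failure.
 *
 * Returns 0 if the setup worked, -1 otherwise.
 */
static int splice_after_eof(const char *msg, size_t len, ssize_t *results,
			    int n)
{
	int sv[2], pipefd[2], i, ret = -1;

	assert(n > 0);

	if (socketpair(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK, 0, sv))
		return -1;
	if (pipe2(pipefd, O_NONBLOCK))
		goto out_sock;

	if (write(sv[1], msg, len) != (ssize_t)len)
		goto out_pipe;

	/* Peer closes: the reader sees the data first, then EOF */
	close(sv[1]);
	sv[1] = -1;

	for (i = 0; i < n; i++) {
		ssize_t rc = splice(sv[0], NULL, pipefd[1], NULL, 65536,
				    SPLICE_F_MOVE | SPLICE_F_NONBLOCK);

		results[i] = rc < 0 ? -errno : rc;
	}
	ret = 0;

out_pipe:
	close(pipefd[0]);
	close(pipefd[1]);
out_sock:
	close(sv[0]);
	if (sv[1] >= 0)
		close(sv[1]);
	return ret;
}
```

Called with a three-byte message and n = 3, the first call should move the three bytes and the following two should both return 0, which is the "harmless" repeated read discussed here.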

But the commit message of this one would still be misleading because we're not actually "moving" the condition, and the fix-up would just add back these four lines (I guess?), so I think it would be more practical if you could send a new version of 8/8 only (you can base it on 7/8).

-- Stefano

David Gibson

9:08 a.m.

New subject: [PATCH 8/8] tcp_splice: Exit forwarding earlier when stalled read side

On Thu, Jun 04, 2026 at 07:44:56AM +0200, Stefano Brivio wrote:

But the commit message of this one would still be misleading because we're not actually "moving" the condition, and the fix-up would just add back these four lines (I guess?), so I think it would be more practical if you could send a new version of 8/8 only (you can base it on 7/8).

Sure, will do.

David Gibson

5 Jun 5 Jun

2:34 a.m.

New subject: [PATCH v2 8/8] tcp_splice: Improve EOF and read stall exit conditions

At the end of our loop we have a conditional 'break' that exits if we're at EOF on the read side and have nothing left in the pipe. This makes sense: at EOF there's nothing left to do read-side and with nothing in the pipe there's nothing to do write side either.

The same is true if the read side hit an EAGAIN and the pipe is empty: there's nothing we can do (for now) read side, and with an empty pipe nothing write side either. So, generalise the condition to exit on either EOF or EAGAIN read side.

Furthermore, if the read side is at EOF or EAGAIN and there's already nothing in the pipe before the write-side splice(), then that write side splice() can't accomplish anything, so exit the loop early in that case avoiding a harmless but unnecessary write-splice().

Signed-off-by: David Gibson
---
 tcp_splice.c | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

Changes in v2:
 * Duplicate rather than move the test, it's valuable in both places.
 * Make comments and commit message clearer

diff --git a/tcp_splice.c b/tcp_splice.c
index 565596d3..1e3c7749 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -497,9 +497,17 @@ static int tcp_splice_forward(struct ctx *c,

 		flow_trace(conn, "%zi from read-side call", readlen);

-		if (!readlen) {
-			conn_event(conn, FIN_RCVD(fromsidei));
-		} else if (readlen > 0) {
+		if (readlen <= 0) {
+			if (!readlen) /* EOF */
+				conn_event(conn, FIN_RCVD(fromsidei));
+
+			/* We're either blocked or at EOF on the read side, and
+			 * there's nothing in the pipe so there's nothing to do
+			 * write side either.
+			 */
+			if (!conn->pending[fromsidei])
+				break;
+		} else {
 			conn->pending[fromsidei] += readlen;

 			if (readlen >= (long)c->tcp.pipe_size * 90 / 100)
@@ -531,9 +539,11 @@ static int tcp_splice_forward(struct ctx *c,

 		conn->pending[fromsidei] -= written;

-		if (conn->events & FIN_RCVD(fromsidei) &&
-		    !conn->pending[fromsidei])
+		if (!conn->pending[fromsidei] && readlen <= 0) {
+			/* Read side is EOF or EAGAIN, and we emptied the pipe.
+			 * No more we can do for now, */
 			break;
+		}
 	}

 	/* We need write-side wakeups if and only if we have data in the pipe to
-- 
2.54.0
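
To see what each of the two exit checks saves, here is a small model of the loop that counts read-side and write-side calls. It is a sketch, not the real function: I/O results are scripted through a hypothetical struct step, the names model_forward and struct outcome are invented for this example, and EINTR, error returns and the SPLICE_F_MORE heuristic are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Scripted result of one loop iteration (hypothetical, for the model only) */
struct step {
	ssize_t readlen;	/* > 0: bytes read, 0: EOF, < 0: EAGAIN */
	ssize_t written;	/* > 0: bytes written, <= 0: write side blocked */
};

struct outcome {
	int reads;	/* read-side calls made */
	int writes;	/* write-side calls made */
	size_t pending;	/* bytes left in the pipe on exit */
};

/* Run the scripted steps, starting with 'pending' bytes already in the pipe */
static struct outcome model_forward(const struct step *steps, int nsteps,
				    size_t pending)
{
	struct outcome o = { 0, 0, pending };
	int i;

	assert(steps || !nsteps);

	for (i = 0; i < nsteps; i++) {
		ssize_t readlen = steps[i].readlen;
		ssize_t written = steps[i].written;

		o.reads++;
		if (readlen <= 0) {
			/* Read side stalled, pipe empty: skip the write */
			if (!o.pending)
				break;
		} else {
			o.pending += (size_t)readlen;
		}

		o.writes++;
		if (written <= 0)
			break;	/* write side blocked, wait for a wakeup */
		if ((size_t)written > o.pending)
			written = (ssize_t)o.pending;	/* keep model sane */
		o.pending -= (size_t)written;

		/* Read side stalled, pipe now drained: skip the next read */
		if (!o.pending && readlen <= 0)
			break;
	}

	return o;
}
```

With an empty pipe and an EAGAIN on the first read, the model makes one read and no write. With 100 bytes already pending, EOF on the read and a write that drains all 100 bytes, it makes one read and one write, and the second check prevents a further read.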

Stefano Brivio

8:28 a.m.

On Thu, 28 May 2026 15:02:05 +1000 David Gibson wrote:

...

As we discussed on an earlier call, while fixing bug 202 I noticed a number of warts in the surrounding splice() forwarding code. Most are just things that are longer or harder to follow than they need to be, but in some cases there may be real (if unlikely to trigger) bugs. Here's a collection of fixes.

David Gibson (8):
  tcp_splice: Remove never-invoked SO_RCVLOWAT logic
  tcp_splice: Simplify EPOLLRDHUP / eof / FIN handling
  tcp_splice: Improve EOF exit condition for the loop
  tcp_splice: Remove goto from forwarding loop
  tcp_splice: Simplify shutdown(2) handling
  tcp_splice: Simplify / correct OUT_WAIT flag handling
  tcp_splice: Remove questionable "optimisation" of pending bytes tracking

Applied up to here (7/8).

...

tcp_splice: Exit forwarding earlier when stalled read side

tcp_splice.c | 100 ++++++++++++++++----------------------------------- 1 file changed, 31 insertions(+), 69 deletions(-)

-- Stefano

Stefano Brivio

12:59 p.m.

New subject: [PATCH v2 8/8] tcp_splice: Improve EOF and read stall exit conditions

On Fri, 5 Jun 2026 10:34:16 +1000 David Gibson wrote:

...

-		if (conn->events & FIN_RCVD(fromsidei) &&
-		    !conn->pending[fromsidei])
+		if (!conn->pending[fromsidei] && readlen <= 0) {
+			/* Read side is EOF or EAGAIN, and we emptied the pipe.
+			 * No more we can do for now, */

Changed "now," to "now.", added empty comment line, applied.

 			break;
+		}
 	}

 	/* We need write-side wakeups if and only if we have data in the pipe to

-- Stefano

David Gibson

6:26 p.m.

New subject: [PATCH v2 8/8] tcp_splice: Improve EOF and read stall exit conditions

On Fri, Jun 05, 2026 at 12:59:42PM +0200, Stefano Brivio wrote:

Changed "now," to "now.", added empty comment line, applied.

Oops, thanks.
