In commit e5eefe77435a ("tcp: Refactor to use events instead of states,
split out spliced implementation"), this:

	if (!bitmap_isset(rcvlowat_set, conn - ts) &&
	    readlen > (long)c->tcp.pipe_size / 10) {

(note the !) became:

	if (conn->flags & lowat_set_flag &&
	    readlen > (long)c->tcp.pipe_size / 10) {

in the new tcp_splice_sock_handler().

We want to check, there, if we should set SO_RCVLOWAT, only if we
haven't set it already. But, instead, we're checking if it's already
set before we set it, so we'll never set it, of course.

Fix the check and re-enable the functionality, which should give us
improved CPU utilisation in non-interactive cases where we are not
transferring at full pipe capacity.

Fixes: e5eefe77435a ("tcp: Refactor to use events instead of states, split out spliced implementation")
Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>
---
 tcp_splice.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tcp_splice.c b/tcp_splice.c
index 8a39a6f..5d845c9 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -556,7 +556,7 @@ eintr:
 		if (readlen >= (long)c->tcp.pipe_size * 10 / 100)
 			continue;
 
-		if (conn->flags & lowat_set_flag &&
+		if (!(conn->flags & lowat_set_flag) &&
 		    readlen > (long)c->tcp.pipe_size / 10) {
 			int lowat = c->tcp.pipe_size / 4;
-- 
2.43.0
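For readers outside the thread, the intended pattern is easier to see in
isolation. Below is a minimal, self-contained sketch of what the corrected
check does: set SO_RCVLOWAT at most once per connection, guarded by a flag.
The struct, the RCVLOWAT_SET flag and the helper name are illustrative
stand-ins, not passt identifiers; only setsockopt(SO_RCVLOWAT) itself is the
real mechanism.

	/* Sketch only: set SO_RCVLOWAT once, guarded by an "already set"
	 * flag, as the fixed condition intends. Not passt code.
	 */
	#include <sys/socket.h>
	#include <stdint.h>

	#define RCVLOWAT_SET	(1 << 0)	/* stand-in for lowat_set_flag */

	struct conn_sketch {
		int s;			/* socket we read from */
		uint32_t flags;
	};

	static void maybe_set_rcvlowat(struct conn_sketch *conn,
				       long pipe_size, long readlen)
	{
		/* Note the negation: act only if we have NOT set it yet.
		 * This is exactly what commit e5eefe77435a dropped.
		 */
		if (!(conn->flags & RCVLOWAT_SET) &&
		    readlen > pipe_size / 10) {
			int lowat = pipe_size / 4;

			/* Don't wake us up for reads until at least
			 * pipe_size / 4 bytes are available.
			 */
			if (!setsockopt(conn->s, SOL_SOCKET, SO_RCVLOWAT,
					&lowat, sizeof(lowat)))
				conn->flags |= RCVLOWAT_SET;
		}
	}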
If we set the OUT_WAIT_* flag (waiting on EPOLLOUT) for a side of a
given flow, it means that we're blocked, waiting for the receiver to
actually receive data, with a full pipe.

In that case, if we keep EPOLLIN set for the socket on the other side
(our receiving side), we'll get into a loop such as:

41.0230: pasta: epoll event on connected spliced TCP socket 108 (events: 0x00000001)
41.0230: Flow 1 (TCP connection (spliced)): -1 from read-side call
41.0230: Flow 1 (TCP connection (spliced)): -1 from write-side call (passed 8192)
41.0230: Flow 1 (TCP connection (spliced)): event at tcp_splice_sock_handler:577
41.0230: pasta: epoll event on connected spliced TCP socket 108 (events: 0x00000001)
41.0230: Flow 1 (TCP connection (spliced)): -1 from read-side call
41.0230: Flow 1 (TCP connection (spliced)): -1 from write-side call (passed 8192)
41.0230: Flow 1 (TCP connection (spliced)): event at tcp_splice_sock_handler:577

leading to 100% CPU usage, of course.

Drop EPOLLIN on our receiving side as long as we're waiting for output
readiness on the other side.

Link: https://github.com/containers/podman/issues/23686#issuecomment-2661036584
Link: https://www.reddit.com/r/podman/comments/1iph50j/pasta_high_cpu_on_podman_r…
Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>
---
 tcp_splice.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/tcp_splice.c b/tcp_splice.c
index f1a9223..8a39a6f 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -131,8 +131,12 @@ static void tcp_splice_conn_epoll_events(uint16_t events,
 		ev[1].events = EPOLLOUT;
 	}
 
-	flow_foreach_sidei(sidei)
-		ev[sidei].events |= (events & OUT_WAIT(sidei)) ? EPOLLOUT : 0;
+	flow_foreach_sidei(sidei) {
+		if (events & OUT_WAIT(sidei)) {
+			ev[sidei].events |= EPOLLOUT;
+			ev[!sidei].events &= ~EPOLLIN;
+		}
+	}
 }
 
 /**
-- 
2.43.0
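As an aside, the effect of the new branch is easier to follow outside
tcp_splice.c. Here is a minimal sketch of the fixed event calculation,
assuming two sides per flow; OUT_WAIT() and the per-side loop are simplified
stand-ins for the real passt macros, and the EPOLLIN default glosses over
per-state details in the real function:

	#include <sys/epoll.h>
	#include <stdint.h>

	#define SIDES		2
	#define OUT_WAIT(i)	((i) ? (1 << 1) : (1 << 0))	/* stand-in */

	static void conn_epoll_events(uint16_t events,
				      struct epoll_event ev[SIDES])
	{
		unsigned sidei;

		for (sidei = 0; sidei < SIDES; sidei++)
			ev[sidei].events = EPOLLIN;	/* default: poll input */

		for (sidei = 0; sidei < SIDES; sidei++) {
			if (events & OUT_WAIT(sidei)) {
				/* Pipe towards 'sidei' is full: wait for
				 * room to write...
				 */
				ev[sidei].events |= EPOLLOUT;
				/* ...and stop polling the other side for
				 * input we can't deliver anyway, which is
				 * what caused the 100% CPU loop above.
				 */
				ev[!sidei].events &= ~EPOLLIN;
			}
		}
	}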
On Sun, Feb 16, 2025 at 11:12:16PM +0100, Stefano Brivio wrote:
> If we set the OUT_WAIT_* flag (waiting on EPOLLOUT) for a side of a
> given flow, it means that we're blocked, waiting for the receiver to
> actually receive data, with a full pipe.
> 
> In that case, if we keep EPOLLIN set for the socket on the other side
> (our receiving side), we'll get into a loop such as:
> 
> 41.0230: pasta: epoll event on connected spliced TCP socket 108 (events: 0x00000001)
> 41.0230: Flow 1 (TCP connection (spliced)): -1 from read-side call
> 41.0230: Flow 1 (TCP connection (spliced)): -1 from write-side call (passed 8192)
> 41.0230: Flow 1 (TCP connection (spliced)): event at tcp_splice_sock_handler:577
> 41.0230: pasta: epoll event on connected spliced TCP socket 108 (events: 0x00000001)
> 41.0230: Flow 1 (TCP connection (spliced)): -1 from read-side call
> 41.0230: Flow 1 (TCP connection (spliced)): -1 from write-side call (passed 8192)
> 41.0230: Flow 1 (TCP connection (spliced)): event at tcp_splice_sock_handler:577
> 
> leading to 100% CPU usage, of course.
> 
> Drop EPOLLIN on our receiving side as long as we're waiting for output
> readiness on the other side.
> 
> Link: https://github.com/containers/podman/issues/23686#issuecomment-2661036584
> Link: https://www.reddit.com/r/podman/comments/1iph50j/pasta_high_cpu_on_podman_r…
> Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>

Reviewed-by: David Gibson <david(a)gibson.dropbear.id.au>

> ---
>  tcp_splice.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/tcp_splice.c b/tcp_splice.c
> index f1a9223..8a39a6f 100644
> --- a/tcp_splice.c
> +++ b/tcp_splice.c
> @@ -131,8 +131,12 @@ static void tcp_splice_conn_epoll_events(uint16_t events,
>  		ev[1].events = EPOLLOUT;
>  	}
>  
> -	flow_foreach_sidei(sidei)
> -		ev[sidei].events |= (events & OUT_WAIT(sidei)) ? EPOLLOUT : 0;
> +	flow_foreach_sidei(sidei) {
> +		if (events & OUT_WAIT(sidei)) {
> +			ev[sidei].events |= EPOLLOUT;
> +			ev[!sidei].events &= ~EPOLLIN;
> +		}
> +	}
>  }
>  
>  /**

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson
On Sun, Feb 16, 2025 at 11:12:15PM +0100, Stefano Brivio wrote:
> In commit e5eefe77435a ("tcp: Refactor to use events instead of states,
> split out spliced implementation"), this:
> 
> 	if (!bitmap_isset(rcvlowat_set, conn - ts) &&
> 	    readlen > (long)c->tcp.pipe_size / 10) {
> 
> (note the !) became:
> 
> 	if (conn->flags & lowat_set_flag &&
> 	    readlen > (long)c->tcp.pipe_size / 10) {
> 
> in the new tcp_splice_sock_handler().
> 
> We want to check, there, if we should set SO_RCVLOWAT, only if we
> haven't set it already. But, instead, we're checking if it's already
> set before we set it, so we'll never set it, of course.
> 
> Fix the check and re-enable the functionality, which should give us
> improved CPU utilisation in non-interactive cases where we are not
> transferring at full pipe capacity.
> 
> Fixes: e5eefe77435a ("tcp: Refactor to use events instead of states, split out spliced implementation")
> Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>

Ouch.

Reviewed-by: David Gibson <david(a)gibson.dropbear.id.au>

At least insofar as this clearly corrects towards the intended
behaviour. Given that we inadvertently haven't been using RCVLOWAT for
so long, I am a bit worried that this might expose deadlocks or
stalls. But, I guess we debug that when we come to it.

> ---
>  tcp_splice.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tcp_splice.c b/tcp_splice.c
> index 8a39a6f..5d845c9 100644
> --- a/tcp_splice.c
> +++ b/tcp_splice.c
> @@ -556,7 +556,7 @@ eintr:
>  		if (readlen >= (long)c->tcp.pipe_size * 10 / 100)
>  			continue;
>  
> -		if (conn->flags & lowat_set_flag &&
> +		if (!(conn->flags & lowat_set_flag) &&
>  		    readlen > (long)c->tcp.pipe_size / 10) {
>  			int lowat = c->tcp.pipe_size / 4;

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson
On Mon, 17 Feb 2025 14:49:39 +1100
David Gibson <david(a)gibson.dropbear.id.au> wrote:

> On Sun, Feb 16, 2025 at 11:12:15PM +0100, Stefano Brivio wrote:
> > In commit e5eefe77435a ("tcp: Refactor to use events instead of states,
> > split out spliced implementation"), this:
> > 
> > 	if (!bitmap_isset(rcvlowat_set, conn - ts) &&
> > 	    readlen > (long)c->tcp.pipe_size / 10) {
> > 
> > (note the !) became:
> > 
> > 	if (conn->flags & lowat_set_flag &&
> > 	    readlen > (long)c->tcp.pipe_size / 10) {
> > 
> > in the new tcp_splice_sock_handler().
> > 
> > We want to check, there, if we should set SO_RCVLOWAT, only if we
> > haven't set it already. But, instead, we're checking if it's already
> > set before we set it, so we'll never set it, of course.
> > 
> > Fix the check and re-enable the functionality, which should give us
> > improved CPU utilisation in non-interactive cases where we are not
> > transferring at full pipe capacity.
> > 
> > Fixes: e5eefe77435a ("tcp: Refactor to use events instead of states, split out spliced implementation")
> > Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>
> 
> Ouch.
> 
> Reviewed-by: David Gibson <david(a)gibson.dropbear.id.au>
> 
> At least insofar as this clearly corrects towards the intended
> behaviour. Given that we inadvertently haven't been using RCVLOWAT for
> so long, I am a bit worried that this might expose deadlocks or
> stalls. But, I guess we debug that when we come to it.

Yeah, I was undecided as well, then I tested and tested, and I realised
that commit 904b86ade7db ("tcp: Rework window handling, timers, add
SO_RCVLOWAT and pools for sockets/pipes") added this gem, still there:

		if (read >= (long)c->tcp.pipe_size * 10 / 100)
			continue;

		if (!bitmap_isset(rcvlowat_set, conn - ts) &&
		    read > (long)c->tcp.pipe_size / 10) {
			int lowat = c->tcp.pipe_size / 4;

			setsockopt(move_from, SOL_SOCKET, SO_RCVLOWAT,
				   &lowat, sizeof(lowat));

which means that we'll not set SO_RCVLOWAT anyway, because if
read > c->tcp.pipe_size / 10, we'll skip the second block, and if not,
we'll skip it anyway.

Now, I have a clear memory of characterising those 10% and 25% values
over a wide range of pipe sizes, message sizes, etc. Other than SSH and
installing packages in a container (and checking that nothing gets
stuck for ~one second), my basic test idea was:

  $ iperf3 -c localhost -p 5202 -l 100

with port 5202 forwarded to the network namespace and the iperf3 server
listening there.

I think I added another typo while cleaning up. I probably meant:

	[...]
	    read < (long)c->tcp.pipe_size / 10) {

...and if I do that, it finally works. But anyway, it's too many typos
at this point and especially we never had a release with it enabled, so
I'm not "fixing" this for the moment, it needs a lot more testing than
I can do now.

-- 
Stefano
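Stefano's point that the branch was dead code all along can be checked
mechanically: floor(n * 10 / 100) equals floor(n / 10) for any non-negative
n, so both conditions compare against the same 10% threshold, and any read
that survives the continue is already too small. A throwaway test program
(the pipe size below is an arbitrary assumption, not a passt value)
demonstrates it:

	#include <assert.h>
	#include <stdio.h>

	int main(void)
	{
		long pipe_size = 256 * 1024;	/* assumed; any size works */
		long readlen;

		/* Both expressions compute the same 10% threshold */
		assert(pipe_size * 10 / 100 == pipe_size / 10);

		for (readlen = 0; readlen <= pipe_size; readlen++) {
			if (readlen >= pipe_size * 10 / 100)
				continue;	/* first check skips it... */

			/* ...so the second can never be true here */
			assert(!(readlen > pipe_size / 10));
		}

		printf("SO_RCVLOWAT branch is unreachable, as described\n");
		return 0;
	}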