[PATCH] flow: Set EPOLLFD_ID_DEFAULT on newly allocated flows, not EPOLLFD_ID_INVALID
We're somehow hitting:
ASSERTION FAILED in flow_epollfd (flow.c:362): f->epollid < ((1 << 8) - 1)
on an inbound spliced connection, with a single forwarded port, an
HTTP server in a Podman container, and a GET request. Reproducer at
https://bodhi.fedoraproject.org/updates/FEDORA-2025-93b4eb64c3#comment-44734...
printf 'FROM registry.fedoraproject.org/fedora:latest\nRUN /usr/bin/dnf install -y httpd\nEXPOSE 80\nCMD ["-D", "FOREGROUND"]\nENTRYPOINT ["/usr/sbin/httpd"]\n' > Containerfile
podman build -t fedora-httpd $(pwd)
podman run -d -p 8080:80 localhost/fedora-httpd
curl http://localhost:8080
My guess is that, for some reason, we don't set EPOLLFD_ID_DEFAULT
early enough on inbound spliced sockets, so we get a socket event
while EPOLLFD_ID_INVALID is still set.
As we're not really using epoll identifiers yet, set
EPOLLFD_ID_DEFAULT right away on newly allocated flows, while we
figure this out.
Link: https://bodhi.fedoraproject.org/updates/FEDORA-2025-93b4eb64c3#comment-44734...
Signed-off-by: Stefano Brivio
On Mon, 8 Dec 2025 22:28:22 +0100, Stefano Brivio wrote:
---
I just merged this, posting for awareness / review.
Ah, never mind, this makes it worse somehow:
5.6384: Flow 0 (TCP connection (spliced)): SPLICE_CONNECT
5.6384: Flow 0 (TCP connection (spliced)): ERROR on epoll_ctl(): No such file or directory
...still looking for a workaround / fix.
--
Stefano
On Mon, Dec 08, 2025 at 10:54:00PM +0100, Stefano Brivio wrote:
5.6384: Flow 0 (TCP connection (spliced)): SPLICE_CONNECT
5.6384: Flow 0 (TCP connection (spliced)): ERROR on epoll_ctl(): No such file or directory
Does this imply you managed to reproduce it locally? You hadn't as of your comment a few after the linked one. I haven't managed to reproduce this either.
--
David Gibson (he or they)      | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
                               | around.
http://www.ozlabs.org/~dgibson
On Tue, 9 Dec 2025 10:36:01 +1100, David Gibson wrote:
Does this imply you managed to reproduce locally? You hadn't as of your comment a few after the one linked. I also haven't managed to reproduce this.
Just simulate an error (anything other than EINPROGRESS) on connect() in tcp_splice_connect(). Patch coming.
On Mon, Dec 08, 2025 at 10:54:00PM +0100, Stefano Brivio wrote:
5.6384: Flow 0 (TCP connection (spliced)): SPLICE_CONNECT
5.6384: Flow 0 (TCP connection (spliced)): ERROR on epoll_ctl(): No such file or directory
This makes sense: epollid != EPOLLFD_ID_INVALID indicates that the flow's fds are already in the epoll set (flow_in_epoll() will return true). With epollid initialised to EPOLLFD_ID_DEFAULT, we'll attempt EPOLL_CTL_MOD on the very first tcp_splice_epoll_ctl() call, having never added the fds to the epoll set, hence this error.
Could the flow - for some other reason - be closing almost immediately, before it even adds itself to the epoll? If that's the case, we could potentially trigger this in the (flag == CLOSING) section of conn_flag_do(). I haven't managed to reproduce, so I can't test this myself.
On Tue, 9 Dec 2025 11:01:27 +1100, David Gibson wrote:
Could the flow - for some other reason - be closing almost immediately, before it even adds itself to the epoll? If that's the case, we could potentially trigger this in the (flag == CLOSING) section of conn_flag_do().
Yes, that's what happens; see my previous email.