[PATCH 0/6] RFC: Clean up tap-side event handling

newer
[PATCH] tcp: Use EPOLLET for any...

David Gibson

3 Sep 2024 3 Sep '24

2:02 p.m.

This is a draft patch working towards adding EPOLLOUT handling to the tap code, which could then be used to "unstick" flows which have unsent data from the socket side. For now that's just a stub, but makes what I think are some worthwhile cleanups to the tap side event handling in the meantime. David Gibson (6): tap: Split out handling of EPOLLIN events tap: Improve handling of EINTR in tap_passt_input() tap: Restructure in tap_pasta_input() tap: Don't risk truncating frames on full buffer in tap_pasta_input() tap: Re-introduce EPOLLET for tap connections tap: Stub EPOLLOUT handling tap.c | 131 +++++++++++++++++++++++++++++++++++++++------------------- 1 file changed, 88 insertions(+), 43 deletions(-) -- 2.46.0

Show replies by date

David Gibson

3 Sep 3 Sep

2:02 p.m.

New subject: [PATCH 1/6] tap: Split out handling of EPOLLIN events

Currently, tap_handler_pas{st,ta}() check for EPOLLRDHUP, EPOLLHUP and EPOLLERR events, then assume anything left is EPOLLIN. We have some future cases that may want to also handle EPOLLOUT, so in preparation explicitly handle EPOLLIN, moving the logic to a subfunction. Signed-off-by: David Gibson --- tap.c | 50 ++++++++++++++++++++++++++++++++++++-------------- 1 file changed, 36 insertions(+), 14 deletions(-) diff --git a/tap.c b/tap.c index 852d8376..14c88871 100644 --- a/tap.c +++ b/tap.c @@ -982,24 +982,17 @@ static void tap_sock_reset(struct ctx *c) } /** - * tap_handler_passt() - Packet handler for AF_UNIX file descriptor + * tap_passt_input() - Handler for new data on the socket to qemu * @c: Execution context - * @events: epoll events * @now: Current timestamp */ -void tap_handler_passt(struct ctx *c, uint32_t events, - const struct timespec *now) +static void tap_passt_input(struct ctx *c, const struct timespec *now) { static const char *partial_frame; static ssize_t partial_len = 0; ssize_t n; char *p; - if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR)) { - tap_sock_reset(c); - return; - } - tap_flush_pools(); if (partial_len) { @@ -1052,20 +1045,33 @@ void tap_handler_passt(struct ctx *c, uint32_t events, } /** - * tap_handler_pasta() - Packet handler for /dev/net/tun file descriptor + * tap_handler_passt() - Event handler for AF_UNIX file descriptor * @c: Execution context * @events: epoll events * @now: Current timestamp */ -void tap_handler_pasta(struct ctx *c, uint32_t events, +void tap_handler_passt(struct ctx *c, uint32_t events, const struct timespec *now) +{ + if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR)) { + tap_sock_reset(c); + return; + } + + if (events & EPOLLIN) + tap_passt_input(c, now); +} + +/** + * tap_passt_input() - Handler for new data on the socket to qemu + * @c: Execution context + * @now: Current timestamp + */ +static void tap_pasta_input(struct ctx *c, const struct timespec *now) { ssize_t n, len; int ret; - if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR)) - die("Disconnect event on /dev/net/tun device, exiting"); - redo: n = 0; @@ -1102,6 +1108,22 @@ restart: die("Error on tap device, exiting"); } +/** + * tap_handler_pasta() - Packet handler for /dev/net/tun file descriptor + * @c: Execution context + * @events: epoll events + * @now: Current timestamp + */ +void tap_handler_pasta(struct ctx *c, uint32_t events, + const struct timespec *now) +{ + if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR)) + die("Disconnect event on /dev/net/tun device, exiting"); + + if (events & EPOLLIN) + tap_pasta_input(c, now); +} + /** * tap_sock_unix_open() - Create and bind AF_UNIX socket * @sock_path: Socket path. If empty, set on return (UNIX_SOCK_PATH as prefix) -- 2.46.0

Stefano Brivio

9:25 p.m.

New subject: [PATCH 1/6] tap: Split out handling of EPOLLIN events

On Tue, 3 Sep 2024 22:02:30 +1000 David Gibson wrote:

...

Currently, tap_handler_pas{st,ta}() check for EPOLLRDHUP, EPOLLHUP and EPOLLERR events, then assume anything left is EPOLLIN. We have some future cases that may want to also handle EPOLLOUT, so in preparation explicitly handle EPOLLIN, moving the logic to a subfunction.

Signed-off-by: David Gibson --- tap.c | 50 ++++++++++++++++++++++++++++++++++++-------------- 1 file changed, 36 insertions(+), 14 deletions(-)

diff --git a/tap.c b/tap.c index 852d8376..14c88871 100644 --- a/tap.c +++ b/tap.c @@ -982,24 +982,17 @@ static void tap_sock_reset(struct ctx *c) }

/** - * tap_handler_passt() - Packet handler for AF_UNIX file descriptor + * tap_passt_input() - Handler for new data on the socket to qemu * @c: Execution context - * @events: epoll events * @now: Current timestamp */ -void tap_handler_passt(struct ctx *c, uint32_t events, - const struct timespec *now) +static void tap_passt_input(struct ctx *c, const struct timespec *now) { static const char *partial_frame; static ssize_t partial_len = 0; ssize_t n; char *p;

- if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR)) { - tap_sock_reset(c); - return; - } - tap_flush_pools();

if (partial_len) { @@ -1052,20 +1045,33 @@ void tap_handler_passt(struct ctx *c, uint32_t events, }

/** - * tap_handler_pasta() - Packet handler for /dev/net/tun file descriptor + * tap_handler_passt() - Event handler for AF_UNIX file descriptor * @c: Execution context * @events: epoll events * @now: Current timestamp */ -void tap_handler_pasta(struct ctx *c, uint32_t events, +void tap_handler_passt(struct ctx *c, uint32_t events, const struct timespec *now) +{ + if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR)) { + tap_sock_reset(c); + return; + } + + if (events & EPOLLIN) + tap_passt_input(c, now); +} + +/** + * tap_passt_input() - Handler for new data on the socket to qemu

tap_pasta_input(), QEMU (we could just call it hypervisor, given that libkrun/krun also use this path).

...

+ * @c: Execution context + * @now: Current timestamp + */ +static void tap_pasta_input(struct ctx *c, const struct timespec *now) { ssize_t n, len; int ret;

- if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR)) - die("Disconnect event on /dev/net/tun device, exiting"); - redo: n = 0;

@@ -1102,6 +1108,22 @@ restart: die("Error on tap device, exiting"); }

+/** + * tap_handler_pasta() - Packet handler for /dev/net/tun file descriptor + * @c: Execution context + * @events: epoll events + * @now: Current timestamp + */ +void tap_handler_pasta(struct ctx *c, uint32_t events, + const struct timespec *now) +{ + if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR)) + die("Disconnect event on /dev/net/tun device, exiting"); + + if (events & EPOLLIN) + tap_pasta_input(c, now); +} + /** * tap_sock_unix_open() - Create and bind AF_UNIX socket * @sock_path: Socket path. If empty, set on return (UNIX_SOCK_PATH as prefix)

-- Stefano

David Gibson

4 Sep 4 Sep

3:17 a.m.

New subject: [PATCH 1/6] tap: Split out handling of EPOLLIN events

On Tue, Sep 03, 2024 at 09:25:35PM +0200, Stefano Brivio wrote:

...

On Tue, 3 Sep 2024 22:02:30 +1000 David Gibson wrote:

...
Currently, tap_handler_pas{st,ta}() check for EPOLLRDHUP, EPOLLHUP and EPOLLERR events, then assume anything left is EPOLLIN. We have some future cases that may want to also handle EPOLLOUT, so in preparation explicitly handle EPOLLIN, moving the logic to a subfunction.

Signed-off-by: David Gibson [snip] +/** + * tap_passt_input() - Handler for new data on the socket to qemu

tap_pasta_input(), QEMU (we could just call it hypervisor, given that libkrun/krun also use this path).

Updated. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson

David Gibson

3 Sep 3 Sep

2:02 p.m.

New subject: [PATCH 2/6] tap: Improve handling of EINTR in tap_passt_input()

When tap_passt_input() gets an error from recv() it (correctly) does not print any error message for EINTR, EAGAIN or EWOULDBLOCK. However in all three cases it returns from the function. That makes sense for EAGAIN and EWOULDBLOCK, since we then want to wait for the next EPOLLIN event before trying again. For EINTR, however, it makes more sense to retry immediately - as it stands we're likely to get a renewer EPOLLIN event immediately in that case, since we're using level triggered signalling. So, handle EINTR separately by immediately retrying until we succeed or get a different type of error. Signed-off-by: David Gibson --- tap.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/tap.c b/tap.c index 14c88871..9ee59faa 100644 --- a/tap.c +++ b/tap.c @@ -1003,10 +1003,13 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now) memmove(pkt_buf, partial_frame, partial_len); } - n = recv(c->fd_tap, pkt_buf + partial_len, TAP_BUF_BYTES - partial_len, - MSG_DONTWAIT); + do { + n = recv(c->fd_tap, pkt_buf + partial_len, + TAP_BUF_BYTES - partial_len, MSG_DONTWAIT); + } while ((n < 0) && errno == EINTR); + if (n < 0) { - if (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK) { + if (errno != EAGAIN && errno != EWOULDBLOCK) { err_perror("Receive error on guest connection, reset"); tap_sock_reset(c); } -- 2.46.0

Stefano Brivio

9:25 p.m.

New subject: [PATCH 2/6] tap: Improve handling of EINTR in tap_passt_input()

On Tue, 3 Sep 2024 22:02:31 +1000 David Gibson wrote:

...

When tap_passt_input() gets an error from recv() it (correctly) does not print any error message for EINTR, EAGAIN or EWOULDBLOCK. However in all three cases it returns from the function. That makes sense for EAGAIN and EWOULDBLOCK, since we then want to wait for the next EPOLLIN event before trying again. For EINTR, however, it makes more sense to retry immediately - as it stands we're likely to get a renewer EPOLLIN event immediately in that case, since we're using level triggered signalling.

So, handle EINTR separately by immediately retrying until we succeed or get a different type of error.

I don't see an actual improvement: we don't know why we would get EINTR (because of signals) so repeating the recv() right away isn't necessarily a better choice. I'd say whatever way we have of carrying on on EINTR is fine. If this patch helps with brevity for the next ones, it makes sense, but otherwise I don't see a real advantage. Well, it's more consistent with the way we handle EINTR on other calls, but that's about it.

...

Signed-off-by: David Gibson --- tap.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/tap.c b/tap.c index 14c88871..9ee59faa 100644 --- a/tap.c +++ b/tap.c @@ -1003,10 +1003,13 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now) memmove(pkt_buf, partial_frame, partial_len); }

- n = recv(c->fd_tap, pkt_buf + partial_len, TAP_BUF_BYTES - partial_len, - MSG_DONTWAIT); + do { + n = recv(c->fd_tap, pkt_buf + partial_len, + TAP_BUF_BYTES - partial_len, MSG_DONTWAIT); + } while ((n < 0) && errno == EINTR); + if (n < 0) { - if (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK) { + if (errno != EAGAIN && errno != EWOULDBLOCK) { err_perror("Receive error on guest connection, reset"); tap_sock_reset(c); }

-- Stefano

David Gibson

4 Sep 4 Sep

3:30 a.m.

New subject: [PATCH 2/6] tap: Improve handling of EINTR in tap_passt_input()

On Tue, Sep 03, 2024 at 09:25:39PM +0200, Stefano Brivio wrote:

...

On Tue, 3 Sep 2024 22:02:31 +1000 David Gibson wrote:

...
When tap_passt_input() gets an error from recv() it (correctly) does not print any error message for EINTR, EAGAIN or EWOULDBLOCK. However in all three cases it returns from the function. That makes sense for EAGAIN and EWOULDBLOCK, since we then want to wait for the next EPOLLIN event before trying again. For EINTR, however, it makes more sense to retry immediately - as it stands we're likely to get a renewer EPOLLIN event immediately in that case, since we're using level triggered signalling.

So, handle EINTR separately by immediately retrying until we succeed or get a different type of error.

I don't see an actual improvement: we don't know why we would get EINTR (because of signals) so repeating the recv() right away isn't necessarily a better choice.

My understanding is that EINTR means the system call was aborted because a signal handler happened during it. It usually doesn't matter why we got the signal - we still need to redo the system call and we might as well do so immediately.

...

I'd say whatever way we have of carrying on on EINTR is fine. If this patch helps with brevity for the next ones, it makes sense, but otherwise I don't see a real advantage.

Right. Specifically in order to use EPOLLET, I need to _not_ treat EINTR the same way as EAGAIN and EWOULDBLOCK.

...

Well, it's more consistent with the way we handle EINTR on other calls, but that's about it.

That too. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson

David Gibson

3 Sep 3 Sep

2:02 p.m.

New subject: [PATCH 3/6] tap: Restructure in tap_pasta_input()

tap_pasta_input() has a rather confusing structure, using two gotos. Remove these by restructuring the function to have the main loop condition based on filling our buffer space, with errors or running out of data treated as the exception, rather than the other way around. This allows us to handle the EINTR which triggered the 'restart' goto with a continue. The outer 'redo' was triggered if we completely filled our buffer, to flush it and do another pass. This one is unnecessary since we don't (yet) use EPOLLET on the tap device: if there's still more data we'll get another event and re-enter the loop. Along the way handle a couple of extra edge cases: - Check for EWOULDBLOCK as well as EAGAIN for the benefit of any future ports where those might not have the same value - Detect EOF on the tap device and exit in that case Signed-off-by: David Gibson --- tap.c | 42 +++++++++++++++++------------------------- 1 file changed, 17 insertions(+), 25 deletions(-) diff --git a/tap.c b/tap.c index 9ee59faa..71742748 100644 --- a/tap.c +++ b/tap.c @@ -1073,42 +1073,34 @@ void tap_handler_passt(struct ctx *c, uint32_t events, static void tap_pasta_input(struct ctx *c, const struct timespec *now) { ssize_t n, len; - int ret; - -redo: - n = 0; tap_flush_pools(); -restart: - while ((len = read(c->fd_tap, pkt_buf + n, TAP_BUF_BYTES - n)) > 0) { - if (len < (ssize_t)sizeof(struct ethhdr) || - len > (ssize_t)ETH_MAX_MTU) { - n += len; - continue; + for (n = 0; n < (ssize_t)TAP_BUF_BYTES; n += len) { + len = read(c->fd_tap, pkt_buf + n, TAP_BUF_BYTES - n); + + if (len <= 0) { + if (errno == EINTR) { + len = 0; + continue; + } + break; } + /* Ignore frames of bad length */ + if (len < (ssize_t)sizeof(struct ethhdr) || + len > (ssize_t)ETH_MAX_MTU) + continue; tap_add_packet(c, len, pkt_buf + n); - - if ((n += len) == TAP_BUF_BYTES) - break; } - if (len < 0 && errno == EINTR) - goto restart; - - ret = errno; + if (len < 0 && errno != EAGAIN && errno != EWOULDBLOCK) + die("Error on tap device, exiting"); + else if (len == 0) + die("EOF on tap device, exiting"); tap_handler(c, now); - - if (len > 0 || ret == EAGAIN) - return; - - if (n == TAP_BUF_BYTES) - goto redo; - - die("Error on tap device, exiting"); } /** -- 2.46.0

Stefano Brivio

9:25 p.m.

New subject: [PATCH 3/6] tap: Restructure in tap_pasta_input()

On Tue, 3 Sep 2024 22:02:32 +1000 David Gibson wrote:

...

tap_pasta_input() has a rather confusing structure, using two gotos. Remove these by restructuring the function to have the main loop condition based on filling our buffer space, with errors or running out of data treated as the exception, rather than the other way around. This allows us to handle the EINTR which triggered the 'restart' goto with a continue.

The outer 'redo' was triggered if we completely filled our buffer, to flush it and do another pass. This one is unnecessary since we don't (yet) use EPOLLET on the tap device: if there's still more data we'll get another event and re-enter the loop.

Along the way handle a couple of extra edge cases: - Check for EWOULDBLOCK as well as EAGAIN for the benefit of any future ports where those might not have the same value - Detect EOF on the tap device and exit in that case

Signed-off-by: David Gibson --- tap.c | 42 +++++++++++++++++------------------------- 1 file changed, 17 insertions(+), 25 deletions(-)

diff --git a/tap.c b/tap.c index 9ee59faa..71742748 100644 --- a/tap.c +++ b/tap.c @@ -1073,42 +1073,34 @@ void tap_handler_passt(struct ctx *c, uint32_t events, static void tap_pasta_input(struct ctx *c, const struct timespec *now) { ssize_t n, len; - int ret; - -redo: - n = 0;

tap_flush_pools(); -restart: - while ((len = read(c->fd_tap, pkt_buf + n, TAP_BUF_BYTES - n)) > 0) {

- if (len < (ssize_t)sizeof(struct ethhdr) || - len > (ssize_t)ETH_MAX_MTU) { - n += len; - continue; + for (n = 0; n < (ssize_t)TAP_BUF_BYTES; n += len) { + len = read(c->fd_tap, pkt_buf + n, TAP_BUF_BYTES - n); + + if (len <= 0) { + if (errno == EINTR) {

You might be checking errno when read() returns 0, in which case it's not set. I guess you should zero out errno before read() if you want to keep this, or, alternatively, directly handle len == 0 here.

...

+ len = 0; + continue; + } + break; }

+ /* Ignore frames of bad length */ + if (len < (ssize_t)sizeof(struct ethhdr) || + len > (ssize_t)ETH_MAX_MTU) + continue;

tap_add_packet(c, len, pkt_buf + n); - - if ((n += len) == TAP_BUF_BYTES) - break; }

- if (len < 0 && errno == EINTR) - goto restart; - - ret = errno; + if (len < 0 && errno != EAGAIN && errno != EWOULDBLOCK) + die("Error on tap device, exiting"); + else if (len == 0) + die("EOF on tap device, exiting");

tap_handler(c, now); - - if (len > 0 || ret == EAGAIN) - return; - - if (n == TAP_BUF_BYTES) - goto redo; - - die("Error on tap device, exiting"); }

/**

-- Stefano

David Gibson

4 Sep 4 Sep

3:33 a.m.

New subject: [PATCH 3/6] tap: Restructure in tap_pasta_input()

On Tue, Sep 03, 2024 at 09:25:42PM +0200, Stefano Brivio wrote:

...

On Tue, 3 Sep 2024 22:02:32 +1000 David Gibson wrote:

...
tap_pasta_input() has a rather confusing structure, using two gotos. Remove these by restructuring the function to have the main loop condition based on filling our buffer space, with errors or running out of data treated as the exception, rather than the other way around. This allows us to handle the EINTR which triggered the 'restart' goto with a continue.

The outer 'redo' was triggered if we completely filled our buffer, to flush it and do another pass. This one is unnecessary since we don't (yet) use EPOLLET on the tap device: if there's still more data we'll get another event and re-enter the loop.

Along the way handle a couple of extra edge cases: - Check for EWOULDBLOCK as well as EAGAIN for the benefit of any future ports where those might not have the same value - Detect EOF on the tap device and exit in that case

Signed-off-by: David Gibson --- tap.c | 42 +++++++++++++++++------------------------- 1 file changed, 17 insertions(+), 25 deletions(-)

diff --git a/tap.c b/tap.c index 9ee59faa..71742748 100644 --- a/tap.c +++ b/tap.c @@ -1073,42 +1073,34 @@ void tap_handler_passt(struct ctx *c, uint32_t events, static void tap_pasta_input(struct ctx *c, const struct timespec *now) { ssize_t n, len; - int ret; - -redo: - n = 0;

tap_flush_pools(); -restart: - while ((len = read(c->fd_tap, pkt_buf + n, TAP_BUF_BYTES - n)) > 0) {

- if (len < (ssize_t)sizeof(struct ethhdr) || - len > (ssize_t)ETH_MAX_MTU) { - n += len; - continue; + for (n = 0; n < (ssize_t)TAP_BUF_BYTES; n += len) { + len = read(c->fd_tap, pkt_buf + n, TAP_BUF_BYTES - n); + + if (len <= 0) { + if (errno == EINTR) {

You might be checking errno when read() returns 0, in which case it's not set. I guess you should zero out errno before read() if you want to keep this, or, alternatively, directly handle len == 0 here.

Good catch. I've re-organised to avoid that.

...

...
+ len = 0; + continue; + } + break; }

+ /* Ignore frames of bad length */ + if (len < (ssize_t)sizeof(struct ethhdr) || + len > (ssize_t)ETH_MAX_MTU) + continue;

tap_add_packet(c, len, pkt_buf + n); - - if ((n += len) == TAP_BUF_BYTES) - break; }

- if (len < 0 && errno == EINTR) - goto restart; - - ret = errno; + if (len < 0 && errno != EAGAIN && errno != EWOULDBLOCK) + die("Error on tap device, exiting"); + else if (len == 0) + die("EOF on tap device, exiting");

tap_handler(c, now); - - if (len > 0 || ret == EAGAIN) - return; - - if (n == TAP_BUF_BYTES) - goto redo; - - die("Error on tap device, exiting"); }

/**

-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson

David Gibson

3 Sep 3 Sep

2:02 p.m.

New subject: [PATCH 4/6] tap: Don't risk truncating frames on full buffer in tap_pasta_input()

tap_pasta_input() keeps reading frames from the tap device until the buffer is full. However, this has an ugly edge case, when we get close to buffer full, we will provide just the remaining space as a read() buffer. If this is shorter than the next frame to read, the tap device will truncate the frame and discard the remainder. Adjust the code to make sure we always have room for a maximum size frame. Signed-off-by: David Gibson --- tap.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tap.c b/tap.c index 71742748..2fbcef04 100644 --- a/tap.c +++ b/tap.c @@ -1076,8 +1076,8 @@ static void tap_pasta_input(struct ctx *c, const struct timespec *now) tap_flush_pools(); - for (n = 0; n < (ssize_t)TAP_BUF_BYTES; n += len) { - len = read(c->fd_tap, pkt_buf + n, TAP_BUF_BYTES - n); + for (n = 0; n < (ssize_t)TAP_BUF_BYTES - ETH_MAX_MTU; n += len) { + len = read(c->fd_tap, pkt_buf + n, ETH_MAX_MTU); if (len <= 0) { if (errno == EINTR) { -- 2.46.0

Stefano Brivio

9:25 p.m.

New subject: [PATCH 4/6] tap: Don't risk truncating frames on full buffer in tap_pasta_input()

On Tue, 3 Sep 2024 22:02:33 +1000 David Gibson wrote:

...

tap_pasta_input() keeps reading frames from the tap device until the buffer is full. However, this has an ugly edge case, when we get close to buffer full, we will provide just the remaining space as a read() buffer. If this is shorter than the next frame to read, the tap device will truncate the frame and discard the remainder.

Adjust the code to make sure we always have room for a maximum size frame.

Signed-off-by: David Gibson --- tap.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tap.c b/tap.c index 71742748..2fbcef04 100644 --- a/tap.c +++ b/tap.c @@ -1076,8 +1076,8 @@ static void tap_pasta_input(struct ctx *c, const struct timespec *now)

tap_flush_pools();

- for (n = 0; n < (ssize_t)TAP_BUF_BYTES; n += len) { - len = read(c->fd_tap, pkt_buf + n, TAP_BUF_BYTES - n); + for (n = 0; n < (ssize_t)TAP_BUF_BYTES - ETH_MAX_MTU; n += len) {

Logically speaking, this should be TAP_BUF_BYTES - ETH_MAX_MTU + 1, I guess, because if we have ETH_MAX_MTU bytes left, we can read another frame. Only if we have strictly less than that we might truncate one. In any case, not a strong preference, and perhaps this version is actually more readable.

...

+ len = read(c->fd_tap, pkt_buf + n, ETH_MAX_MTU);

if (len <= 0) { if (errno == EINTR) {

-- Stefano

David Gibson

4 Sep 4 Sep

3:33 a.m.

New subject: [PATCH 4/6] tap: Don't risk truncating frames on full buffer in tap_pasta_input()

On Tue, Sep 03, 2024 at 09:25:46PM +0200, Stefano Brivio wrote:

...

On Tue, 3 Sep 2024 22:02:33 +1000 David Gibson wrote:

...
tap_pasta_input() keeps reading frames from the tap device until the buffer is full. However, this has an ugly edge case, when we get close to buffer full, we will provide just the remaining space as a read() buffer. If this is shorter than the next frame to read, the tap device will truncate the frame and discard the remainder.

Adjust the code to make sure we always have room for a maximum size frame.

Signed-off-by: David Gibson --- tap.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tap.c b/tap.c index 71742748..2fbcef04 100644 --- a/tap.c +++ b/tap.c @@ -1076,8 +1076,8 @@ static void tap_pasta_input(struct ctx *c, const struct timespec *now)

tap_flush_pools();

- for (n = 0; n < (ssize_t)TAP_BUF_BYTES; n += len) { - len = read(c->fd_tap, pkt_buf + n, TAP_BUF_BYTES - n); + for (n = 0; n < (ssize_t)TAP_BUF_BYTES - ETH_MAX_MTU; n += len) {

Logically speaking, this should be TAP_BUF_BYTES - ETH_MAX_MTU + 1, I guess, because if we have ETH_MAX_MTU bytes left, we can read another frame. Only if we have strictly less than that we might truncate one.

Good point. I've changed the < to <= to correct.

...

In any case, not a strong preference, and perhaps this version is actually more readable.

...
+ len = read(c->fd_tap, pkt_buf + n, ETH_MAX_MTU);

if (len <= 0) { if (errno == EINTR) {

-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson

David Gibson

3 Sep 3 Sep

2:02 p.m.

New subject: [PATCH 5/6] tap: Re-introduce EPOLLET for tap connections

Since 4684f603446b ("tap: Don't use EPOLLET on Qemu sockets") we've only used level-triggered events for the tap device. Prior to that we used it inconsistently which caused some problems. We want to add support for EPOLLOUT events on the tap connection, and without EPOLLET that would require toggling EPOLLOUT on and off, which is awkward. So, re-introduce EPOLLET, but now use it uniformly for all tap modes. The main change this requires is making sure on EPOLLIN we loop until all there's no more data to process. Signed-off-by: David Gibson --- tap.c | 28 +++++++++++++++++++--------- 1 file changed, 19 insertions(+), 9 deletions(-) diff --git a/tap.c b/tap.c index 2fbcef04..d7d3fc19 100644 --- a/tap.c +++ b/tap.c @@ -985,8 +985,10 @@ static void tap_sock_reset(struct ctx *c) * tap_passt_input() - Handler for new data on the socket to qemu * @c: Execution context * @now: Current timestamp + * + * Return: true if there may be additional data to read, false otherwise */ -static void tap_passt_input(struct ctx *c, const struct timespec *now) +static bool tap_passt_input(struct ctx *c, const struct timespec *now) { static const char *partial_frame; static ssize_t partial_len = 0; @@ -1013,7 +1015,7 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now) err_perror("Receive error on guest connection, reset"); tap_sock_reset(c); } - return; + return false; } p = pkt_buf; @@ -1025,7 +1027,7 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now) if (l2len < sizeof(struct ethhdr) || l2len > ETH_MAX_MTU) { err("Bad frame size from guest, resetting connection"); tap_sock_reset(c); - return; + return false; } if (l2len + sizeof(uint32_t) > (size_t)n) @@ -1045,6 +1047,8 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now) partial_frame = p; tap_handler(c, now); + + return true; } /** @@ -1062,15 +1066,18 @@ void tap_handler_passt(struct ctx *c, uint32_t events, } if (events & EPOLLIN) - tap_passt_input(c, now); + while (tap_passt_input(c, now)) + ; } /** * tap_passt_input() - Handler for new data on the socket to qemu * @c: Execution context * @now: Current timestamp + * + * Return: true if there may be additional data to read, false otherwise */ -static void tap_pasta_input(struct ctx *c, const struct timespec *now) +static bool tap_pasta_input(struct ctx *c, const struct timespec *now) { ssize_t n, len; @@ -1101,6 +1108,8 @@ static void tap_pasta_input(struct ctx *c, const struct timespec *now) die("EOF on tap device, exiting"); tap_handler(c, now); + + return len > 0; } /** @@ -1116,7 +1125,8 @@ void tap_handler_pasta(struct ctx *c, uint32_t events, die("Disconnect event on /dev/net/tun device, exiting"); if (events & EPOLLIN) - tap_pasta_input(c, now); + while (tap_pasta_input(c, now)) + ; } /** @@ -1250,7 +1260,7 @@ void tap_listen_handler(struct ctx *c, uint32_t events) trace("tap: failed to set SO_SNDBUF to %i", v); ref.fd = c->fd_tap; - ev.events = EPOLLIN | EPOLLRDHUP; + ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET; ev.data.u64 = ref.u64; epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev); } @@ -1306,7 +1316,7 @@ static void tap_sock_tun_init(struct ctx *c) pasta_ns_conf(c); ref.fd = c->fd_tap; - ev.events = EPOLLIN | EPOLLRDHUP; + ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET; ev.data.u64 = ref.u64; epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev); } @@ -1339,7 +1349,7 @@ void tap_sock_init(struct ctx *c) else ref.type = EPOLL_TYPE_TAP_PASTA; - ev.events = EPOLLIN | EPOLLRDHUP; + ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET; ev.data.u64 = ref.u64; epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev); return; -- 2.46.0

Stefano Brivio

9:25 p.m.

New subject: [PATCH 5/6] tap: Re-introduce EPOLLET for tap connections

On Tue, 3 Sep 2024 22:02:34 +1000 David Gibson wrote:

...

Since 4684f603446b ("tap: Don't use EPOLLET on Qemu sockets") we've only used level-triggered events for the tap device. Prior to that we used it inconsistently which caused some problems.

It didn't actually cause any issue according to your commit message for 4684f603446b itself.

...

We want to add support for EPOLLOUT events on the tap connection, and without EPOLLET that would require toggling EPOLLOUT on and off, which is awkward. So, re-introduce EPOLLET, but now use it uniformly for all tap modes. The main change this requires is making sure on EPOLLIN we loop until all there's no more data to process.

Signed-off-by: David Gibson --- tap.c | 28 +++++++++++++++++++--------- 1 file changed, 19 insertions(+), 9 deletions(-)

diff --git a/tap.c b/tap.c index 2fbcef04..d7d3fc19 100644 --- a/tap.c +++ b/tap.c @@ -985,8 +985,10 @@ static void tap_sock_reset(struct ctx *c) * tap_passt_input() - Handler for new data on the socket to qemu * @c: Execution context * @now: Current timestamp + * + * Return: true if there may be additional data to read, false otherwise */ -static void tap_passt_input(struct ctx *c, const struct timespec *now) +static bool tap_passt_input(struct ctx *c, const struct timespec *now) { static const char *partial_frame; static ssize_t partial_len = 0; @@ -1013,7 +1015,7 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now) err_perror("Receive error on guest connection, reset"); tap_sock_reset(c); } - return; + return false; }

p = pkt_buf; @@ -1025,7 +1027,7 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now) if (l2len < sizeof(struct ethhdr) || l2len > ETH_MAX_MTU) { err("Bad frame size from guest, resetting connection"); tap_sock_reset(c); - return; + return false; }

if (l2len + sizeof(uint32_t) > (size_t)n) @@ -1045,6 +1047,8 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now) partial_frame = p;

tap_handler(c, now); + + return true; }

/** @@ -1062,15 +1066,18 @@ void tap_handler_passt(struct ctx *c, uint32_t events, }

if (events & EPOLLIN) - tap_passt_input(c, now); + while (tap_passt_input(c, now)) + ;

Nit (same below): use curly brackets for multi-line block. Or just use: while (tap_passt_input(c, now));

...

}

/** * tap_passt_input() - Handler for new data on the socket to qemu * @c: Execution context * @now: Current timestamp + * + * Return: true if there may be additional data to read, false otherwise */ -static void tap_pasta_input(struct ctx *c, const struct timespec *now) +static bool tap_pasta_input(struct ctx *c, const struct timespec *now) { ssize_t n, len;

@@ -1101,6 +1108,8 @@ static void tap_pasta_input(struct ctx *c, const struct timespec *now) die("EOF on tap device, exiting");

tap_handler(c, now); + + return len > 0; }

/** @@ -1116,7 +1125,8 @@ void tap_handler_pasta(struct ctx *c, uint32_t events, die("Disconnect event on /dev/net/tun device, exiting");

if (events & EPOLLIN) - tap_pasta_input(c, now); + while (tap_pasta_input(c, now)) + ; }

/** @@ -1250,7 +1260,7 @@ void tap_listen_handler(struct ctx *c, uint32_t events) trace("tap: failed to set SO_SNDBUF to %i", v);

ref.fd = c->fd_tap; - ev.events = EPOLLIN | EPOLLRDHUP; + ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET; ev.data.u64 = ref.u64; epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev); } @@ -1306,7 +1316,7 @@ static void tap_sock_tun_init(struct ctx *c) pasta_ns_conf(c);

ref.fd = c->fd_tap; - ev.events = EPOLLIN | EPOLLRDHUP; + ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET; ev.data.u64 = ref.u64; epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev); } @@ -1339,7 +1349,7 @@ void tap_sock_init(struct ctx *c) else ref.type = EPOLL_TYPE_TAP_PASTA;

- ev.events = EPOLLIN | EPOLLRDHUP; + ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET; ev.data.u64 = ref.u64; epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev); return;

-- Stefano

David Gibson

4 Sep 4 Sep

3:36 a.m.

New subject: [PATCH 5/6] tap: Re-introduce EPOLLET for tap connections

On Tue, Sep 03, 2024 at 09:25:50PM +0200, Stefano Brivio wrote:

...

On Tue, 3 Sep 2024 22:02:34 +1000 David Gibson wrote:

...
Since 4684f603446b ("tap: Don't use EPOLLET on Qemu sockets") we've only used level-triggered events for the tap device. Prior to that we used it inconsistently which caused some problems.

It didn't actually cause any issue according to your commit message for 4684f603446b itself.

I've reworded this.

...

...
We want to add support for EPOLLOUT events on the tap connection, and without EPOLLET that would require toggling EPOLLOUT on and off, which is awkward. So, re-introduce EPOLLET, but now use it uniformly for all tap modes. The main change this requires is making sure on EPOLLIN we loop until all there's no more data to process.

Signed-off-by: David Gibson --- tap.c | 28 +++++++++++++++++++--------- 1 file changed, 19 insertions(+), 9 deletions(-)

diff --git a/tap.c b/tap.c index 2fbcef04..d7d3fc19 100644 --- a/tap.c +++ b/tap.c @@ -985,8 +985,10 @@ static void tap_sock_reset(struct ctx *c) * tap_passt_input() - Handler for new data on the socket to qemu * @c: Execution context * @now: Current timestamp + * + * Return: true if there may be additional data to read, false otherwise */ -static void tap_passt_input(struct ctx *c, const struct timespec *now) +static bool tap_passt_input(struct ctx *c, const struct timespec *now) { static const char *partial_frame; static ssize_t partial_len = 0; @@ -1013,7 +1015,7 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now) err_perror("Receive error on guest connection, reset"); tap_sock_reset(c); } - return; + return false; }

p = pkt_buf; @@ -1025,7 +1027,7 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now) if (l2len < sizeof(struct ethhdr) || l2len > ETH_MAX_MTU) { err("Bad frame size from guest, resetting connection"); tap_sock_reset(c); - return; + return false; }

if (l2len + sizeof(uint32_t) > (size_t)n) @@ -1045,6 +1047,8 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now) partial_frame = p;

tap_handler(c, now); + + return true; }

/** @@ -1062,15 +1066,18 @@ void tap_handler_passt(struct ctx *c, uint32_t events, }

if (events & EPOLLIN) - tap_passt_input(c, now); + while (tap_passt_input(c, now)) + ;

Nit (same below): use curly brackets for multi-line block. Or just use:

while (tap_passt_input(c, now));

Fixed.

...

...
}

/** * tap_passt_input() - Handler for new data on the socket to qemu * @c: Execution context * @now: Current timestamp + * + * Return: true if there may be additional data to read, false otherwise */ -static void tap_pasta_input(struct ctx *c, const struct timespec *now) +static bool tap_pasta_input(struct ctx *c, const struct timespec *now) { ssize_t n, len;

@@ -1101,6 +1108,8 @@ static void tap_pasta_input(struct ctx *c, const struct timespec *now) die("EOF on tap device, exiting");

tap_handler(c, now); + + return len > 0; }

/** @@ -1116,7 +1125,8 @@ void tap_handler_pasta(struct ctx *c, uint32_t events, die("Disconnect event on /dev/net/tun device, exiting");

if (events & EPOLLIN) - tap_pasta_input(c, now); + while (tap_pasta_input(c, now)) + ; }

/** @@ -1250,7 +1260,7 @@ void tap_listen_handler(struct ctx *c, uint32_t events) trace("tap: failed to set SO_SNDBUF to %i", v);

ref.fd = c->fd_tap; - ev.events = EPOLLIN | EPOLLRDHUP; + ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET; ev.data.u64 = ref.u64; epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev); } @@ -1306,7 +1316,7 @@ static void tap_sock_tun_init(struct ctx *c) pasta_ns_conf(c);

ref.fd = c->fd_tap; - ev.events = EPOLLIN | EPOLLRDHUP; + ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET; ev.data.u64 = ref.u64; epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev); } @@ -1339,7 +1349,7 @@ void tap_sock_init(struct ctx *c) else ref.type = EPOLL_TYPE_TAP_PASTA;

- ev.events = EPOLLIN | EPOLLRDHUP; + ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET; ev.data.u64 = ref.u64; epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev); return;

-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson

David Gibson

3 Sep 3 Sep

2:02 p.m.

New subject: [PATCH 6/6] tap: Stub EPOLLOUT handling

Process EPOLLOUT events. For now this is just a stub, and needs to be extended to call into the various protocols to actually do something useful. Signed-off-by: David Gibson --- tap.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) diff --git a/tap.c b/tap.c index d7d3fc19..3754fffe 100644 --- a/tap.c +++ b/tap.c @@ -932,6 +932,18 @@ void tap_handler(struct ctx *c, const struct timespec *now) tap6_handler(c, pool_tap6, now); } +/** + * tap_ready() - Notify the rest of passt that the tap device is ready for more data + * @c: Execution context + * @now: Current timestamp + */ +static void tap_ready(struct ctx *c, const struct timespec *now) +{ + (void)c; + (void)now; + /* TODO: notify protocols */ +} + /** * tap_add_packet() - Queue/capture packet, update notion of guest MAC address * @c: Execution context @@ -1068,6 +1080,9 @@ void tap_handler_passt(struct ctx *c, uint32_t events, if (events & EPOLLIN) while (tap_passt_input(c, now)) ; + + if (events & EPOLLOUT) + tap_ready(c, now); } /** @@ -1127,6 +1142,9 @@ void tap_handler_pasta(struct ctx *c, uint32_t events, if (events & EPOLLIN) while (tap_pasta_input(c, now)) ; + + if (events & EPOLLOUT) + tap_ready(c, now); } /** -- 2.46.0

Stefano Brivio

9:25 p.m.

On Tue, 3 Sep 2024 22:02:29 +1000 David Gibson wrote:

...

This is a draft patch working towards adding EPOLLOUT handling to the tap code, which could then be used to "unstick" flows which have unsent data from the socket side. For now that's just a stub, but makes what I think are some worthwhile cleanups to the tap side event handling in the meantime.

Except for the issue in 3/6 and nits elsewhere, it all makes sense and tap-side EPOLLOUT handling is definitely going to be an improvement. I wonder if it's the right moment for this kind of series, though, in terms of future bisections, as long as we're grappling with https://github.com/containers/podman/issues/23686 and https://bugs.passt.top/show_bug.cgi?id=94. Assuming, of course, that this series doesn't fix anything. That is, once/if we come up with fixes for those, as they might involve setting different event masks, I'd rather have those in *before* this series, to avoid further noise in case we manage to break something else with those hypothetical fixes. -- Stefano

David Gibson

4 Sep 4 Sep

5:17 a.m.

On Tue, Sep 03, 2024 at 09:25:54PM +0200, Stefano Brivio wrote:

...

On Tue, 3 Sep 2024 22:02:29 +1000 David Gibson wrote:

...
This is a draft patch working towards adding EPOLLOUT handling to the tap code, which could then be used to "unstick" flows which have unsent data from the socket side. For now that's just a stub, but makes what I think are some worthwhile cleanups to the tap side event handling in the meantime.

Except for the issue in 3/6 and nits elsewhere, it all makes sense and tap-side EPOLLOUT handling is definitely going to be an improvement.

I wonder if it's the right moment for this kind of series, though, in terms of future bisections, as long as we're grappling with https://github.com/containers/podman/issues/23686 and https://bugs.passt.top/show_bug.cgi?id=94. Assuming, of course, that this series doesn't fix anything.

I don't think this series will fix anything as it stands. It is, indirectly, aimed at addressing bug 94. I'm struggling to figure out what to do with bug 94, because I find it almost impossible to reason about the current event masks in TCP. I'd really like to simplify them so it's clearer what's correct and not and I think the most obvious path to doing so is using EPOLLET all the time. That requires some sort of kick when the tap is ready to accept more data, hence this series as a prerequisite.

...

That is, once/if we come up with fixes for those, as they might involve setting different event masks, I'd rather have those in *before* this series, to avoid further noise in case we manage to break something else with those hypothetical fixes.

Right, I understand the impetus. Although as I said I find the current TCP event handling nigh-incomprehensible so I'm not as yet confident we can find a small fix without cleaning up the event handling more generally. That said, these changes to tap side event handling are a prerequisite / preliminary and shouldn't as yet really alter the TCP event flow. So I don't think this series will of itself make bisection harder, although follow on things based on it might. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson

Stefano Brivio

7:19 p.m.

On Wed, 4 Sep 2024 13:17:53 +1000 David Gibson wrote:

...

On Tue, Sep 03, 2024 at 09:25:54PM +0200, Stefano Brivio wrote:

...
On Tue, 3 Sep 2024 22:02:29 +1000 David Gibson wrote:

...
This is a draft patch working towards adding EPOLLOUT handling to the tap code, which could then be used to "unstick" flows which have unsent data from the socket side. For now that's just a stub, but makes what I think are some worthwhile cleanups to the tap side event handling in the meantime.

Except for the issue in 3/6 and nits elsewhere, it all makes sense and tap-side EPOLLOUT handling is definitely going to be an improvement.

I wonder if it's the right moment for this kind of series, though, in terms of future bisections, as long as we're grappling with https://github.com/containers/podman/issues/23686 and https://bugs.passt.top/show_bug.cgi?id=94. Assuming, of course, that this series doesn't fix anything.

I don't think this series will fix anything as it stands. It is, indirectly, aimed at addressing bug 94. I'm struggling to figure out what to do with bug 94, because I find it almost impossible to reason about the current event masks in TCP.

I don't see at the moment anything indicating TCP issues other than the one you addressed with your tentative debug patch at: https://passt.top/passt/commit/?h=podman23686&id=026fb71d1dde60135d95741552906fd5320384bc Given that, with that patch, we had at least another report of event storms, this time on UDP, that is, the one from: https://github.com/containers/podman/issues/23686#issuecomment-2324945010 I shared this other one on top: https://passt.top/passt/commit/?h=podman23686&id=0c6c20dee5c24bd324834a99f409ad43c50812ae

...

I'd really like to simplify them so it's clearer what's correct and not and I think the most obvious path to doing so is using EPOLLET all the time. That requires some sort of kick when the tap is ready to accept more data, hence this series as a prerequisite.

Sure, it's going to be simpler and more robust, but on the other hand we wouldn't notice these kind of issues.

...

...
That is, once/if we come up with fixes for those, as they might involve setting different event masks, I'd rather have those in *before* this series, to avoid further noise in case we manage to break something else with those hypothetical fixes.

Right, I understand the impetus. Although as I said I find the current TCP event handling nigh-incomprehensible so I'm not as yet confident we can find a small fix without cleaning up the event handling more generally.

I'm not sure either, but I don't think we have any indication, at the moment, that any of the issues from those two tickets have anything to do with TCP event handling (minus the one you tentatively fixed).

...

That said, these changes to tap side event handling are a prerequisite / preliminary and shouldn't as yet really alter the TCP event flow. So I don't think this series will of itself make bisection harder, although follow on things based on it might.

I understand that they shouldn't alter it, but if we missed something subtle and they actually do, they'll make bisection more complicated. If this series is only needed for switching TCP sockets to EPOLLET (well, minus 4/6, which is a fix on its own), maybe we could wait until you have the whole thing ready (and, hopefully, we manage to fix those two tickets meanwhile)? -- Stefano

David Gibson

5 Sep 5 Sep

2:35 a.m.

On Wed, Sep 04, 2024 at 07:19:22PM +0200, Stefano Brivio wrote:

...

On Wed, 4 Sep 2024 13:17:53 +1000 David Gibson wrote:

...
On Tue, Sep 03, 2024 at 09:25:54PM +0200, Stefano Brivio wrote:

...
On Tue, 3 Sep 2024 22:02:29 +1000 David Gibson wrote:

...
This is a draft patch working towards adding EPOLLOUT handling to the tap code, which could then be used to "unstick" flows which have unsent data from the socket side. For now that's just a stub, but makes what I think are some worthwhile cleanups to the tap side event handling in the meantime.

Except for the issue in 3/6 and nits elsewhere, it all makes sense and tap-side EPOLLOUT handling is definitely going to be an improvement.

I wonder if it's the right moment for this kind of series, though, in terms of future bisections, as long as we're grappling with https://github.com/containers/podman/issues/23686 and https://bugs.passt.top/show_bug.cgi?id=94. Assuming, of course, that this series doesn't fix anything.

I don't think this series will fix anything as it stands. It is, indirectly, aimed at addressing bug 94. I'm struggling to figure out what to do with bug 94, because I find it almost impossible to reason about the current event masks in TCP.

I don't see at the moment anything indicating TCP issues other than the one you addressed with your tentative debug patch at:

https://passt.top/passt/commit/?h=podman23686&id=026fb71d1dde60135d95741552906fd5320384bc

Given that, with that patch, we had at least another report of event storms, this time on UDP, that is, the one from:

https://github.com/containers/podman/issues/23686#issuecomment-2324945010

I shared this other one on top:

https://passt.top/passt/commit/?h=podman23686&id=0c6c20dee5c24bd324834a99f409ad43c50812ae

Ah, nice.

...

...
I'd really like to simplify them so it's clearer what's correct and not and I think the most obvious path to doing so is using EPOLLET all the time. That requires some sort of kick when the tap is ready to accept more data, hence this series as a prerequisite.

Sure, it's going to be simpler and more robust, but on the other hand we wouldn't notice these kind of issues.

Uh.. I'm confused. In what way would we not notice issues, other than the issues not existing which.. would be good, right?

...

...
...
That is, once/if we come up with fixes for those, as they might involve setting different event masks, I'd rather have those in *before* this series, to avoid further noise in case we manage to break something else with those hypothetical fixes.

Right, I understand the impetus. Although as I said I find the current TCP event handling nigh-incomprehensible so I'm not as yet confident we can find a small fix without cleaning up the event handling more generally.

I'm not sure either, but I don't think we have any indication, at the moment, that any of the issues from those two tickets have anything to do with TCP event handling (minus the one you tentatively fixed).

Right, this reasoning is pretty much specific to the EPOLLRDHUP storm. I may have written some of the descriptions before registering that the EPOLLERR storm was UDP and therefore unrelated.

...

...
That said, these changes to tap side event handling are a prerequisite / preliminary and shouldn't as yet really alter the TCP event flow. So I don't think this series will of itself make bisection harder, although follow on things based on it might.

I understand that they shouldn't alter it, but if we missed something subtle and they actually do, they'll make bisection more complicated.

I guess. Seems pretty unlikely to me given this doesn't touch the TCP events themselves.

...

If this series is only needed for switching TCP sockets to EPOLLET (well, minus 4/6, which is a fix on its own), maybe we could wait until you have the whole thing ready (and, hopefully, we manage to fix those two tickets meanwhile)?

Right, I'm ok to wait on this until I have the whole picture including TCP event masks as well. That's kind of why it's an RFC. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson

Stefano Brivio

10:32 a.m.

On Thu, 5 Sep 2024 10:35:14 +1000 David Gibson wrote:

...

On Wed, Sep 04, 2024 at 07:19:22PM +0200, Stefano Brivio wrote:

...
On Wed, 4 Sep 2024 13:17:53 +1000 David Gibson wrote:

...
On Tue, Sep 03, 2024 at 09:25:54PM +0200, Stefano Brivio wrote:

...
On Tue, 3 Sep 2024 22:02:29 +1000 David Gibson wrote:

...
This is a draft patch working towards adding EPOLLOUT handling to the tap code, which could then be used to "unstick" flows which have unsent data from the socket side. For now that's just a stub, but makes what I think are some worthwhile cleanups to the tap side event handling in the meantime.

Except for the issue in 3/6 and nits elsewhere, it all makes sense and tap-side EPOLLOUT handling is definitely going to be an improvement.

I wonder if it's the right moment for this kind of series, though, in terms of future bisections, as long as we're grappling with https://github.com/containers/podman/issues/23686 and https://bugs.passt.top/show_bug.cgi?id=94. Assuming, of course, that this series doesn't fix anything.

I don't think this series will fix anything as it stands. It is, indirectly, aimed at addressing bug 94. I'm struggling to figure out what to do with bug 94, because I find it almost impossible to reason about the current event masks in TCP.

I don't see at the moment anything indicating TCP issues other than the one you addressed with your tentative debug patch at:

https://passt.top/passt/commit/?h=podman23686&id=026fb71d1dde60135d95741552906fd5320384bc

Given that, with that patch, we had at least another report of event storms, this time on UDP, that is, the one from:

https://github.com/containers/podman/issues/23686#issuecomment-2324945010

I shared this other one on top:

https://passt.top/passt/commit/?h=podman23686&id=0c6c20dee5c24bd324834a99f409ad43c50812ae

Ah, nice.

...
...
I'd really like to simplify them so it's clearer what's correct and not and I think the most obvious path to doing so is using EPOLLET all the time. That requires some sort of kick when the tap is ready to accept more data, hence this series as a prerequisite.

Sure, it's going to be simpler and more robust, but on the other hand we wouldn't notice these kind of issues.

Uh.. I'm confused. In what way would we not notice issues, other than the issues not existing which.. would be good, right?

Right now, we have a condition where we fail to handle EPOLLRDHUP before an outbound connection is established, see https://github.com/containers/podman/issues/23686#issuecomment-233023742, and we end up in a tight event processing loop. I guess what we're missing in tcp_sock_handler() is clear (we should orderly close the connection), but the tight loop didn't happen on 2024_06_24.1ee2eca (I'm bisecting right now) and we don't know why it didn't. If we set EPOLLET, we won't see that anymore, because the EPOLLRDHUP event is reported just once, but that doesn't mean we solved this. -- Stefano

Stefano Brivio

1:33 p.m.

On Thu, 5 Sep 2024 10:32:38 +0200 Stefano Brivio wrote:

...

On Thu, 5 Sep 2024 10:35:14 +1000 David Gibson wrote:

...
On Wed, Sep 04, 2024 at 07:19:22PM +0200, Stefano Brivio wrote:

...
On Wed, 4 Sep 2024 13:17:53 +1000 David Gibson wrote:

...
On Tue, Sep 03, 2024 at 09:25:54PM +0200, Stefano Brivio wrote:

...
On Tue, 3 Sep 2024 22:02:29 +1000 David Gibson wrote:

...
This is a draft patch working towards adding EPOLLOUT handling to the tap code, which could then be used to "unstick" flows which have unsent data from the socket side. For now that's just a stub, but makes what I think are some worthwhile cleanups to the tap side event handling in the meantime.

Except for the issue in 3/6 and nits elsewhere, it all makes sense and tap-side EPOLLOUT handling is definitely going to be an improvement.

I wonder if it's the right moment for this kind of series, though, in terms of future bisections, as long as we're grappling with https://github.com/containers/podman/issues/23686 and https://bugs.passt.top/show_bug.cgi?id=94. Assuming, of course, that this series doesn't fix anything.

I don't think this series will fix anything as it stands. It is, indirectly, aimed at addressing bug 94. I'm struggling to figure out what to do with bug 94, because I find it almost impossible to reason about the current event masks in TCP.

I don't see at the moment anything indicating TCP issues other than the one you addressed with your tentative debug patch at:

https://passt.top/passt/commit/?h=podman23686&id=026fb71d1dde60135d95741552906fd5320384bc

Given that, with that patch, we had at least another report of event storms, this time on UDP, that is, the one from:

https://github.com/containers/podman/issues/23686#issuecomment-2324945010

I shared this other one on top:

https://passt.top/passt/commit/?h=podman23686&id=0c6c20dee5c24bd324834a99f409ad43c50812ae

Ah, nice.

...
...
I'd really like to simplify them so it's clearer what's correct and not and I think the most obvious path to doing so is using EPOLLET all the time. That requires some sort of kick when the tap is ready to accept more data, hence this series as a prerequisite.

Sure, it's going to be simpler and more robust, but on the other hand we wouldn't notice these kind of issues.

Uh.. I'm confused. In what way would we not notice issues, other than the issues not existing which.. would be good, right?

Right now, we have a condition where we fail to handle EPOLLRDHUP before an outbound connection is established, see https://github.com/containers/podman/issues/23686#issuecomment-233023742,

Sorry, it's: https://github.com/containers/podman/issues/23686#issuecomment-2330237424

...

and we end up in a tight event processing loop.

I guess what we're missing in tcp_sock_handler() is clear (we should orderly close the connection), but the tight loop didn't happen on 2024_06_24.1ee2eca (I'm bisecting right now) and we don't know why it didn't.

If we set EPOLLET, we won't see that anymore, because the EPOLLRDHUP event is reported just once, but that doesn't mean we solved this.

-- Stefano

491

Age (days ago)

493

Last active (days ago)

List overview

Download

22 comments

2 participants

participants (2)

David Gibson
Stefano Brivio

[PATCH 0/6] RFC: Clean up tap-side event handling

tags

participants (2)