The fundamental patch here is 3/5, which is a workaround for a rather
surprising kernel behaviour we seem to be hitting. This all comes from
the investigation around https://bugs.passt.top/show_bug.cgi?id=74.

I can't hit stalls anymore and throughput finally looks good to me
(~3.5 Gbps with 208 KiB rmem_max and wmem_max), but... please test.

Stefano Brivio (5):
  tcp: Fix comment to tcp_sock_consume()
  tcp: Reset STALLED flag on ACK only, check for pending socket data
  tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag
  tcp, tap: Don't increase tap-side sequence counter for dropped frames
  passt.1: Add note about tuning rmem_max and wmem_max for throughput

 passt.1 | 33 +++++++++++++++++++++++++
 tap.c   | 10 +++++---
 tap.h   |  2 +-
 tcp.c   | 74 +++++++++++++++++++++++++++++++++++++++++++++++----------
 4 files changed, 102 insertions(+), 17 deletions(-)

-- 
2.39.2
Note that tcp_sock_consume() doesn't update ACK sequence counters
anymore.

Fixes: cc6d8286d104 ("tcp: Reset ACK_FROM_TAP_DUE flag only as needed, update timer")
Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>
---
 tcp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tcp.c b/tcp.c
index dd3142d..aa1c8c9 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2103,7 +2103,7 @@ static void tcp_conn_from_tap(struct ctx *c,
 }
 
 /**
- * tcp_sock_consume() - Consume (discard) data from buffer, update ACK sequence
+ * tcp_sock_consume() - Consume (discard) data from buffer
  * @conn:	Connection pointer
  * @ack_seq:	ACK sequence, host order
  *
-- 
2.39.2
On Sat, Sep 23, 2023 at 12:06:06AM +0200, Stefano Brivio wrote:
> Note that tcp_sock_consume() doesn't update ACK sequence counters
> anymore.
> 
> Fixes: cc6d8286d104 ("tcp: Reset ACK_FROM_TAP_DUE flag only as needed, update timer")
> Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>

Reviewed-by: David Gibson <david(a)gibson.dropbear.id.au>

> ---
>  tcp.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
[...]

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
In tcp_tap_handler(), we shouldn't reset the STALLED flag (indicating
that we ran out of tap-side window space, or that all available socket
data is already in flight -- better names welcome!) on any event: do
that only if the first packet in a batch has the ACK flag set.

Make sure we check for pending socket data when we reset it: reverting
back to level-triggered epoll events, as tcp_epoll_ctl() does, isn't
guaranteed to actually trigger a socket event.

Further, note that the flag only makes sense once a connection is
established, so move all this to the right place, which is convenient
for the next patch, as we want to check if the STALLED flag was set
before processing any new information about the window size advertised
by the tap.

Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>
---
 tcp.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/tcp.c b/tcp.c
index aa1c8c9..5528e05 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2572,8 +2572,6 @@ int tcp_tap_handler(struct ctx *c, int af, const void *saddr, const void *daddr,
 	if (th->ack && !(conn->events & ESTABLISHED))
 		tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq));
 
-	conn_flag(c, conn, ~STALLED);
-
 	/* Establishing connection from socket */
 	if (conn->events & SOCK_ACCEPTED) {
 		if (th->syn && th->ack && !th->fin) {
@@ -2631,6 +2629,11 @@ int tcp_tap_handler(struct ctx *c, int af, const void *saddr, const void *daddr,
 	if (conn->seq_ack_to_tap != conn->seq_from_tap)
 		ack_due = 1;
 
+	if ((conn->flags & STALLED) && th->ack) {
+		conn_flag(c, conn, ~STALLED);
+		tcp_data_from_sock(c, conn);
+	}
+
 	if ((conn->events & TAP_FIN_RCVD) && !(conn->events & SOCK_FIN_SENT)) {
 		shutdown(conn->sock, SHUT_WR);
 		conn_event(c, conn, SOCK_FIN_SENT);
-- 
2.39.2
I think the change itself here is sound, but I have some nits to pick
with the description and reasoning.

On Sat, Sep 23, 2023 at 12:06:07AM +0200, Stefano Brivio wrote:
> In tcp_tap_handler(), we shouldn't reset the STALLED flag (indicating
> that we ran out of tap-side window space, or that all available
> socket data is already in flight -- better names welcome!

Hmm.. when you put it like that it makes me wonder if those two quite
different conditions really need the same handling.  Hrm.. I guess
both conditions mean that we can't accept data from the socket, even
if it's available.

> ) on any
> event: do that only if the first packet in a batch has the ACK flag
> set.

"First packet in a batch" may not be accurate here - we're looking at
whichever packet we were up to before calling data_from_tap().  There
could have been earlier packets in the receive batch that were already
processed.

This also raises the question of why the first data packet should be
particularly privileged here.  I'm wondering if what we really want to
check is whether data_from_tap() advanced the ack pointer at all.

I'm not clear on when the th->ack check would ever fail in practice:
aren't the only normal packets in a TCP connection without ACK the
initial SYN or an RST?  We've handled the SYN case earlier, so should
we just have a blanket case above this that if we get a packet with
!ACK, we reset the connection?

> Make sure we check for pending socket data when we reset it:
> reverting back to level-triggered epoll events, as tcp_epoll_ctl()
> does, isn't guaranteed to actually trigger a socket event.

Which sure seems like a kernel bug.  Some weird edge conditions for
edge-triggered seem expected, but this doesn't seem like valid
level-triggered semantics.

Hmmm... is toggling EPOLLET even what we want?  IIUC, the heart of
what's going on here is that we can't take more data from the socket
until something happens on the tap side (either the window expands, or
it acks some data).  In which case should we be toggling EPOLLIN on
the socket instead?  That seems more explicitly to be saying to the
socket side "we don't currently care if you have data available".

> Further, note that the flag only makes sense once a connection is
> established, so move all this to the right place, which is convenient
> for the next patch, as we want to check if the STALLED flag was set
> before processing any new information about the window size advertised
> by the tap.
> 
> Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>
> ---
[...]

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
On Mon, 25 Sep 2023 13:07:24 +1000
David Gibson <david(a)gibson.dropbear.id.au> wrote:

> I think the change itself here is sound, but I have some nits to pick
> with the description and reasoning.
> 
> On Sat, Sep 23, 2023 at 12:06:07AM +0200, Stefano Brivio wrote:
> > In tcp_tap_handler(), we shouldn't reset the STALLED flag (indicating
> > that we ran out of tap-side window space, or that all available
> > socket data is already in flight -- better names welcome!
> 
> Hmm.. when you put it like that it makes me wonder if those two quite
> different conditions really need the same handling.  Hrm.. I guess
> both conditions mean that we can't accept data from the socket, even
> if it's available.

Right. I mean, we can also call them differently... or maybe pick a
name that reflects the outcome/handling instead of what happened.

> "First packet in a batch" may not be accurate here - we're looking at
> whichever packet we were up to before calling data_from_tap().  There
> could have been earlier packets in the receive batch that were already
> processed.

Well, it depends on what we call "batch" -- here I meant the pool of
packets (that are passed as a batch to tcp_tap_handler()). Yes, "pool"
would be more accurate.

> This also raises the question of why the first data packet should be
> particularly privileged here.

No reason other than convenience, and yes, it can be subtly wrong.

> I'm wondering if what we really want to
> check is whether data_from_tap() advanced the ack pointer at all.

Right, we probably should do that instead.

> I'm not clear on when the th->ack check would ever fail in practice:
> aren't the only normal packets in a TCP connection without ACK the
> initial SYN or an RST?  We've handled the SYN case earlier, so should
> we just have a blanket case above this that if we get a packet with
> !ACK, we reset the connection?

One thing that's legitimate (rarely seen, but I've seen it, I don't
remember if the Linux kernel ever does that) is a segment without ACK,
and without data, that just updates the window (for example after a
zero window).

If the sequence received/processed so far doesn't correspond to the
latest sequence sent, omitting the ACK flag is useful so that the
window update is not taken as a duplicate ACK (that would trigger
retransmission).

> > Make sure we check for pending socket data when we reset it:
> > reverting back to level-triggered epoll events, as tcp_epoll_ctl()
> > does, isn't guaranteed to actually trigger a socket event.
> 
> Which sure seems like a kernel bug.  Some weird edge conditions for
> edge-triggered seem expected, but this doesn't seem like valid
> level-triggered semantics.

Hmm, yes, and by doing a quick isolated test actually this seems to
work as intended in the kernel. I should drop this and try again.

> Hmmm... is toggling EPOLLET even what we want?  IIUC, the heart of
> what's going on here is that we can't take more data from the socket
> until something happens on the tap side (either the window expands, or
> it acks some data).  In which case should we be toggling EPOLLIN on
> the socket instead?  That seems more explicitly to be saying to the
> socket side "we don't currently care if you have data available".

The reason was to act on EPOLLRDHUP at the same time. But well, we
could just mask EPOLLIN and EPOLLRDHUP, then -- I guess that would make
more sense.

For the time being, we should probably try to see what happens if this
patch simply moves conn_flag(c, conn, ~STALLED); to where 3/5 needs it
(or directly incorporate that into 3/5).

-- 
Stefano
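A small illustration of the segment Stefano describes above: with no
payload and none of the ACK/SYN/RST/FIN flags set, all such a segment
can do is update the window. The helper below is hypothetical, written
only to pin the condition down; it is not passt code, and th/payload_len
are assumed caller-provided values.

#include <stdbool.h>
#include <stddef.h>
#include <netinet/tcp.h>

/* Hypothetical helper: a segment carrying no data and none of
 * ACK/SYN/RST/FIN can only be a bare window update, the rare but
 * legitimate case described in the message above.
 */
static bool is_pure_window_update(const struct tcphdr *th, size_t payload_len)
{
	return !th->ack && !th->syn && !th->rst && !th->fin && !payload_len;
}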
On Wed, Sep 27, 2023 at 07:05:33PM +0200, Stefano Brivio wrote:
> On Mon, 25 Sep 2023 13:07:24 +1000
> David Gibson <david(a)gibson.dropbear.id.au> wrote:
> > Hmm.. when you put it like that it makes me wonder if those two quite
> > different conditions really need the same handling.  Hrm.. I guess
> > both conditions mean that we can't accept data from the socket, even
> > if it's available.
> 
> Right. I mean, we can also call them differently... or maybe pick a
> name that reflects the outcome/handling instead of what happened.

Sure, if we could think of one.

Except, on second thoughts, I'm not sure my characterization is
correct.  If the tap side window is full then, indeed, we can't accept
data from the socket.  However if everything we have is in flight that
doesn't mean we couldn't put more data into flight if it arrived.

That consideration, together with the way we use MSG_PEEK possibly
means that we fundamentally need to use edge-triggered interrupts -
with the additional trickiness that entails - to avoid busy polling.
Although even that only works if we get a new edge interrupt when data
is added to a buffer that's been PEEKed but not TRUNCed.  If that's not
true the MSG_PEEK approach might be doomed :(.

> > "First packet in a batch" may not be accurate here - we're looking at
> > whichever packet we were up to before calling data_from_tap().  There
> > could have been earlier packets in the receive batch that were already
> > processed.
> 
> Well, it depends on what we call "batch" -- here I meant the pool of
> packets (that are passed as a batch to tcp_tap_handler()). Yes, "pool"
> would be more accurate.

Uh.. I don't think that actually helps.  Remember pools aren't queues.
The point here is that the packet we're considering is not the first of
the batch/pool/whatever, but the first of what's left.

> > This also raises the question of why the first data packet should be
> > particularly privileged here.
> 
> No reason other than convenience, and yes, it can be subtly wrong.
> 
> > I'm wondering if what we really want to check is whether
> > data_from_tap() advanced the ack pointer at all.
> 
> Right, we probably should do that instead.

Ok.

> > I'm not clear on when the th->ack check would ever fail in practice:
> > aren't the only normal packets in a TCP connection without ACK the
> > initial SYN or an RST?  [...]
> 
> One thing that's legitimate (rarely seen, but I've seen it, I don't
> remember if the Linux kernel ever does that) is a segment without ACK,
> and without data, that just updates the window (for example after a
> zero window).
> 
> If the sequence received/processed so far doesn't correspond to the
> latest sequence sent, omitting the ACK flag is useful so that the
> window update is not taken as a duplicate ACK (that would trigger
> retransmission).

Ah, ok, I wasn't aware of that case.

> > Hmmm... is toggling EPOLLET even what we want?  [...]  In which case
> > should we be toggling EPOLLIN on the socket instead?
> 
> The reason was to act on EPOLLRDHUP at the same time. But well, we
> could just mask EPOLLIN and EPOLLRDHUP, then -- I guess that would make
> more sense.

So specifically to mask EPOLLRDHUP as well?  On the grounds that if
we're still chewing on what we got already we don't yet care that we've
reached the end, yes?

So yes, explicitly masking both those makes more sense to me.. except
that as above, I suspect we can't have level-triggered + MSG_PEEK + no
busy polling all at once.

> For the time being, we should probably try to see what happens if this
> patch simply moves conn_flag(c, conn, ~STALLED); to where 3/5 needs it
> (or directly incorporate that into 3/5).

Ok.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
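For context, the MSG_PEEK approach referred to above is the pattern of
peeking data off the socket so it can be forwarded towards the guest
while still sitting in the kernel's receive queue, then discarding it
only once the guest acknowledges it. A self-contained sketch of that
general pattern follows; the function names are placeholders and error
handling is simplified, so this is not passt's actual
tcp_data_from_sock()/tcp_sock_consume() code.

#include <sys/types.h>
#include <sys/socket.h>

/* Peek up to 'len' bytes without consuming them from the socket buffer,
 * so they remain available for retransmission towards the guest.
 */
static ssize_t peek_pending(int s, void *buf, size_t len)
{
	return recv(s, buf, len, MSG_PEEK | MSG_DONTWAIT);
}

/* Discard 'acked' bytes once the guest has acknowledged them: with
 * MSG_TRUNC on a TCP socket the kernel drops the data without copying
 * it out, so no buffer is needed.
 */
static ssize_t discard_acked(int s, size_t acked)
{
	return recv(s, NULL, acked, MSG_TRUNC | MSG_DONTWAIT);
}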
On Thu, 28 Sep 2023 11:48:38 +1000
David Gibson <david(a)gibson.dropbear.id.au> wrote:

> On Wed, Sep 27, 2023 at 07:05:33PM +0200, Stefano Brivio wrote:
> > Right. I mean, we can also call them differently... or maybe pick a
> > name that reflects the outcome/handling instead of what happened.
> 
> Sure, if we could think of one.
> 
> Except, on second thoughts, I'm not sure my characterization is
> correct.  If the tap side window is full then, indeed, we can't accept
> data from the socket.  However if everything we have is in flight that
> doesn't mean we couldn't put more data into flight if it arrived.

Right, but that's why we set EPOLLET...

> That consideration, together with the way we use MSG_PEEK possibly
> means that we fundamentally need to use edge-triggered interrupts -
> with the additional trickiness that entails - to avoid busy polling.
> Although even that only works if we get a new edge interrupt when data
> is added to a buffer that's been PEEKed but not TRUNCed.  If that's not
> true the MSG_PEEK approach might be doomed :(.

...without EPOLLONESHOT, which wouldn't have this behaviour. From
epoll(7):

  Since even with edge-triggered epoll, multiple events can be
  generated upon receipt of multiple chunks of data, the caller has
  the option to specify the EPOLLONESHOT flag [...]

so yes, in general, when the socket has more data, we'll get another
event. I didn't test this in an isolated case, perhaps we should, but
from my memory it always worked.

On the other hand, we could actually use EPOLLONESHOT in the other
case, as an optimisation, when we're waiting for an ACK from the tap
side.

> > Well, it depends on what we call "batch" -- here I meant the pool of
> > packets (that are passed as a batch to tcp_tap_handler()). Yes,
> > "pool" would be more accurate.
> 
> Uh.. I don't think that actually helps.  Remember pools aren't queues.
> The point here is that the packet we're considering is not the first
> of the batch/pool/whatever, but the first of what's left.

Right, yes, I actually meant the sub-pool starting from the index (now)
given by the caller.

On a second thought, in that case, we just got a window update, so it's
very reasonable to actually check again if we can send more. Hence the
check on th->ack is bogus anyway.

> > One thing that's legitimate (rarely seen, but I've seen it, I don't
> > remember if the Linux kernel ever does that) is a segment without
> > ACK, and without data, that just updates the window (for example
> > after a zero window).
> [...]
> 
> Ah, ok, I wasn't aware of that case.

Ok.

> So specifically to mask EPOLLRDHUP as well?  On the grounds that if
> we're still chewing on what we got already we don't yet care that
> we've reached the end, yes?

Right.

> So yes, explicitly masking both those makes more sense to me.. except
> that as above, I suspect we can't have level-triggered + MSG_PEEK + no
> busy polling all at once.

Hmm, right, that's the other problem if we mask EPOLLIN: we won't get
events on new data.

I think EPOLLET is really what we need here, at least for the case
where we are not necessarily waiting for an ACK. For the other case
(window full), we can either mask EPOLLIN | EPOLLRDHUP or set
EPOLLONESHOT (possibly slightly more complicated because we need to
re-arm the file descriptor).

-- 
Stefano
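To make the two options Stefano names for the window-full case concrete:
with EPOLLONESHOT the kernel delivers one event and then leaves the
descriptor registered but disarmed, so re-arming is another
EPOLL_CTL_MOD rather than a fresh EPOLL_CTL_ADD. A minimal sketch, with
placeholder descriptor names (epollfd, sockfd) and no error handling;
it illustrates the alternatives discussed, not what passt ended up doing.

#include <sys/epoll.h>

/* Option 1: keep EPOLLET set, but drop read-side interest until the
 * tap-side window reopens; restore it later with another MOD call.
 */
static int sock_mask_input(int epollfd, int sockfd)
{
	struct epoll_event ev = { .events = EPOLLET, .data.fd = sockfd };

	return epoll_ctl(epollfd, EPOLL_CTL_MOD, sockfd, &ev);
}

/* Option 2: request exactly one more event with EPOLLONESHOT; once it
 * fires, the fd stays in the interest list but disarmed, and needs
 * re-arming with EPOLL_CTL_MOD.
 */
static int sock_rearm_oneshot(int epollfd, int sockfd)
{
	struct epoll_event ev = {
		.events = EPOLLIN | EPOLLRDHUP | EPOLLONESHOT,
		.data.fd = sockfd,
	};

	return epoll_ctl(epollfd, EPOLL_CTL_MOD, sockfd, &ev);
}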
On Fri, Sep 29, 2023 at 05:20:15PM +0200, Stefano Brivio wrote:
> On Thu, 28 Sep 2023 11:48:38 +1000
> David Gibson <david(a)gibson.dropbear.id.au> wrote:
> > That consideration, together with the way we use MSG_PEEK possibly
> > means that we fundamentally need to use edge-triggered interrupts -
> > with the additional trickiness that entails - to avoid busy polling.
> > Although even that only works if we get a new edge interrupt when
> > data is added to a buffer that's been PEEKed but not TRUNCed.  If
> > that's not true the MSG_PEEK approach might be doomed :(.
> 
> ...without EPOLLONESHOT, which wouldn't have this behaviour. From
> epoll(7):
> 
>   Since even with edge-triggered epoll, multiple events can be
>   generated upon receipt of multiple chunks of data, the caller has
>   the option to specify the EPOLLONESHOT flag [...]
> 
> so yes, in general, when the socket has more data, we'll get another
> event. I didn't test this in an isolated case, perhaps we should, but
> from my memory it always worked.

Ok.  That text does seem to suggest it works that way, although it's
not entirely clear that it must always give new events.

> On the other hand, we could actually use EPOLLONESHOT in the other
> case, as an optimisation, when we're waiting for an ACK from the tap
> side.

Hrm.. I can't actually see a case where EPOLLONESHOT would be useful.
By the time we know the receiver's window has been filled, we're
already processing the last event that we'll be able to until the
window opens again.  Setting EPOLLONESHOT would be requesting one more
event.

> Right, yes, I actually meant the sub-pool starting from the index
> (now) given by the caller.

Ok.

> On a second thought, in that case, we just got a window update, so
> it's very reasonable to actually check again if we can send more.
> Hence the check on th->ack is bogus anyway.

Ok.

> Hmm, right, that's the other problem if we mask EPOLLIN: we won't get
> events on new data.
> 
> I think EPOLLET is really what we need here, at least for the case
> where we are not necessarily waiting for an ACK. For the other case
> (window full), we can either mask EPOLLIN | EPOLLRDHUP or set
> EPOLLONESHOT (possibly slightly more complicated because we need to
> re-arm the file descriptor).

So, thinking further, what I think we want is to always set EPOLLET on
the TCP sockets.

If all received data is in flight we don't need anything special, at
least assuming epoll works like we think it does above: when we get
more data, we get an event and check if we can send more data into
flight.

When the receive window fills we really don't care about new data until
it opens again, so clear EPOLLIN | EPOLLRDHUP.  When the window does
open again - i.e. when we get an ack or window update - both reenable
EPOLLIN | EPOLLRDHUP and call sock_handler() to process anything that's
accumulated since we turned EPOLLIN off.

Details to figure out:

 * Do we need to be careful about order of re-enable EPOLLIN
   vs. rechecking the recv() buffer?

 * At what point would we trigger the CLAMP_WINDOW workaround in that
   scheme?

 * Is there any impact of EPOLLET on the other events?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
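A rough outline, in code form, of the scheme David sketches above:
EPOLLET stays set for the lifetime of the socket, read interest is
dropped while the tap-side window is full, and is restored (followed by
one pass of the socket handler) when an ACK or window update reopens
it. Every identifier here is a placeholder standing in for passt's real
connection tracking; this is an illustration of the idea, not the
implementation that was eventually adopted.

#include <stdbool.h>
#include <sys/epoll.h>

struct conn {
	int sock;
	bool window_full;	/* tap-side receive window exhausted */
};

/* EPOLLET is kept for the whole life of the socket; only read-side
 * interest is toggled.
 */
static void set_read_interest(int epollfd, struct conn *c, bool on)
{
	struct epoll_event ev = { .events = EPOLLET, .data.ptr = c };

	if (on)
		ev.events |= EPOLLIN | EPOLLRDHUP;

	epoll_ctl(epollfd, EPOLL_CTL_MOD, c->sock, &ev);
}

/* Socket readable (edge-triggered): push what the window allows, then
 * stop listening for new data if the window filled up.
 */
static void on_sock_readable(int epollfd, struct conn *c)
{
	/* ...peek pending data and send towards the tap... */

	if (c->window_full)
		set_read_interest(epollfd, c, false);
}

/* ACK or window update arrived from the tap side: listen again and run
 * the socket handler once for anything that queued up meanwhile.
 */
static void on_tap_ack_or_window_update(int epollfd, struct conn *c)
{
	c->window_full = false;
	set_read_interest(epollfd, c, true);
	on_sock_readable(epollfd, c);
}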
On Tue, 3 Oct 2023 14:20:58 +1100
David Gibson <david(a)gibson.dropbear.id.au> wrote:

> > so yes, in general, when the socket has more data, we'll get another
> > event. I didn't test this in an isolated case, perhaps we should,
> > but from my memory it always worked.
> 
> Ok.  That text does seem to suggest it works that way, although it's
> not entirely clear that it must always give new events.

A glimpse at the code confirms that, but... yes, I think we should test
this more specifically, perhaps even shipping that test case under
doc/.

> > On the other hand, we could actually use EPOLLONESHOT in the other
> > case, as an optimisation, when we're waiting for an ACK from the tap
> > side.
> 
> Hrm.. I can't actually see a case where EPOLLONESHOT would be useful.
> By the time we know the receiver's window has been filled, we're
> already processing the last event that we'll be able to until the
> window opens again.  Setting EPOLLONESHOT would be requesting one more
> event.

Ah, true -- we should have it "always" set and always re-arm, which is
messy and would probably kill any semblance of high throughput.

> So, thinking further, what I think we want is to always set EPOLLET on
> the TCP sockets.
> 
> If all received data is in flight we don't need anything special, at
> least assuming epoll works like we think it does above: when we get
> more data, we get an event and check if we can send more data into
> flight.

I think having EPOLLET always set causes races that are potentially
unsolvable, because we can't always read from the socket until we get
-EAGAIN. That is, this will work:

- recv() all data, we can't write more data to tap
- at some point, we can write again to tap
- more data comes, we'll wake up and continue

but this won't:

- partial recv(), we can't write more data to tap
- at some point, we can write again to tap
- no additional data comes

> When the receive window fills we really don't care about new data
> until it opens again, so clear EPOLLIN | EPOLLRDHUP.  When the window
> does open again - i.e. when we get an ack or window update - both
> reenable EPOLLIN | EPOLLRDHUP and call sock_handler() to process
> anything that's accumulated since we turned EPOLLIN off.

Agreed. This should be a mere optimisation on top of the current
behaviour, by the way.

> Details to figure out:
> 
>  * Do we need to be careful about order of re-enable EPOLLIN
>    vs. rechecking the recv() buffer?

It should be done before checking, I guess, so that we can end up with
one spurious event (because we already read the data a future EPOLLIN
will tell us about), but never a missing event (because data comes
between recv() and epoll_ctl()).

>  * At what point would we trigger the CLAMP_WINDOW workaround in that
>    scheme?

When we read any data from the socket, with MSG_TRUNC, after the window
full condition.

>  * Is there any impact of EPOLLET on the other events?

Not that I'm aware of. EPOLLHUP and EPOLLERR are reported anyway, and
we don't want EPOLLRDHUP to differ (in this sense) from EPOLLIN. But
again, this is under the assumption that we do *not* always set
EPOLLET.

-- 
Stefano
On Thu, Oct 05, 2023 at 08:18:49AM +0200, Stefano Brivio wrote:
> A glimpse at the code confirms that, but... yes, I think we should
> test this more specifically, perhaps even shipping that test case
> under doc/.

Seems wise.

> Ah, true -- we should have it "always" set and always re-arm, which is
> messy and would probably kill any semblance of high throughput.
> 
> I think having EPOLLET always set causes races that are potentially
> unsolvable, because we can't always read from the socket until we get
> -EAGAIN.

Hrm.. I'm a little sceptical that we'd have unsolvable races when
always using EPOLLET that we don't already have using it sometimes
(though maybe with more subtle and complex triggering conditions).  I
feel like it's easier to reason through to avoid races if we stay in a
single trigger mode.

> That is, this will work:
> 
> - recv() all data, we can't write more data to tap
> - at some point, we can write again to tap
> - more data comes, we'll wake up and continue
> 
> but this won't:
> 
> - partial recv(), we can't write more data to tap
> - at some point, we can write again to tap
> - no additional data comes

Well.. with the other change I'm suggesting here, once the data we did
send gets ACKed, we'll recheck the incoming socket and send more.  It
will delay that remainder data a bit, but not by _that_ much.  Since in
this case we're not getting more data from the socket, that shouldn't
even be all that harmful to throughput.

That said, it's not ideal.  We could address that by enabling an
EPOLLOUT (possibly ONESHOT) event on the tap socket if we get a partial
send.  When that event is triggered, we'd scan through connections for
any unsent data.

> Agreed. This should be a mere optimisation on top of the current
> behaviour, by the way.

Well, it becomes an essential change if we always enable EPOLLET.

> >  * Do we need to be careful about order of re-enable EPOLLIN
> >    vs. rechecking the recv() buffer?
> 
> It should be done before checking, I guess, so that we can end up with
> one spurious event (because we already read the data a future EPOLLIN
> will tell us about), but never a missing event (because data comes
> between recv() and epoll_ctl()).

Yes, I think that's right.

> >  * At what point would we trigger the CLAMP_WINDOW workaround in
> >    that scheme?
> 
> When we read any data from the socket, with MSG_TRUNC, after the
> window full condition.

Well, yes, but what's the point at which we flag that the window full
condition has occurred?  Since that's occurring on the read side
buffer, we're not directly aware of it.

> >  * Is there any impact of EPOLLET on the other events?
> 
> Not that I'm aware of. EPOLLHUP and EPOLLERR are reported anyway, and
> we don't want EPOLLRDHUP to differ (in this sense) from EPOLLIN. But
> again, this is under the assumption that we do *not* always set
> EPOLLET.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
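David's partial-send idea above maps onto a standard epoll pattern: if a
write to the tap descriptor comes up short, ask for a single EPOLLOUT
wakeup and deal with the leftover data when it arrives. A rough,
self-contained sketch with placeholder names (tapfd, epollfd); it makes
no claim about how passt's tap path actually handles this.

#include <stddef.h>
#include <sys/epoll.h>
#include <sys/socket.h>

/* If sending a frame to the tap fd comes up short, request one EPOLLOUT
 * event; the handler for that event would then scan connections for any
 * unsent data and re-arm as needed.
 */
static void tap_send_frame(int epollfd, int tapfd, const void *frame, size_t len)
{
	ssize_t sent = send(tapfd, frame, len, MSG_NOSIGNAL | MSG_DONTWAIT);

	if (sent < 0 || (size_t)sent < len) {
		struct epoll_event ev = {
			.events = EPOLLOUT | EPOLLONESHOT,
			.data.fd = tapfd,
		};

		epoll_ctl(epollfd, EPOLL_CTL_MOD, tapfd, &ev);
	}
}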
It looks like we need it as a workaround for this situation, readily
reproducible at least with a 6.5 Linux kernel, with default rmem_max
and wmem_max values:

- an iperf3 client on the host sends about 160 KiB, typically
  segmented into five frames by passt. We read this data using
  MSG_PEEK

- the iperf3 server on the guest starts receiving

- meanwhile, the host kernel advertised a zero-sized window to the
  receiver, as expected

- eventually, the guest acknowledges all the data sent so far, and
  we drop it from the buffer, courtesy of tcp_sock_consume(), using
  recv() with MSG_TRUNC

- the client, however, doesn't get an updated window value, and even
  keepalive packets are answered with zero-window segments, until the
  connection is closed

It looks like dropping data from a socket using MSG_TRUNC doesn't
cause a recalculation of the window, which would be expected as a
result of any receiving operation that invalidates data on a buffer
(that is, not with MSG_PEEK).

Strangely enough, setting TCP_WINDOW_CLAMP via setsockopt(), even to
the previous value we clamped to, forces a recalculation of the window
which is advertised to the guest.

I couldn't quite confirm this issue by following all the possible code
paths in the kernel, yet. If confirmed, this should be fixed in the
kernel, but meanwhile this workaround looks robust to me (and it will
be needed for backward compatibility anyway).

Reported-by: Matej Hrica <mhrica(a)redhat.com>
Link: https://bugs.passt.top/show_bug.cgi?id=74
Analysed-by: David Gibson <david(a)gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>
---
 tcp.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/tcp.c b/tcp.c
index 5528e05..4606f17 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1780,7 +1780,23 @@ static void tcp_clamp_window(const struct ctx *c, struct tcp_tap_conn *conn,
 	wnd <<= conn->ws_from_tap;
 	wnd = MIN(MAX_WINDOW, wnd);
 
-	if (conn->flags & WND_CLAMPED) {
+	/* TODO: With (at least) Linux kernel versions 6.1 to 6.5, if we end up
+	 * with a zero-sized window on a TCP socket, dropping data (once
+	 * acknowledged by the guest) with recv() and MSG_TRUNC doesn't appear
+	 * to be enough to make the kernel advertise a non-zero window to the
+	 * receiver. Forcing a TCP_WINDOW_CLAMP setting, even with the existing
+	 * value, fixes this.
+	 *
+	 * The STALLED flag on a connection is a sufficient indication that we
+	 * might have a zero-sized window on the socket, because it's set if we
+	 * exhausted the tap-side window, or if everything we receive from a
+	 * socket is already in flight to the guest.
+	 *
+	 * So, if STALLED is set, and we received a window value from the tap,
+	 * force a TCP_WINDOW_CLAMP setsockopt(). This should be investigated
+	 * further and fixed in the kernel instead, if confirmed.
+	 */
+	if (!(conn->flags & STALLED) && conn->flags & WND_CLAMPED) {
 		if (prev_scaled == wnd)
 			return;
 
@@ -2409,12 +2425,12 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
 		i = keep - 1;
 	}
 
-	tcp_clamp_window(c, conn, max_ack_seq_wnd);
-
 	/* On socket flush failure, pretend there was no ACK, try again later */
 	if (ack && !tcp_sock_consume(conn, max_ack_seq))
 		tcp_update_seqack_from_tap(c, conn, max_ack_seq);
 
+	tcp_clamp_window(c, conn, max_ack_seq_wnd);
+
 	if (retr) {
 		trace("TCP: fast re-transmit, ACK: %u, previous sequence: %u",
 		      max_ack_seq, conn->seq_to_tap);
-- 
2.39.2
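For anyone wanting to poke at the kernel behaviour this patch works
around in isolation, the sequence boils down to the pair of calls below
on a connected TCP socket. This is a reproduction aid with
caller-supplied values (fd, acked, clamp), not passt code.

#include <stddef.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* After dropping guest-acknowledged data with MSG_TRUNC, re-apply
 * TCP_WINDOW_CLAMP (even with the unchanged value) to nudge the kernel
 * into re-advertising a non-zero window.
 */
static void consume_and_reclamp(int fd, size_t acked, int clamp)
{
	/* drop acknowledged data without copying it out */
	recv(fd, NULL, acked, MSG_DONTWAIT | MSG_TRUNC);

	/* with (at least) kernels 6.1 to 6.5, this extra setsockopt()
	 * appears to be needed to force a window recalculation
	 */
	setsockopt(fd, IPPROTO_TCP, TCP_WINDOW_CLAMP, &clamp, sizeof(clamp));
}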
Oops,

On Sat, 23 Sep 2023 00:06:08 +0200
Stefano Brivio <sbrivio(a)redhat.com> wrote:

> It looks like we need it as a workaround for this situation, readily
> reproducible at least with a 6.5 Linux kernel, with default rmem_max
> and wmem_max values:
> 
> - an iperf3 client on the host sends about 160 KiB, typically
>   segmented into five frames by passt. We read this data using
>   MSG_PEEK
> 
> - the iperf3 server on the guest starts receiving
> 
> - meanwhile, the host kernel advertised a zero-sized window to the
>   receiver, as expected

                ^^^ sender

> - eventually, the guest acknowledges all the data sent so far, and
>   we drop it from the buffer, courtesy of tcp_sock_consume(), using
>   recv() with MSG_TRUNC
> 
> - the client, however, doesn't get an updated window value, and even
>   keepalive packets are answered with zero-window segments, until the
>   connection is closed
> 
> It looks like dropping data from a socket using MSG_TRUNC doesn't
> cause a recalculation of the window, which would be expected as a
> result of any receiving operation that invalidates data on a buffer
> (that is, not with MSG_PEEK).
> 
> Strangely enough, setting TCP_WINDOW_CLAMP via setsockopt(), even to
> the previous value we clamped to, forces a recalculation of the window
> which is advertised to the guest.

                                ^^^ sender

-- 
Stefano
On Sat, Sep 23, 2023 at 12:06:08AM +0200, Stefano Brivio wrote:
> It looks like we need it as a workaround for this situation, readily
> reproducible at least with a 6.5 Linux kernel, with default rmem_max
> and wmem_max values:
> 
> - an iperf3 client on the host sends about 160 KiB, typically
>   segmented into five frames by passt. We read this data using
>   MSG_PEEK
> 
> - the iperf3 server on the guest starts receiving
> 
> - meanwhile, the host kernel advertised a zero-sized window to the
>   receiver, as expected

You noted the s/receiver/sender/ here..

> - eventually, the guest acknowledges all the data sent so far, and
>   we drop it from the buffer, courtesy of tcp_sock_consume(), using
>   recv() with MSG_TRUNC
> 
> - the client, however, doesn't get an updated window value, and even
>   keepalive packets are answered with zero-window segments, until the
>   connection is closed
> 
> It looks like dropping data from a socket using MSG_TRUNC doesn't
> cause a recalculation of the window, which would be expected as a
> result of any receiving operation that invalidates data on a buffer
> (that is, not with MSG_PEEK).
> 
> Strangely enough, setting TCP_WINDOW_CLAMP via setsockopt(), even to
> the previous value we clamped to, forces a recalculation of the window
> which is advertised to the guest.

..and the s/guest/sender/ here..

> I couldn't quite confirm this issue by following all the possible code
> paths in the kernel, yet. If confirmed, this should be fixed in the
> kernel, but meanwhile this workaround looks robust to me (and it will
> be needed for backward compatibility anyway).
> 
> Reported-by: Matej Hrica <mhrica(a)redhat.com>
> Link: https://bugs.passt.top/show_bug.cgi?id=74
> Analysed-by: David Gibson <david(a)gibson.dropbear.id.au>
> Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>
> ---
[...]
> +	/* TODO: With (at least) Linux kernel versions 6.1 to 6.5, if we end up
> +	 * with a zero-sized window on a TCP socket, dropping data (once
> +	 * acknowledged by the guest) with recv() and MSG_TRUNC doesn't appear
> +	 * to be enough to make the kernel advertise a non-zero window to the
> +	 * receiver. Forcing a TCP_WINDOW_CLAMP setting, even with the existing

.. but you need another s/receiver/sender/ here too.

> +	 * value, fixes this.
[...]

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
On Sat, Sep 23, 2023 at 12:06:08AM +0200, Stefano Brivio wrote:It looks like we need it as workaround for this situation, readily reproducible at least with a 6.5 Linux kernel, with default rmem_max and wmem_max values: - an iperf3 client on the host sends about 160 KiB, typically segmented into five frames by passt. We read this data using MSG_PEEK - the iperf3 server on the guest starts receiving - meanwhile, the host kernel advertised a zero-sized window to the receiver, as expected - eventually, the guest acknowledges all the data sent so far, and we drop it from the buffer, courtesy of tcp_sock_consume(), using recv() with MSG_TRUNC - the client, however, doesn't get an updated window value, and even keepalive packets are answered with zero-window segments, until the connection is closed It looks like dropping data from a socket using MSG_TRUNC doesn't cause a recalculation of the window, which would be expected as a result of any receiving operation that invalidates data on a buffer (that is, not with MSG_PEEK). Strangely enough, setting TCP_WINDOW_CLAMP via setsockopt(), even to the previous value we clamped to, forces a recalculation of the window which is advertised to the guest. I couldn't quite confirm this issue by following all the possible code paths in the kernel, yet. If confirmed, this should be fixed in the kernel, but meanwhile this workaround looks robust to me (and it will be needed for backward compatibility anyway).So, I tested this, and things got a bit complicated. First, I reproduced the "read side" problem by setting net.core.rmem_max to 256kiB while setting net.core.wmem_max to 16MiB. The "160kiB" stall happened almost every time. Applying this patch appears to fix it completely, getting GiB/s throughput consistently. So, yah. Then I tried reproducing it differently: by setting both net.core.rmem_max and net.core.wmem_max to 16MiB, but setting SO_RCVBUF to 128kiB explicitly in tcp_sock_set_bufsize() (which actually results in a 256kiB buffer, because of the kernel's weird interpretation). With the SO_RCVBUF clamp and without this patch, I don't get the 160kiB stall consistently any more. What I *do* get is nearly every time - but not *every* time - is slow transfers, ~40Mbps vs. ~12Gbps. Sometimes it stalls after several seconds. The stall is slightly different from the 160kiB stall though: the 160kiB stall seems 0 bytes transferred on both sides. With the RCVBUF stall I get a trickle of bytes (620 bytes/s) on the receiver/guest side, with mostly 0 bytes per interval on the sender but occasionally an interval with several hundred KB. That is it seems like there's a buffer somewhere that's very slowly draining into the receiver, then getting topped up in an instant once it gets low enough. When I have both this patch and the RCVBUF clamp, I don't seem to be able to reproduce the trickle-stall anymore, but I still get the slow transfer speeds most, but not every time. Sometimes, but only rarely, I do seem to still get a complete stall (0 bytes on both sides). So it looks like there are two different things going on here. It looks like this patch fixes at least something, but there might be some more things to investigate afterwards. 
On that basis: Tested-by: David Gibson <david(a)gibson.dropbear.id.au> Reviewed-by: David Gibson <david(a)gibson.dropbear.id.au>Reported-by: Matej Hrica <mhrica(a)redhat.com> Link: https://bugs.passt.top/show_bug.cgi?id=74 Analysed-by: David Gibson <david(a)gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> --- tcp.c | 22 +++++++++++++++++++--- 1 file changed, 19 insertions(+), 3 deletions(-) diff --git a/tcp.c b/tcp.c index 5528e05..4606f17 100644 --- a/tcp.c +++ b/tcp.c @@ -1780,7 +1780,23 @@ static void tcp_clamp_window(const struct ctx *c, struct tcp_tap_conn *conn, wnd <<= conn->ws_from_tap; wnd = MIN(MAX_WINDOW, wnd); - if (conn->flags & WND_CLAMPED) { + /* TODO: With (at least) Linux kernel versions 6.1 to 6.5, if we end up + * with a zero-sized window on a TCP socket, dropping data (once + * acknowledged by the guest) with recv() and MSG_TRUNC doesn't appear + * to be enough to make the kernel advertise a non-zero window to the + * receiver. Forcing a TCP_WINDOW_CLAMP setting, even with the existing + * value, fixes this. + * + * The STALLED flag on a connection is a sufficient indication that we + * might have a zero-sized window on the socket, because it's set if we + * exhausted the tap-side window, or if everything we receive from a + * socket is already in flight to the guest. + * + * So, if STALLED is set, and we received a window value from the tap, + * force a TCP_WINDOW_CLAMP setsockopt(). This should be investigated + * further and fixed in the kernel instead, if confirmed. + */ + if (!(conn->flags & STALLED) && conn->flags & WND_CLAMPED) { if (prev_scaled == wnd) return; @@ -2409,12 +2425,12 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn, i = keep - 1; } - tcp_clamp_window(c, conn, max_ack_seq_wnd); - /* On socket flush failure, pretend there was no ACK, try again later */ if (ack && !tcp_sock_consume(conn, max_ack_seq)) tcp_update_seqack_from_tap(c, conn, max_ack_seq); + tcp_clamp_window(c, conn, max_ack_seq_wnd); + if (retr) { trace("TCP: fast re-transmit, ACK: %u, previous sequence: %u", max_ack_seq, conn->seq_to_tap);-- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
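An aside on the SO_RCVBUF detail mentioned in the test report above: the kernel doubles the value passed to setsockopt() to leave room for bookkeeping overhead (see socket(7)), which is why a 128 KiB request behaves as a 256 KiB buffer. A minimal sketch, not passt code:

#include <stdio.h>
#include <sys/socket.h>

static void show_rcvbuf_doubling(int s)
{
	int v = 128 * 1024;
	socklen_t vl = sizeof(v);

	setsockopt(s, SOL_SOCKET, SO_RCVBUF, &v, sizeof(v));

	/* The kernel stores (and reports back) twice the requested value,
	 * so this typically prints 262144 */
	if (!getsockopt(s, SOL_SOCKET, SO_RCVBUF, &v, &vl))
		printf("SO_RCVBUF: %d\n", v);
}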
On Mon, Sep 25, 2023 at 02:09:41PM +1000, David Gibson wrote:On Sat, Sep 23, 2023 at 12:06:08AM +0200, Stefano Brivio wrote:With the exception of the previously noted s/receiver/sender/ fixes, of course. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibsonIt looks like we need it as workaround for this situation, readily reproducible at least with a 6.5 Linux kernel, with default rmem_max and wmem_max values: - an iperf3 client on the host sends about 160 KiB, typically segmented into five frames by passt. We read this data using MSG_PEEK - the iperf3 server on the guest starts receiving - meanwhile, the host kernel advertised a zero-sized window to the receiver, as expected - eventually, the guest acknowledges all the data sent so far, and we drop it from the buffer, courtesy of tcp_sock_consume(), using recv() with MSG_TRUNC - the client, however, doesn't get an updated window value, and even keepalive packets are answered with zero-window segments, until the connection is closed It looks like dropping data from a socket using MSG_TRUNC doesn't cause a recalculation of the window, which would be expected as a result of any receiving operation that invalidates data on a buffer (that is, not with MSG_PEEK). Strangely enough, setting TCP_WINDOW_CLAMP via setsockopt(), even to the previous value we clamped to, forces a recalculation of the window which is advertised to the guest. I couldn't quite confirm this issue by following all the possible code paths in the kernel, yet. If confirmed, this should be fixed in the kernel, but meanwhile this workaround looks robust to me (and it will be needed for backward compatibility anyway).So, I tested this, and things got a bit complicated. First, I reproduced the "read side" problem by setting net.core.rmem_max to 256kiB while setting net.core.wmem_max to 16MiB. The "160kiB" stall happened almost every time. Applying this patch appears to fix it completely, getting GiB/s throughput consistently. So, yah. Then I tried reproducing it differently: by setting both net.core.rmem_max and net.core.wmem_max to 16MiB, but setting SO_RCVBUF to 128kiB explicitly in tcp_sock_set_bufsize() (which actually results in a 256kiB buffer, because of the kernel's weird interpretation). With the SO_RCVBUF clamp and without this patch, I don't get the 160kiB stall consistently any more. What I *do* get is nearly every time - but not *every* time - is slow transfers, ~40Mbps vs. ~12Gbps. Sometimes it stalls after several seconds. The stall is slightly different from the 160kiB stall though: the 160kiB stall seems 0 bytes transferred on both sides. With the RCVBUF stall I get a trickle of bytes (620 bytes/s) on the receiver/guest side, with mostly 0 bytes per interval on the sender but occasionally an interval with several hundred KB. That is it seems like there's a buffer somewhere that's very slowly draining into the receiver, then getting topped up in an instant once it gets low enough. When I have both this patch and the RCVBUF clamp, I don't seem to be able to reproduce the trickle-stall anymore, but I still get the slow transfer speeds most, but not every time. Sometimes, but only rarely, I do seem to still get a complete stall (0 bytes on both sides). So it looks like there are two different things going on here. It looks like this patch fixes at least something, but there might be some more things to investigate afterwards. 
On Mon, Sep 25, 2023 at 02:09:41PM +1000, David Gibson wrote:On Sat, Sep 23, 2023 at 12:06:08AM +0200, Stefano Brivio wrote:I noted another oddity. With this patch, _no_ RCVBUF clamp and 16MiB wmem_max fixed, things seem to behave much better with a small rmem_max than large. With rmem_max=256KiB I get pretty consistent 37Gbps throughput and iperf3 -c reports 0 retransmits. With rmem_max=16MiB, the throughput fluctuates from second to second between ~3Gbps and ~30Gbps. The client reports retransmits in some intervals, which is pretty weird over lo. Urgh... so many variables. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibsonIt looks like we need it as workaround for this situation, readily reproducible at least with a 6.5 Linux kernel, with default rmem_max and wmem_max values: - an iperf3 client on the host sends about 160 KiB, typically segmented into five frames by passt. We read this data using MSG_PEEK - the iperf3 server on the guest starts receiving - meanwhile, the host kernel advertised a zero-sized window to the receiver, as expected - eventually, the guest acknowledges all the data sent so far, and we drop it from the buffer, courtesy of tcp_sock_consume(), using recv() with MSG_TRUNC - the client, however, doesn't get an updated window value, and even keepalive packets are answered with zero-window segments, until the connection is closed It looks like dropping data from a socket using MSG_TRUNC doesn't cause a recalculation of the window, which would be expected as a result of any receiving operation that invalidates data on a buffer (that is, not with MSG_PEEK). Strangely enough, setting TCP_WINDOW_CLAMP via setsockopt(), even to the previous value we clamped to, forces a recalculation of the window which is advertised to the guest. I couldn't quite confirm this issue by following all the possible code paths in the kernel, yet. If confirmed, this should be fixed in the kernel, but meanwhile this workaround looks robust to me (and it will be needed for backward compatibility anyway).So, I tested this, and things got a bit complicated. First, I reproduced the "read side" problem by setting net.core.rmem_max to 256kiB while setting net.core.wmem_max to 16MiB. The "160kiB" stall happened almost every time. Applying this patch appears to fix it completely, getting GiB/s throughput consistently. So, yah. Then I tried reproducing it differently: by setting both net.core.rmem_max and net.core.wmem_max to 16MiB, but setting SO_RCVBUF to 128kiB explicitly in tcp_sock_set_bufsize() (which actually results in a 256kiB buffer, because of the kernel's weird interpretation). With the SO_RCVBUF clamp and without this patch, I don't get the 160kiB stall consistently any more. What I *do* get is nearly every time - but not *every* time - is slow transfers, ~40Mbps vs. ~12Gbps. Sometimes it stalls after several seconds. The stall is slightly different from the 160kiB stall though: the 160kiB stall seems 0 bytes transferred on both sides. With the RCVBUF stall I get a trickle of bytes (620 bytes/s) on the receiver/guest side, with mostly 0 bytes per interval on the sender but occasionally an interval with several hundred KB. That is it seems like there's a buffer somewhere that's very slowly draining into the receiver, then getting topped up in an instant once it gets low enough. 
On Mon, 25 Sep 2023 14:21:47 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:On Mon, Sep 25, 2023 at 02:09:41PM +1000, David Gibson wrote:This is probably due to the receive buffer getting bigger than TCP_FRAMES_MEM * MSS4 (or MSS6), so the amount of data we can read in one shot from the sockets isn't optimally sized anymore. We should have a look at the difference between not clamping at all (and if that yields the same throughput, great), and clamping to, I guess, TCP_FRAMES_MEM * MIN(MSS4, MSS6). -- StefanoOn Sat, Sep 23, 2023 at 12:06:08AM +0200, Stefano Brivio wrote:I noted another oddity. With this patch, _no_ RCVBUF clamp and 16MiB wmem_max fixed, things seem to behave much better with a small rmem_max than large. With rmem_max=256KiB I get pretty consistent 37Gbps throughput and iperf3 -c reports 0 retransmits. With rmem_max=16MiB, the throughput fluctuates from second to second between ~3Gbps and ~30Gbps. The client reports retransmits in some intervals, which is pretty weird over lo. Urgh... so many variables.It looks like we need it as workaround for this situation, readily reproducible at least with a 6.5 Linux kernel, with default rmem_max and wmem_max values: - an iperf3 client on the host sends about 160 KiB, typically segmented into five frames by passt. We read this data using MSG_PEEK - the iperf3 server on the guest starts receiving - meanwhile, the host kernel advertised a zero-sized window to the receiver, as expected - eventually, the guest acknowledges all the data sent so far, and we drop it from the buffer, courtesy of tcp_sock_consume(), using recv() with MSG_TRUNC - the client, however, doesn't get an updated window value, and even keepalive packets are answered with zero-window segments, until the connection is closed It looks like dropping data from a socket using MSG_TRUNC doesn't cause a recalculation of the window, which would be expected as a result of any receiving operation that invalidates data on a buffer (that is, not with MSG_PEEK). Strangely enough, setting TCP_WINDOW_CLAMP via setsockopt(), even to the previous value we clamped to, forces a recalculation of the window which is advertised to the guest. I couldn't quite confirm this issue by following all the possible code paths in the kernel, yet. If confirmed, this should be fixed in the kernel, but meanwhile this workaround looks robust to me (and it will be needed for backward compatibility anyway).So, I tested this, and things got a bit complicated. First, I reproduced the "read side" problem by setting net.core.rmem_max to 256kiB while setting net.core.wmem_max to 16MiB. The "160kiB" stall happened almost every time. Applying this patch appears to fix it completely, getting GiB/s throughput consistently. So, yah. Then I tried reproducing it differently: by setting both net.core.rmem_max and net.core.wmem_max to 16MiB, but setting SO_RCVBUF to 128kiB explicitly in tcp_sock_set_bufsize() (which actually results in a 256kiB buffer, because of the kernel's weird interpretation). With the SO_RCVBUF clamp and without this patch, I don't get the 160kiB stall consistently any more. What I *do* get is nearly every time - but not *every* time - is slow transfers, ~40Mbps vs. ~12Gbps. Sometimes it stalls after several seconds. The stall is slightly different from the 160kiB stall though: the 160kiB stall seems 0 bytes transferred on both sides. 
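For illustration, the receive buffer clamp being suggested here might look roughly as follows. This is only a sketch of the idea, written in passt's idiom: the helper name is hypothetical, TCP_FRAMES_MEM, MIN() and debug() are existing passt facilities, and mss4/mss6 stand for the IPv4/IPv6 MSS values currently in use.

/* Clamp the receive buffer so that one recvmsg() pass fills at most a
 * full batch of frames */
static void tcp_sock_clamp_rcvbuf(int s, int mss4, int mss6)
{
	int v = TCP_FRAMES_MEM * MIN(mss4, mss6);

	if (setsockopt(s, SOL_SOCKET, SO_RCVBUF, &v, sizeof(v)))
		debug("TCP: failed to clamp SO_RCVBUF to %i", v);
}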
On Wed, Sep 27, 2023 at 07:05:50PM +0200, Stefano Brivio wrote:On Mon, 25 Sep 2023 14:21:47 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Hm, ok. Not really sure why that would cause such nasty behaviour.On Mon, Sep 25, 2023 at 02:09:41PM +1000, David Gibson wrote:This is probably due to the receive buffer getting bigger than TCP_FRAMES_MEM * MSS4 (or MSS6), so the amount of data we can read in one shot from the sockets isn't optimally sized anymore.On Sat, Sep 23, 2023 at 12:06:08AM +0200, Stefano Brivio wrote:I noted another oddity. With this patch, _no_ RCVBUF clamp and 16MiB wmem_max fixed, things seem to behave much better with a small rmem_max than large. With rmem_max=256KiB I get pretty consistent 37Gbps throughput and iperf3 -c reports 0 retransmits. With rmem_max=16MiB, the throughput fluctuates from second to second between ~3Gbps and ~30Gbps. The client reports retransmits in some intervals, which is pretty weird over lo. Urgh... so many variables.It looks like we need it as workaround for this situation, readily reproducible at least with a 6.5 Linux kernel, with default rmem_max and wmem_max values: - an iperf3 client on the host sends about 160 KiB, typically segmented into five frames by passt. We read this data using MSG_PEEK - the iperf3 server on the guest starts receiving - meanwhile, the host kernel advertised a zero-sized window to the receiver, as expected - eventually, the guest acknowledges all the data sent so far, and we drop it from the buffer, courtesy of tcp_sock_consume(), using recv() with MSG_TRUNC - the client, however, doesn't get an updated window value, and even keepalive packets are answered with zero-window segments, until the connection is closed It looks like dropping data from a socket using MSG_TRUNC doesn't cause a recalculation of the window, which would be expected as a result of any receiving operation that invalidates data on a buffer (that is, not with MSG_PEEK). Strangely enough, setting TCP_WINDOW_CLAMP via setsockopt(), even to the previous value we clamped to, forces a recalculation of the window which is advertised to the guest. I couldn't quite confirm this issue by following all the possible code paths in the kernel, yet. If confirmed, this should be fixed in the kernel, but meanwhile this workaround looks robust to me (and it will be needed for backward compatibility anyway).So, I tested this, and things got a bit complicated. First, I reproduced the "read side" problem by setting net.core.rmem_max to 256kiB while setting net.core.wmem_max to 16MiB. The "160kiB" stall happened almost every time. Applying this patch appears to fix it completely, getting GiB/s throughput consistently. So, yah. Then I tried reproducing it differently: by setting both net.core.rmem_max and net.core.wmem_max to 16MiB, but setting SO_RCVBUF to 128kiB explicitly in tcp_sock_set_bufsize() (which actually results in a 256kiB buffer, because of the kernel's weird interpretation). With the SO_RCVBUF clamp and without this patch, I don't get the 160kiB stall consistently any more. What I *do* get is nearly every time - but not *every* time - is slow transfers, ~40Mbps vs. ~12Gbps. Sometimes it stalls after several seconds. The stall is slightly different from the 160kiB stall though: the 160kiB stall seems 0 bytes transferred on both sides. With the RCVBUF stall I get a trickle of bytes (620 bytes/s) on the receiver/guest side, with mostly 0 bytes per interval on the sender but occasionally an interval with several hundred KB. 
That is it seems like there's a buffer somewhere that's very slowly draining into the receiver, then getting topped up in an instant once it gets low enough. When I have both this patch and the RCVBUF clamp, I don't seem to be able to reproduce the trickle-stall anymore, but I still get the slow transfer speeds most, but not every time. Sometimes, but only rarely, I do seem to still get a complete stall (0 bytes on both sides).We should have a look at the difference between not clamping at all (and if that yields the same throughput, great), and clamping to, I guess, TCP_FRAMES_MEM * MIN(MSS4, MSS6).Right. We currently ask for the largest RCVBUF we can get, which might not really be what we want. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
...so that we'll retry sending them, instead of more-or-less silently dropping them. This happens quite frequently if our sending buffer on the UNIX domain socket is heavily constrained (for instance, by the 208 KiB default memory limit). It might be argued that dropping frames is part of the expected TCP flow: we don't dequeue those from the socket anyway, so we'll eventually retransmit them. But we don't need the receiver to tell us (by the way of duplicate or missing ACKs) that we couldn't send them: we already know as sendmsg() reports that. This seems to considerably increase throughput stability and throughput itself for TCP connections with default wmem_max values. Unfortunately, the 16 bits left as padding in the frame descriptors we use internally aren't enough to uniquely identify for which connection we should update sequence numbers: create a parallel array of pointers to sequence numbers and L4 lengths, of TCP_FRAMES_MEM size, and go through it after calling sendmsg(). Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> --- tap.c | 10 +++++++--- tap.h | 2 +- tcp.c | 43 ++++++++++++++++++++++++++++++++++++------- 3 files changed, 44 insertions(+), 11 deletions(-) diff --git a/tap.c b/tap.c index 93db989..b30ff81 100644 --- a/tap.c +++ b/tap.c @@ -413,13 +413,15 @@ static size_t tap_send_frames_passt(const struct ctx *c, * @c: Execution context * @iov: Array of buffers, each containing one frame (with L2 headers) * @n: Number of buffers/frames in @iov + * + * Return: number of frames actually sent */ -void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n) +size_t tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n) { size_t m; if (!n) - return; + return 0; if (c->mode == MODE_PASST) m = tap_send_frames_passt(c, iov, n); @@ -427,9 +429,11 @@ void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n) m = tap_send_frames_pasta(c, iov, n); if (m < n) - debug("tap: dropped %lu frames of %lu due to short send", n - m, n); + debug("tap: failed to send %lu frames of %lu", n - m, n); pcap_multiple(iov, m, c->mode == MODE_PASST ? 
sizeof(uint32_t) : 0); + + return m; } /** diff --git a/tap.h b/tap.h index 021fb7c..952fafc 100644 --- a/tap.h +++ b/tap.h @@ -73,7 +73,7 @@ void tap_icmp6_send(const struct ctx *c, const struct in6_addr *src, const struct in6_addr *dst, void *in, size_t len); int tap_send(const struct ctx *c, const void *data, size_t len); -void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n); +size_t tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n); void tap_update_mac(struct tap_hdr *taph, const unsigned char *eth_d, const unsigned char *eth_s); void tap_listen_handler(struct ctx *c, uint32_t events); diff --git a/tcp.c b/tcp.c index 4606f17..76b7b8d 100644 --- a/tcp.c +++ b/tcp.c @@ -434,6 +434,16 @@ static int tcp_sock_ns [NUM_PORTS][IP_VERSIONS]; */ static union inany_addr low_rtt_dst[LOW_RTT_TABLE_SIZE]; +/** + * tcp_buf_seq_update - Sequences to update with length of frames once sent + * @seq: Pointer to sequence number sent to tap-side, to be updated + * @len: TCP payload length + */ +struct tcp_buf_seq_update { + uint32_t *seq; + uint16_t len; +}; + /* Static buffers */ /** @@ -462,6 +472,8 @@ static struct tcp4_l2_buf_t { #endif tcp4_l2_buf[TCP_FRAMES_MEM]; +static struct tcp_buf_seq_update tcp4_l2_buf_seq_update[TCP_FRAMES_MEM]; + static unsigned int tcp4_l2_buf_used; /** @@ -490,6 +502,8 @@ struct tcp6_l2_buf_t { #endif tcp6_l2_buf[TCP_FRAMES_MEM]; +static struct tcp_buf_seq_update tcp6_l2_buf_seq_update[TCP_FRAMES_MEM]; + static unsigned int tcp6_l2_buf_used; /* recvmsg()/sendmsg() data for tap */ @@ -1369,10 +1383,17 @@ static void tcp_l2_flags_buf_flush(struct ctx *c) */ static void tcp_l2_data_buf_flush(struct ctx *c) { - tap_send_frames(c, tcp6_l2_iov, tcp6_l2_buf_used); + unsigned i; + size_t m; + + m = tap_send_frames(c, tcp6_l2_iov, tcp6_l2_buf_used); + for (i = 0; i < m; i++) + *tcp6_l2_buf_seq_update[i].seq += tcp6_l2_buf_seq_update[i].len; tcp6_l2_buf_used = 0; - tap_send_frames(c, tcp4_l2_iov, tcp4_l2_buf_used); + m = tap_send_frames(c, tcp4_l2_iov, tcp4_l2_buf_used); + for (i = 0; i < m; i++) + *tcp4_l2_buf_seq_update[i].seq += tcp4_l2_buf_seq_update[i].len; tcp4_l2_buf_used = 0; } @@ -2149,10 +2170,11 @@ static int tcp_sock_consume(struct tcp_tap_conn *conn, uint32_t ack_seq) * @plen: Payload length at L4 * @no_csum: Don't compute IPv4 checksum, use the one from previous buffer * @seq: Sequence number to be sent - * @now: Current timestamp + * @seq_update: Pointer to sequence number to update on successful send */ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn, - ssize_t plen, int no_csum, uint32_t seq) + ssize_t plen, int no_csum, uint32_t seq, + uint32_t *seq_update) { struct iovec *iov; @@ -2160,6 +2182,9 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn, struct tcp4_l2_buf_t *b = &tcp4_l2_buf[tcp4_l2_buf_used]; uint16_t *check = no_csum ? 
&(b - 1)->iph.check : NULL; + tcp4_l2_buf_seq_update[tcp4_l2_buf_used].seq = seq_update; + tcp4_l2_buf_seq_update[tcp4_l2_buf_used].len = plen; + iov = tcp4_l2_iov + tcp4_l2_buf_used++; iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen, check, seq); @@ -2168,6 +2193,9 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn, } else if (CONN_V6(conn)) { struct tcp6_l2_buf_t *b = &tcp6_l2_buf[tcp6_l2_buf_used]; + tcp6_l2_buf_seq_update[tcp6_l2_buf_used].seq = seq_update; + tcp6_l2_buf_seq_update[tcp6_l2_buf_used].len = plen; + iov = tcp6_l2_iov + tcp6_l2_buf_used++; iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen, NULL, seq); @@ -2193,7 +2221,7 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) int s = conn->sock, i, ret = 0; struct msghdr mh_sock = { 0 }; uint16_t mss = MSS_GET(conn); - uint32_t already_sent; + uint32_t already_sent, seq; struct iovec *iov; already_sent = conn->seq_to_tap - conn->seq_ack_from_tap; @@ -2282,14 +2310,15 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) /* Finally, queue to tap */ plen = mss; + seq = conn->seq_to_tap; for (i = 0; i < send_bufs; i++) { int no_csum = i && i != send_bufs - 1 && tcp4_l2_buf_used; if (i == send_bufs - 1) plen = last_len; - tcp_data_to_tap(c, conn, plen, no_csum, conn->seq_to_tap); - conn->seq_to_tap += plen; + tcp_data_to_tap(c, conn, plen, no_csum, seq, &conn->seq_to_tap); + seq += plen; } conn_flag(c, conn, ACK_FROM_TAP_DUE); -- 2.39.2
On Sat, Sep 23, 2023 at 12:06:09AM +0200, Stefano Brivio wrote:...so that we'll retry sending them, instead of more-or-less silently dropping them. This happens quite frequently if our sending buffer on the UNIX domain socket is heavily constrained (for instance, by the 208 KiB default memory limit). It might be argued that dropping frames is part of the expected TCP flow: we don't dequeue those from the socket anyway, so we'll eventually retransmit them. But we don't need the receiver to tell us (by the way of duplicate or missing ACKs) that we couldn't send them: we already know as sendmsg() reports that. This seems to considerably increase throughput stability and throughput itself for TCP connections with default wmem_max values. Unfortunately, the 16 bits left as padding in the frame descriptorsI assume you're referring to the 'pad' fields in tcp[46]_l2_buf_t, yes? For AVX2 we have substantially more space here. Couldn't we put a conn (or seq) pointer in here at the cost of a few bytes MSS for non-AVX2 and zero cost for AVX2 (which is probably the majority case)?we use internally aren't enough to uniquely identify for which connection we should update sequence numbers: create a parallel array of pointers to sequence numbers and L4 lengths, of TCP_FRAMES_MEM size, and go through it after calling sendmsg(). Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> --- tap.c | 10 +++++++--- tap.h | 2 +- tcp.c | 43 ++++++++++++++++++++++++++++++++++++------- 3 files changed, 44 insertions(+), 11 deletions(-) diff --git a/tap.c b/tap.c index 93db989..b30ff81 100644 --- a/tap.c +++ b/tap.c @@ -413,13 +413,15 @@ static size_t tap_send_frames_passt(const struct ctx *c, * @c: Execution context * @iov: Array of buffers, each containing one frame (with L2 headers) * @n: Number of buffers/frames in @iov + * + * Return: number of frames actually sent */ -void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n) +size_t tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n) { size_t m; if (!n) - return; + return 0; if (c->mode == MODE_PASST) m = tap_send_frames_passt(c, iov, n); @@ -427,9 +429,11 @@ void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n) m = tap_send_frames_pasta(c, iov, n); if (m < n) - debug("tap: dropped %lu frames of %lu due to short send", n - m, n); + debug("tap: failed to send %lu frames of %lu", n - m, n); pcap_multiple(iov, m, c->mode == MODE_PASST ? 
sizeof(uint32_t) : 0); + + return m; } /** diff --git a/tap.h b/tap.h index 021fb7c..952fafc 100644 --- a/tap.h +++ b/tap.h @@ -73,7 +73,7 @@ void tap_icmp6_send(const struct ctx *c, const struct in6_addr *src, const struct in6_addr *dst, void *in, size_t len); int tap_send(const struct ctx *c, const void *data, size_t len); -void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n); +size_t tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n); void tap_update_mac(struct tap_hdr *taph, const unsigned char *eth_d, const unsigned char *eth_s); void tap_listen_handler(struct ctx *c, uint32_t events); diff --git a/tcp.c b/tcp.c index 4606f17..76b7b8d 100644 --- a/tcp.c +++ b/tcp.c @@ -434,6 +434,16 @@ static int tcp_sock_ns [NUM_PORTS][IP_VERSIONS]; */ static union inany_addr low_rtt_dst[LOW_RTT_TABLE_SIZE]; +/** + * tcp_buf_seq_update - Sequences to update with length of frames once sent + * @seq: Pointer to sequence number sent to tap-side, to be updated + * @len: TCP payload length + */ +struct tcp_buf_seq_update { + uint32_t *seq; + uint16_t len; +}; + /* Static buffers */ /** @@ -462,6 +472,8 @@ static struct tcp4_l2_buf_t { #endif tcp4_l2_buf[TCP_FRAMES_MEM]; +static struct tcp_buf_seq_update tcp4_l2_buf_seq_update[TCP_FRAMES_MEM]; + static unsigned int tcp4_l2_buf_used; /** @@ -490,6 +502,8 @@ struct tcp6_l2_buf_t { #endif tcp6_l2_buf[TCP_FRAMES_MEM]; +static struct tcp_buf_seq_update tcp6_l2_buf_seq_update[TCP_FRAMES_MEM]; + static unsigned int tcp6_l2_buf_used; /* recvmsg()/sendmsg() data for tap */ @@ -1369,10 +1383,17 @@ static void tcp_l2_flags_buf_flush(struct ctx *c) */ static void tcp_l2_data_buf_flush(struct ctx *c) { - tap_send_frames(c, tcp6_l2_iov, tcp6_l2_buf_used); + unsigned i; + size_t m; + + m = tap_send_frames(c, tcp6_l2_iov, tcp6_l2_buf_used); + for (i = 0; i < m; i++) + *tcp6_l2_buf_seq_update[i].seq += tcp6_l2_buf_seq_update[i].len; tcp6_l2_buf_used = 0; - tap_send_frames(c, tcp4_l2_iov, tcp4_l2_buf_used); + m = tap_send_frames(c, tcp4_l2_iov, tcp4_l2_buf_used); + for (i = 0; i < m; i++) + *tcp4_l2_buf_seq_update[i].seq += tcp4_l2_buf_seq_update[i].len; tcp4_l2_buf_used = 0; } @@ -2149,10 +2170,11 @@ static int tcp_sock_consume(struct tcp_tap_conn *conn, uint32_t ack_seq) * @plen: Payload length at L4 * @no_csum: Don't compute IPv4 checksum, use the one from previous buffer * @seq: Sequence number to be sent - * @now: Current timestamp + * @seq_update: Pointer to sequence number to update on successful send */ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn, - ssize_t plen, int no_csum, uint32_t seq) + ssize_t plen, int no_csum, uint32_t seq, + uint32_t *seq_update)seq_update is always &conn->seq_to_tap, so there's no need for an additional parameter.{ struct iovec *iov; @@ -2160,6 +2182,9 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn, struct tcp4_l2_buf_t *b = &tcp4_l2_buf[tcp4_l2_buf_used]; uint16_t *check = no_csum ? 
&(b - 1)->iph.check : NULL; + tcp4_l2_buf_seq_update[tcp4_l2_buf_used].seq = seq_update; + tcp4_l2_buf_seq_update[tcp4_l2_buf_used].len = plen; + iov = tcp4_l2_iov + tcp4_l2_buf_used++; iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen, check, seq); @@ -2168,6 +2193,9 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn, } else if (CONN_V6(conn)) { struct tcp6_l2_buf_t *b = &tcp6_l2_buf[tcp6_l2_buf_used]; + tcp6_l2_buf_seq_update[tcp6_l2_buf_used].seq = seq_update; + tcp6_l2_buf_seq_update[tcp6_l2_buf_used].len = plen; + iov = tcp6_l2_iov + tcp6_l2_buf_used++; iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen, NULL, seq); @@ -2193,7 +2221,7 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) int s = conn->sock, i, ret = 0; struct msghdr mh_sock = { 0 }; uint16_t mss = MSS_GET(conn); - uint32_t already_sent; + uint32_t already_sent, seq; struct iovec *iov; already_sent = conn->seq_to_tap - conn->seq_ack_from_tap; @@ -2282,14 +2310,15 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) /* Finally, queue to tap */ plen = mss; + seq = conn->seq_to_tap;This will only be correct if tcp_l2_data_buf_flush() is *always* called between tcp_data_from_sock() calls for the same socket. That should be true for the normal course of things. However, couldn't it happen that we get a normal socket EPOLLIN event for a particular connection - calling tcp_data_from_sock() - but in the same epoll() round we also get a tap ack for the same connection which causes another call to tcp_data_from_sock() (with the change from patch 2/5). IIRC those would both happen before the deferred handling and therefore the data_buf_flush(). Not sure how to deal with that short of separate 'seq_queued' and 'seq_sent' counters in the connection structure, which is a bit unfortunate.for (i = 0; i < send_bufs; i++) { int no_csum = i && i != send_bufs - 1 && tcp4_l2_buf_used; if (i == send_bufs - 1) plen = last_len; - tcp_data_to_tap(c, conn, plen, no_csum, conn->seq_to_tap); - conn->seq_to_tap += plen; + tcp_data_to_tap(c, conn, plen, no_csum, seq, &conn->seq_to_tap); + seq += plen; } conn_flag(c, conn, ACK_FROM_TAP_DUE);-- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
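Purely for illustration, the alternative floated in the review above (carrying the bookkeeping inside each frame buffer instead of in a parallel array) could look something like the sketch below. The names and sizes are invented for the example: the real tcp4_l2_buf_t/tcp6_l2_buf_t layouts differ, and differ again between AVX2 and non-AVX2 builds, and spending padding bytes this way is exactly what would cost a few bytes of usable MSS on the non-AVX2 variants.

#include <stdint.h>

struct l2_buf_sketch {
	uint32_t *seq_update;	/* where to add 'plen' once the frame is sent */
	uint16_t plen;		/* TCP payload length of this frame */
	uint8_t frame[65535];	/* stand-in for L2/L3/L4 headers and payload */
};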
On Mon, 25 Sep 2023 14:47:52 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:On Sat, Sep 23, 2023 at 12:06:09AM +0200, Stefano Brivio wrote:Right, that....so that we'll retry sending them, instead of more-or-less silently dropping them. This happens quite frequently if our sending buffer on the UNIX domain socket is heavily constrained (for instance, by the 208 KiB default memory limit). It might be argued that dropping frames is part of the expected TCP flow: we don't dequeue those from the socket anyway, so we'll eventually retransmit them. But we don't need the receiver to tell us (by the way of duplicate or missing ACKs) that we couldn't send them: we already know as sendmsg() reports that. This seems to considerably increase throughput stability and throughput itself for TCP connections with default wmem_max values. Unfortunately, the 16 bits left as padding in the frame descriptorsI assume you're referring to the 'pad' fields in tcp[46]_l2_buf_t, yes?For AVX2 we have substantially more space here. Couldn't we put a conn (or seq) pointer in here at the cost of a few bytes MSS for non-AVX2 and zero cost for AVX2 (which is probably the majority case)?Yes, true. On the other hand, having this parallel array only affects readability I guess, whereas inserting pointers and lengths in tcp[46]_l2_buf_t actually decreases the usable MSS (not just on non-AVX2 x86, but also on other architectures). So I'd rather stick to this.Oh, right, I'll drop that.we use internally aren't enough to uniquely identify for which connection we should update sequence numbers: create a parallel array of pointers to sequence numbers and L4 lengths, of TCP_FRAMES_MEM size, and go through it after calling sendmsg(). Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> --- tap.c | 10 +++++++--- tap.h | 2 +- tcp.c | 43 ++++++++++++++++++++++++++++++++++++------- 3 files changed, 44 insertions(+), 11 deletions(-) diff --git a/tap.c b/tap.c index 93db989..b30ff81 100644 --- a/tap.c +++ b/tap.c @@ -413,13 +413,15 @@ static size_t tap_send_frames_passt(const struct ctx *c, * @c: Execution context * @iov: Array of buffers, each containing one frame (with L2 headers) * @n: Number of buffers/frames in @iov + * + * Return: number of frames actually sent */ -void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n) +size_t tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n) { size_t m; if (!n) - return; + return 0; if (c->mode == MODE_PASST) m = tap_send_frames_passt(c, iov, n); @@ -427,9 +429,11 @@ void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n) m = tap_send_frames_pasta(c, iov, n); if (m < n) - debug("tap: dropped %lu frames of %lu due to short send", n - m, n); + debug("tap: failed to send %lu frames of %lu", n - m, n); pcap_multiple(iov, m, c->mode == MODE_PASST ? 
sizeof(uint32_t) : 0); + + return m; } /** diff --git a/tap.h b/tap.h index 021fb7c..952fafc 100644 --- a/tap.h +++ b/tap.h @@ -73,7 +73,7 @@ void tap_icmp6_send(const struct ctx *c, const struct in6_addr *src, const struct in6_addr *dst, void *in, size_t len); int tap_send(const struct ctx *c, const void *data, size_t len); -void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n); +size_t tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n); void tap_update_mac(struct tap_hdr *taph, const unsigned char *eth_d, const unsigned char *eth_s); void tap_listen_handler(struct ctx *c, uint32_t events); diff --git a/tcp.c b/tcp.c index 4606f17..76b7b8d 100644 --- a/tcp.c +++ b/tcp.c @@ -434,6 +434,16 @@ static int tcp_sock_ns [NUM_PORTS][IP_VERSIONS]; */ static union inany_addr low_rtt_dst[LOW_RTT_TABLE_SIZE]; +/** + * tcp_buf_seq_update - Sequences to update with length of frames once sent + * @seq: Pointer to sequence number sent to tap-side, to be updated + * @len: TCP payload length + */ +struct tcp_buf_seq_update { + uint32_t *seq; + uint16_t len; +}; + /* Static buffers */ /** @@ -462,6 +472,8 @@ static struct tcp4_l2_buf_t { #endif tcp4_l2_buf[TCP_FRAMES_MEM]; +static struct tcp_buf_seq_update tcp4_l2_buf_seq_update[TCP_FRAMES_MEM]; + static unsigned int tcp4_l2_buf_used; /** @@ -490,6 +502,8 @@ struct tcp6_l2_buf_t { #endif tcp6_l2_buf[TCP_FRAMES_MEM]; +static struct tcp_buf_seq_update tcp6_l2_buf_seq_update[TCP_FRAMES_MEM]; + static unsigned int tcp6_l2_buf_used; /* recvmsg()/sendmsg() data for tap */ @@ -1369,10 +1383,17 @@ static void tcp_l2_flags_buf_flush(struct ctx *c) */ static void tcp_l2_data_buf_flush(struct ctx *c) { - tap_send_frames(c, tcp6_l2_iov, tcp6_l2_buf_used); + unsigned i; + size_t m; + + m = tap_send_frames(c, tcp6_l2_iov, tcp6_l2_buf_used); + for (i = 0; i < m; i++) + *tcp6_l2_buf_seq_update[i].seq += tcp6_l2_buf_seq_update[i].len; tcp6_l2_buf_used = 0; - tap_send_frames(c, tcp4_l2_iov, tcp4_l2_buf_used); + m = tap_send_frames(c, tcp4_l2_iov, tcp4_l2_buf_used); + for (i = 0; i < m; i++) + *tcp4_l2_buf_seq_update[i].seq += tcp4_l2_buf_seq_update[i].len; tcp4_l2_buf_used = 0; } @@ -2149,10 +2170,11 @@ static int tcp_sock_consume(struct tcp_tap_conn *conn, uint32_t ack_seq) * @plen: Payload length at L4 * @no_csum: Don't compute IPv4 checksum, use the one from previous buffer * @seq: Sequence number to be sent - * @now: Current timestamp + * @seq_update: Pointer to sequence number to update on successful send */ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn, - ssize_t plen, int no_csum, uint32_t seq) + ssize_t plen, int no_csum, uint32_t seq, + uint32_t *seq_update)seq_update is always &conn->seq_to_tap, so there's no need for an additional parameter.Ah, yes, I actually wrote this before 2/5 and concluded it was okay :/ but with that change, it's not. Unless we drop that change from 2/5.{ struct iovec *iov; @@ -2160,6 +2182,9 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn, struct tcp4_l2_buf_t *b = &tcp4_l2_buf[tcp4_l2_buf_used]; uint16_t *check = no_csum ? 
&(b - 1)->iph.check : NULL; + tcp4_l2_buf_seq_update[tcp4_l2_buf_used].seq = seq_update; + tcp4_l2_buf_seq_update[tcp4_l2_buf_used].len = plen; + iov = tcp4_l2_iov + tcp4_l2_buf_used++; iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen, check, seq); @@ -2168,6 +2193,9 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn, } else if (CONN_V6(conn)) { struct tcp6_l2_buf_t *b = &tcp6_l2_buf[tcp6_l2_buf_used]; + tcp6_l2_buf_seq_update[tcp6_l2_buf_used].seq = seq_update; + tcp6_l2_buf_seq_update[tcp6_l2_buf_used].len = plen; + iov = tcp6_l2_iov + tcp6_l2_buf_used++; iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen, NULL, seq); @@ -2193,7 +2221,7 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) int s = conn->sock, i, ret = 0; struct msghdr mh_sock = { 0 }; uint16_t mss = MSS_GET(conn); - uint32_t already_sent; + uint32_t already_sent, seq; struct iovec *iov; already_sent = conn->seq_to_tap - conn->seq_ack_from_tap; @@ -2282,14 +2310,15 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) /* Finally, queue to tap */ plen = mss; + seq = conn->seq_to_tap;This will only be correct if tcp_l2_data_buf_flush() is *always* called between tcp_data_from_sock() calls for the same socket. That should be true for the normal course of things. However, couldn't it happen that we get a normal socket EPOLLIN event for a particular connection - calling tcp_data_from_sock() - but in the same epoll() round we also get a tap ack for the same connection which causes another call to tcp_data_from_sock() (with the change from patch 2/5). IIRC those would both happen before the deferred handling and therefore the data_buf_flush().Not sure how to deal with that short of separate 'seq_queued' and 'seq_sent' counters in the connection structure, which is a bit unfortunate.I wonder how bad it is if we call tcp_l2_data_buf_flush() unconditionally before calling tcp_data_from_sock() from tcp_tap_handler(). But again, maybe this is not needed at all, we should check that epoll detail from 2/5 first... -- Stefano
On Wed, Sep 27, 2023 at 07:06:03PM +0200, Stefano Brivio wrote:On Mon, 25 Sep 2023 14:47:52 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Yeah, I guess so. Actually.. I did just think of one other option. It avoids both any extra padding and a parallel array, but at the cost of additional work when frames are dropped. We could use that 16-bits of padding to store the TCP payload length. Then when we don't manage to send all our frames, we do another loop through and add up how many stream bytes we actually sent to update the seq pointer.On Sat, Sep 23, 2023 at 12:06:09AM +0200, Stefano Brivio wrote:Right, that....so that we'll retry sending them, instead of more-or-less silently dropping them. This happens quite frequently if our sending buffer on the UNIX domain socket is heavily constrained (for instance, by the 208 KiB default memory limit). It might be argued that dropping frames is part of the expected TCP flow: we don't dequeue those from the socket anyway, so we'll eventually retransmit them. But we don't need the receiver to tell us (by the way of duplicate or missing ACKs) that we couldn't send them: we already know as sendmsg() reports that. This seems to considerably increase throughput stability and throughput itself for TCP connections with default wmem_max values. Unfortunately, the 16 bits left as padding in the frame descriptorsI assume you're referring to the 'pad' fields in tcp[46]_l2_buf_t, yes?For AVX2 we have substantially more space here. Couldn't we put a conn (or seq) pointer in here at the cost of a few bytes MSS for non-AVX2 and zero cost for AVX2 (which is probably the majority case)?Yes, true. On the other hand, having this parallel array only affects readability I guess, whereas inserting pointers and lengths in tcp[46]_l2_buf_t actually decreases the usable MSS (not just on non-AVX2 x86, but also on other architectures). So I'd rather stick to this.Even if we drop the change, it's a worryingly subtle constraint.Oh, right, I'll drop that.we use internally aren't enough to uniquely identify for which connection we should update sequence numbers: create a parallel array of pointers to sequence numbers and L4 lengths, of TCP_FRAMES_MEM size, and go through it after calling sendmsg(). Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> --- tap.c | 10 +++++++--- tap.h | 2 +- tcp.c | 43 ++++++++++++++++++++++++++++++++++++------- 3 files changed, 44 insertions(+), 11 deletions(-) diff --git a/tap.c b/tap.c index 93db989..b30ff81 100644 --- a/tap.c +++ b/tap.c @@ -413,13 +413,15 @@ static size_t tap_send_frames_passt(const struct ctx *c, * @c: Execution context * @iov: Array of buffers, each containing one frame (with L2 headers) * @n: Number of buffers/frames in @iov + * + * Return: number of frames actually sent */ -void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n) +size_t tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n) { size_t m; if (!n) - return; + return 0; if (c->mode == MODE_PASST) m = tap_send_frames_passt(c, iov, n); @@ -427,9 +429,11 @@ void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n) m = tap_send_frames_pasta(c, iov, n); if (m < n) - debug("tap: dropped %lu frames of %lu due to short send", n - m, n); + debug("tap: failed to send %lu frames of %lu", n - m, n); pcap_multiple(iov, m, c->mode == MODE_PASST ? 
sizeof(uint32_t) : 0); + + return m; } /** diff --git a/tap.h b/tap.h index 021fb7c..952fafc 100644 --- a/tap.h +++ b/tap.h @@ -73,7 +73,7 @@ void tap_icmp6_send(const struct ctx *c, const struct in6_addr *src, const struct in6_addr *dst, void *in, size_t len); int tap_send(const struct ctx *c, const void *data, size_t len); -void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n); +size_t tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n); void tap_update_mac(struct tap_hdr *taph, const unsigned char *eth_d, const unsigned char *eth_s); void tap_listen_handler(struct ctx *c, uint32_t events); diff --git a/tcp.c b/tcp.c index 4606f17..76b7b8d 100644 --- a/tcp.c +++ b/tcp.c @@ -434,6 +434,16 @@ static int tcp_sock_ns [NUM_PORTS][IP_VERSIONS]; */ static union inany_addr low_rtt_dst[LOW_RTT_TABLE_SIZE]; +/** + * tcp_buf_seq_update - Sequences to update with length of frames once sent + * @seq: Pointer to sequence number sent to tap-side, to be updated + * @len: TCP payload length + */ +struct tcp_buf_seq_update { + uint32_t *seq; + uint16_t len; +}; + /* Static buffers */ /** @@ -462,6 +472,8 @@ static struct tcp4_l2_buf_t { #endif tcp4_l2_buf[TCP_FRAMES_MEM]; +static struct tcp_buf_seq_update tcp4_l2_buf_seq_update[TCP_FRAMES_MEM]; + static unsigned int tcp4_l2_buf_used; /** @@ -490,6 +502,8 @@ struct tcp6_l2_buf_t { #endif tcp6_l2_buf[TCP_FRAMES_MEM]; +static struct tcp_buf_seq_update tcp6_l2_buf_seq_update[TCP_FRAMES_MEM]; + static unsigned int tcp6_l2_buf_used; /* recvmsg()/sendmsg() data for tap */ @@ -1369,10 +1383,17 @@ static void tcp_l2_flags_buf_flush(struct ctx *c) */ static void tcp_l2_data_buf_flush(struct ctx *c) { - tap_send_frames(c, tcp6_l2_iov, tcp6_l2_buf_used); + unsigned i; + size_t m; + + m = tap_send_frames(c, tcp6_l2_iov, tcp6_l2_buf_used); + for (i = 0; i < m; i++) + *tcp6_l2_buf_seq_update[i].seq += tcp6_l2_buf_seq_update[i].len; tcp6_l2_buf_used = 0; - tap_send_frames(c, tcp4_l2_iov, tcp4_l2_buf_used); + m = tap_send_frames(c, tcp4_l2_iov, tcp4_l2_buf_used); + for (i = 0; i < m; i++) + *tcp4_l2_buf_seq_update[i].seq += tcp4_l2_buf_seq_update[i].len; tcp4_l2_buf_used = 0; } @@ -2149,10 +2170,11 @@ static int tcp_sock_consume(struct tcp_tap_conn *conn, uint32_t ack_seq) * @plen: Payload length at L4 * @no_csum: Don't compute IPv4 checksum, use the one from previous buffer * @seq: Sequence number to be sent - * @now: Current timestamp + * @seq_update: Pointer to sequence number to update on successful send */ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn, - ssize_t plen, int no_csum, uint32_t seq) + ssize_t plen, int no_csum, uint32_t seq, + uint32_t *seq_update)seq_update is always &conn->seq_to_tap, so there's no need for an additional parameter.Ah, yes, I actually wrote this before 2/5 and concluded it was okay :/ but with that change, it's not. Unless we drop that change from 2/5.{ struct iovec *iov; @@ -2160,6 +2182,9 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn, struct tcp4_l2_buf_t *b = &tcp4_l2_buf[tcp4_l2_buf_used]; uint16_t *check = no_csum ? 
&(b - 1)->iph.check : NULL; + tcp4_l2_buf_seq_update[tcp4_l2_buf_used].seq = seq_update; + tcp4_l2_buf_seq_update[tcp4_l2_buf_used].len = plen; + iov = tcp4_l2_iov + tcp4_l2_buf_used++; iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen, check, seq); @@ -2168,6 +2193,9 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn, } else if (CONN_V6(conn)) { struct tcp6_l2_buf_t *b = &tcp6_l2_buf[tcp6_l2_buf_used]; + tcp6_l2_buf_seq_update[tcp6_l2_buf_used].seq = seq_update; + tcp6_l2_buf_seq_update[tcp6_l2_buf_used].len = plen; + iov = tcp6_l2_iov + tcp6_l2_buf_used++; iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen, NULL, seq); @@ -2193,7 +2221,7 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) int s = conn->sock, i, ret = 0; struct msghdr mh_sock = { 0 }; uint16_t mss = MSS_GET(conn); - uint32_t already_sent; + uint32_t already_sent, seq; struct iovec *iov; already_sent = conn->seq_to_tap - conn->seq_ack_from_tap; @@ -2282,14 +2310,15 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) /* Finally, queue to tap */ plen = mss; + seq = conn->seq_to_tap;This will only be correct if tcp_l2_data_buf_flush() is *always* called between tcp_data_from_sock() calls for the same socket. That should be true for the normal course of things. However, couldn't it happen that we get a normal socket EPOLLIN event for a particular connection - calling tcp_data_from_sock() - but in the same epoll() round we also get a tap ack for the same connection which causes another call to tcp_data_from_sock() (with the change from patch 2/5). IIRC those would both happen before the deferred handling and therefore the data_buf_flush().-- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibsonNot sure how to deal with that short of separate 'seq_queued' and 'seq_sent' counters in the connection structure, which is a bit unfortunate.I wonder how bad it is if we call tcp_l2_data_buf_flush() unconditionally before calling tcp_data_from_sock() from tcp_tap_handler(). But again, maybe this is not needed at all, we should check that epoll detail from 2/5 first...
On Thu, 28 Sep 2023 11:58:45 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:On Wed, Sep 27, 2023 at 07:06:03PM +0200, Stefano Brivio wrote:Hmm, yes. It's slightly more memory efficient, but the complexity seems a bit overkill to me.On Mon, 25 Sep 2023 14:47:52 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Yeah, I guess so. Actually.. I did just think of one other option. It avoids both any extra padding and a parallel array, but at the cost of additional work when frames are dropped. We could use that 16-bits of padding to store the TCP payload length. Then when we don't manage to send all our frames, we do another loop through and add up how many stream bytes we actually sent to update the seq pointer.On Sat, Sep 23, 2023 at 12:06:09AM +0200, Stefano Brivio wrote:Right, that....so that we'll retry sending them, instead of more-or-less silently dropping them. This happens quite frequently if our sending buffer on the UNIX domain socket is heavily constrained (for instance, by the 208 KiB default memory limit). It might be argued that dropping frames is part of the expected TCP flow: we don't dequeue those from the socket anyway, so we'll eventually retransmit them. But we don't need the receiver to tell us (by the way of duplicate or missing ACKs) that we couldn't send them: we already know as sendmsg() reports that. This seems to considerably increase throughput stability and throughput itself for TCP connections with default wmem_max values. Unfortunately, the 16 bits left as padding in the frame descriptorsI assume you're referring to the 'pad' fields in tcp[46]_l2_buf_t, yes?For AVX2 we have substantially more space here. Couldn't we put a conn (or seq) pointer in here at the cost of a few bytes MSS for non-AVX2 and zero cost for AVX2 (which is probably the majority case)?Yes, true. On the other hand, having this parallel array only affects readability I guess, whereas inserting pointers and lengths in tcp[46]_l2_buf_t actually decreases the usable MSS (not just on non-AVX2 x86, but also on other architectures). So I'd rather stick to this.Another option to avoid this...Even if we drop the change, it's a worryingly subtle constraint.Oh, right, I'll drop that.we use internally aren't enough to uniquely identify for which connection we should update sequence numbers: create a parallel array of pointers to sequence numbers and L4 lengths, of TCP_FRAMES_MEM size, and go through it after calling sendmsg(). 
Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> --- tap.c | 10 +++++++--- tap.h | 2 +- tcp.c | 43 ++++++++++++++++++++++++++++++++++++------- 3 files changed, 44 insertions(+), 11 deletions(-) diff --git a/tap.c b/tap.c index 93db989..b30ff81 100644 --- a/tap.c +++ b/tap.c @@ -413,13 +413,15 @@ static size_t tap_send_frames_passt(const struct ctx *c, * @c: Execution context * @iov: Array of buffers, each containing one frame (with L2 headers) * @n: Number of buffers/frames in @iov + * + * Return: number of frames actually sent */ -void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n) +size_t tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n) { size_t m; if (!n) - return; + return 0; if (c->mode == MODE_PASST) m = tap_send_frames_passt(c, iov, n); @@ -427,9 +429,11 @@ void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n) m = tap_send_frames_pasta(c, iov, n); if (m < n) - debug("tap: dropped %lu frames of %lu due to short send", n - m, n); + debug("tap: failed to send %lu frames of %lu", n - m, n); pcap_multiple(iov, m, c->mode == MODE_PASST ? sizeof(uint32_t) : 0); + + return m; } /** diff --git a/tap.h b/tap.h index 021fb7c..952fafc 100644 --- a/tap.h +++ b/tap.h @@ -73,7 +73,7 @@ void tap_icmp6_send(const struct ctx *c, const struct in6_addr *src, const struct in6_addr *dst, void *in, size_t len); int tap_send(const struct ctx *c, const void *data, size_t len); -void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n); +size_t tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n); void tap_update_mac(struct tap_hdr *taph, const unsigned char *eth_d, const unsigned char *eth_s); void tap_listen_handler(struct ctx *c, uint32_t events); diff --git a/tcp.c b/tcp.c index 4606f17..76b7b8d 100644 --- a/tcp.c +++ b/tcp.c @@ -434,6 +434,16 @@ static int tcp_sock_ns [NUM_PORTS][IP_VERSIONS]; */ static union inany_addr low_rtt_dst[LOW_RTT_TABLE_SIZE]; +/** + * tcp_buf_seq_update - Sequences to update with length of frames once sent + * @seq: Pointer to sequence number sent to tap-side, to be updated + * @len: TCP payload length + */ +struct tcp_buf_seq_update { + uint32_t *seq; + uint16_t len; +}; + /* Static buffers */ /** @@ -462,6 +472,8 @@ static struct tcp4_l2_buf_t { #endif tcp4_l2_buf[TCP_FRAMES_MEM]; +static struct tcp_buf_seq_update tcp4_l2_buf_seq_update[TCP_FRAMES_MEM]; + static unsigned int tcp4_l2_buf_used; /** @@ -490,6 +502,8 @@ struct tcp6_l2_buf_t { #endif tcp6_l2_buf[TCP_FRAMES_MEM]; +static struct tcp_buf_seq_update tcp6_l2_buf_seq_update[TCP_FRAMES_MEM]; + static unsigned int tcp6_l2_buf_used; /* recvmsg()/sendmsg() data for tap */ @@ -1369,10 +1383,17 @@ static void tcp_l2_flags_buf_flush(struct ctx *c) */ static void tcp_l2_data_buf_flush(struct ctx *c) { - tap_send_frames(c, tcp6_l2_iov, tcp6_l2_buf_used); + unsigned i; + size_t m; + + m = tap_send_frames(c, tcp6_l2_iov, tcp6_l2_buf_used); + for (i = 0; i < m; i++) + *tcp6_l2_buf_seq_update[i].seq += tcp6_l2_buf_seq_update[i].len; tcp6_l2_buf_used = 0; - tap_send_frames(c, tcp4_l2_iov, tcp4_l2_buf_used); + m = tap_send_frames(c, tcp4_l2_iov, tcp4_l2_buf_used); + for (i = 0; i < m; i++) + *tcp4_l2_buf_seq_update[i].seq += tcp4_l2_buf_seq_update[i].len; tcp4_l2_buf_used = 0; } @@ -2149,10 +2170,11 @@ static int tcp_sock_consume(struct tcp_tap_conn *conn, uint32_t ack_seq) * @plen: Payload length at L4 * @no_csum: Don't compute IPv4 checksum, use the one from previous buffer * @seq: Sequence number to be sent - * @now: Current timestamp + * 
@seq_update: Pointer to sequence number to update on successful send */ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn, - ssize_t plen, int no_csum, uint32_t seq) + ssize_t plen, int no_csum, uint32_t seq, + uint32_t *seq_update)seq_update is always &conn->seq_to_tap, so there's no need for an additional parameter.Ah, yes, I actually wrote this before 2/5 and concluded it was okay :/ but with that change, it's not. Unless we drop that change from 2/5.{ struct iovec *iov; @@ -2160,6 +2182,9 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn, struct tcp4_l2_buf_t *b = &tcp4_l2_buf[tcp4_l2_buf_used]; uint16_t *check = no_csum ? &(b - 1)->iph.check : NULL; + tcp4_l2_buf_seq_update[tcp4_l2_buf_used].seq = seq_update; + tcp4_l2_buf_seq_update[tcp4_l2_buf_used].len = plen; + iov = tcp4_l2_iov + tcp4_l2_buf_used++; iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen, check, seq); @@ -2168,6 +2193,9 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn, } else if (CONN_V6(conn)) { struct tcp6_l2_buf_t *b = &tcp6_l2_buf[tcp6_l2_buf_used]; + tcp6_l2_buf_seq_update[tcp6_l2_buf_used].seq = seq_update; + tcp6_l2_buf_seq_update[tcp6_l2_buf_used].len = plen; + iov = tcp6_l2_iov + tcp6_l2_buf_used++; iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen, NULL, seq); @@ -2193,7 +2221,7 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) int s = conn->sock, i, ret = 0; struct msghdr mh_sock = { 0 }; uint16_t mss = MSS_GET(conn); - uint32_t already_sent; + uint32_t already_sent, seq; struct iovec *iov; already_sent = conn->seq_to_tap - conn->seq_ack_from_tap; @@ -2282,14 +2310,15 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) /* Finally, queue to tap */ plen = mss; + seq = conn->seq_to_tap;This will only be correct if tcp_l2_data_buf_flush() is *always* called between tcp_data_from_sock() calls for the same socket. That should be true for the normal course of things. However, couldn't it happen that we get a normal socket EPOLLIN event for a particular connection - calling tcp_data_from_sock() - but in the same epoll() round we also get a tap ack for the same connection which causes another call to tcp_data_from_sock() (with the change from patch 2/5). IIRC those would both happen before the deferred handling and therefore the data_buf_flush().> > Not sure how to deal with that short of separate 'seq_queued' and > > 'seq_sent' counters in the connection structure, which is a bit > > unfortunate. > > I wonder how bad it is if we call tcp_l2_data_buf_flush() > unconditionally before calling tcp_data_from_sock() from > tcp_tap_handler(). But again, maybe this is not needed at all, we > should check that epoll detail from 2/5 first...other than this one, would be to use that external table to update sequence numbers *in the frames* as we send stuff out. -- Stefano
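The constraint discussed here can be shown with a small standalone toy (made-up helpers, not passt code): if data for the same connection is queued twice before the batch is flushed, both calls start from the same base sequence, which is exactly what flushing unconditionally before re-entering the data path, as floated above, would prevent:

#include <stdint.h>
#include <stdio.h>

static uint32_t seq_to_tap = 1000;	/* stand-in for conn->seq_to_tap */
static uint32_t batch_bytes;		/* bytes queued but not yet flushed */

/* Queue 'avail' bytes starting from the current base sequence */
static void queue_from_sock(uint32_t avail)
{
	printf("queueing %u bytes starting at seq %u\n", avail, seq_to_tap);
	batch_bytes += avail;
	/* seq_to_tap is only advanced when the batch is flushed */
}

/* Flush the batch and advance the sequence by what was handed out */
static void flush_batch(void)
{
	seq_to_tap += batch_bytes;
	batch_bytes = 0;
}

int main(void)
{
	/* Problematic ordering: two queue calls, one flush */
	queue_from_sock(2560);	/* socket EPOLLIN */
	queue_from_sock(2560);	/* tap ACK in the same epoll round: starts
				 * from the same sequence, re-queueing bytes
				 * that are already sitting in the batch */
	flush_batch();

	/* Ordering floated above: flush before queueing again */
	queue_from_sock(2560);
	flush_batch();
	queue_from_sock(2560);	/* now starts from the advanced sequence */
	flush_batch();

	return 0;
}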
On Fri, Sep 29, 2023 at 05:19:50PM +0200, Stefano Brivio wrote:On Thu, 28 Sep 2023 11:58:45 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:More importantly, I forgot the fact that by the time we're sending the frames, we don't know what connection they're associated with any more. [snip]On Wed, Sep 27, 2023 at 07:06:03PM +0200, Stefano Brivio wrote:Hmm, yes. It's slightly more memory efficient, but the complexity seems a bit overkill to me.On Mon, 25 Sep 2023 14:47:52 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Yeah, I guess so. Actually.. I did just think of one other option. It avoids both any extra padding and a parallel array, but at the cost of additional work when frames are dropped. We could use that 16-bits of padding to store the TCP payload length. Then when we don't manage to send all our frames, we do another loop through and add up how many stream bytes we actually sent to update the seq pointer.On Sat, Sep 23, 2023 at 12:06:09AM +0200, Stefano Brivio wrote: > ...so that we'll retry sending them, instead of more-or-less silently > dropping them. This happens quite frequently if our sending buffer on > the UNIX domain socket is heavily constrained (for instance, by the > 208 KiB default memory limit). > > It might be argued that dropping frames is part of the expected TCP > flow: we don't dequeue those from the socket anyway, so we'll > eventually retransmit them. > > But we don't need the receiver to tell us (by the way of duplicate or > missing ACKs) that we couldn't send them: we already know as > sendmsg() reports that. This seems to considerably increase > throughput stability and throughput itself for TCP connections with > default wmem_max values. > > Unfortunately, the 16 bits left as padding in the frame descriptors I assume you're referring to the 'pad' fields in tcp[46]_l2_buf_t, yes?Right, that.For AVX2 we have substantially more space here. Couldn't we put a conn (or seq) pointer in here at the cost of a few bytes MSS for non-AVX2 and zero cost for AVX2 (which is probably the majority case)?Yes, true. On the other hand, having this parallel array only affects readability I guess, whereas inserting pointers and lengths in tcp[46]_l2_buf_t actually decreases the usable MSS (not just on non-AVX2 x86, but also on other architectures). So I'd rather stick to this.Not really sure what you're proposing there. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibsonAnother option to avoid this...Even if we drop the change, it's a worryingly subtle constraint.> @@ -2282,14 +2310,15 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) > > /* Finally, queue to tap */ > plen = mss; > + seq = conn->seq_to_tap; This will only be correct if tcp_l2_data_buf_flush() is *always* called between tcp_data_from_sock() calls for the same socket. That should be true for the normal course of things. However, couldn't it happen that we get a normal socket EPOLLIN event for a particular connection - calling tcp_data_from_sock() - but in the same epoll() round we also get a tap ack for the same connection which causes another call to tcp_data_from_sock() (with the change from patch 2/5). IIRC those would both happen before the deferred handling and therefore the data_buf_flush().Ah, yes, I actually wrote this before 2/5 and concluded it was okay :/ but with that change, it's not. 
Unless we drop that change from 2/5.> > Not sure how to deal with that short of separate 'seq_queued' and > > 'seq_sent' counters in the connection structure, which is a bit > > unfortunate. > > I wonder how bad it is if we call tcp_l2_data_buf_flush() > unconditionally before calling tcp_data_from_sock() from > tcp_tap_handler(). But again, maybe this is not needed at all, we > should check that epoll detail from 2/5 first...other than this one, would be to use that external table to update sequence numbers *in the frames* as we send stuff out.
On Tue, 3 Oct 2023 14:22:59 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:On Fri, Sep 29, 2023 at 05:19:50PM +0200, Stefano Brivio wrote:Oh, I thought you wanted to rebuild the information about the connection by looking into the hash table or something like that.On Thu, 28 Sep 2023 11:58:45 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:More importantly, I forgot the fact that by the time we're sending the frames, we don't know what connection they're associated with any more.On Wed, Sep 27, 2023 at 07:06:03PM +0200, Stefano Brivio wrote:Hmm, yes. It's slightly more memory efficient, but the complexity seems a bit overkill to me.On Mon, 25 Sep 2023 14:47:52 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote: > On Sat, Sep 23, 2023 at 12:06:09AM +0200, Stefano Brivio wrote: > > ...so that we'll retry sending them, instead of more-or-less silently > > dropping them. This happens quite frequently if our sending buffer on > > the UNIX domain socket is heavily constrained (for instance, by the > > 208 KiB default memory limit). > > > > It might be argued that dropping frames is part of the expected TCP > > flow: we don't dequeue those from the socket anyway, so we'll > > eventually retransmit them. > > > > But we don't need the receiver to tell us (by the way of duplicate or > > missing ACKs) that we couldn't send them: we already know as > > sendmsg() reports that. This seems to considerably increase > > throughput stability and throughput itself for TCP connections with > > default wmem_max values. > > > > Unfortunately, the 16 bits left as padding in the frame descriptors > > I assume you're referring to the 'pad' fields in tcp[46]_l2_buf_t, > yes? Right, that. > For AVX2 we have substantially more space here. Couldn't we put > a conn (or seq) pointer in here at the cost of a few bytes MSS for > non-AVX2 and zero cost for AVX2 (which is probably the majority case)? Yes, true. On the other hand, having this parallel array only affects readability I guess, whereas inserting pointers and lengths in tcp[46]_l2_buf_t actually decreases the usable MSS (not just on non-AVX2 x86, but also on other architectures). So I'd rather stick to this.Yeah, I guess so. Actually.. I did just think of one other option. It avoids both any extra padding and a parallel array, but at the cost of additional work when frames are dropped. We could use that 16-bits of padding to store the TCP payload length. Then when we don't manage to send all our frames, we do another loop through and add up how many stream bytes we actually sent to update the seq pointer.[snip]That tcp_l2_buf_fill_headers() calculates the sequence from conn->seq_to_tap plus a cumulative count from that table, instead of passing it from the caller. -- StefanoNot really sure what you're proposing there.Another option to avoid this...> > @@ -2282,14 +2310,15 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) > > > > /* Finally, queue to tap */ > > plen = mss; > > + seq = conn->seq_to_tap; > > This will only be correct if tcp_l2_data_buf_flush() is *always* > called between tcp_data_from_sock() calls for the same socket. That > should be true for the normal course of things. However, couldn't it > happen that we get a normal socket EPOLLIN event for a particular > connection - calling tcp_data_from_sock() - but in the same epoll() > round we also get a tap ack for the same connection which causes > another call to tcp_data_from_sock() (with the change from patch > 2/5). 
IIRC those would both happen before the deferred handling and > therefore the data_buf_flush(). Ah, yes, I actually wrote this before 2/5 and concluded it was okay :/ but with that change, it's not. Unless we drop that change from 2/5.Even if we drop the change, it's a worryingly subtle constraint.> > Not sure how to deal with that short of separate 'seq_queued' and > > 'seq_sent' counters in the connection structure, which is a bit > > unfortunate. > > I wonder how bad it is if we call tcp_l2_data_buf_flush() > unconditionally before calling tcp_data_from_sock() from > tcp_tap_handler(). But again, maybe this is not needed at all, we > should check that epoll detail from 2/5 first...other than this one, would be to use that external table to update sequence numbers *in the frames* as we send stuff out.
On Thu, Oct 05, 2023 at 08:19:00AM +0200, Stefano Brivio wrote:On Tue, 3 Oct 2023 14:22:59 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:No, I guess we could do that, but I think it would be substantially messier than the parallel array approach.On Fri, Sep 29, 2023 at 05:19:50PM +0200, Stefano Brivio wrote:Oh, I thought you wanted to rebuild the information about the connection by looking into the hash table or something like that.On Thu, 28 Sep 2023 11:58:45 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:More importantly, I forgot the fact that by the time we're sending the frames, we don't know what connection they're associated with any more.On Wed, Sep 27, 2023 at 07:06:03PM +0200, Stefano Brivio wrote: > On Mon, 25 Sep 2023 14:47:52 +1000 > David Gibson <david(a)gibson.dropbear.id.au> wrote: > > > On Sat, Sep 23, 2023 at 12:06:09AM +0200, Stefano Brivio wrote: > > > ...so that we'll retry sending them, instead of more-or-less silently > > > dropping them. This happens quite frequently if our sending buffer on > > > the UNIX domain socket is heavily constrained (for instance, by the > > > 208 KiB default memory limit). > > > > > > It might be argued that dropping frames is part of the expected TCP > > > flow: we don't dequeue those from the socket anyway, so we'll > > > eventually retransmit them. > > > > > > But we don't need the receiver to tell us (by the way of duplicate or > > > missing ACKs) that we couldn't send them: we already know as > > > sendmsg() reports that. This seems to considerably increase > > > throughput stability and throughput itself for TCP connections with > > > default wmem_max values. > > > > > > Unfortunately, the 16 bits left as padding in the frame descriptors > > > > I assume you're referring to the 'pad' fields in tcp[46]_l2_buf_t, > > yes? > > Right, that. > > > For AVX2 we have substantially more space here. Couldn't we put > > a conn (or seq) pointer in here at the cost of a few bytes MSS for > > non-AVX2 and zero cost for AVX2 (which is probably the majority case)? > > Yes, true. On the other hand, having this parallel array only affects > readability I guess, whereas inserting pointers and lengths in > tcp[46]_l2_buf_t actually decreases the usable MSS (not just on > non-AVX2 x86, but also on other architectures). So I'd rather stick to > this. Yeah, I guess so. Actually.. I did just think of one other option. It avoids both any extra padding and a parallel array, but at the cost of additional work when frames are dropped. We could use that 16-bits of padding to store the TCP payload length. Then when we don't manage to send all our frames, we do another loop through and add up how many stream bytes we actually sent to update the seq pointer.Hmm, yes. It's slightly more memory efficient, but the complexity seems a bit overkill to me.Ah, right, that could work. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! 
http://www.ozlabs.org/~dgibson[snip]That tcp_l2_buf_fill_headers() calculates the sequence from conn->seq_to_tap plus a cumulative count from that table, instead of passing it from the caller.Not really sure what you're proposing there.> > > @@ -2282,14 +2310,15 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) > > > > > > /* Finally, queue to tap */ > > > plen = mss; > > > + seq = conn->seq_to_tap; > > > > This will only be correct if tcp_l2_data_buf_flush() is *always* > > called between tcp_data_from_sock() calls for the same socket. That > > should be true for the normal course of things. However, couldn't it > > happen that we get a normal socket EPOLLIN event for a particular > > connection - calling tcp_data_from_sock() - but in the same epoll() > > round we also get a tap ack for the same connection which causes > > another call to tcp_data_from_sock() (with the change from patch > > 2/5). IIRC those would both happen before the deferred handling and > > therefore the data_buf_flush(). > > Ah, yes, I actually wrote this before 2/5 and concluded it was okay :/ > but with that change, it's not. Unless we drop that change from 2/5. Even if we drop the change, it's a worryingly subtle constraint.Another option to avoid this...> > Not sure how to deal with that short of separate 'seq_queued' and > > 'seq_sent' counters in the connection structure, which is a bit > > unfortunate. > > I wonder how bad it is if we call tcp_l2_data_buf_flush() > unconditionally before calling tcp_data_from_sock() from > tcp_tap_handler(). But again, maybe this is not needed at all, we > should check that epoll detail from 2/5 first...other than this one, would be to use that external table to update sequence numbers *in the frames* as we send stuff out.
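Reduced to its core, the option sketched in this exchange could look like the following standalone sketch (illustrative names only, not passt code): derive each queued frame's sequence from the connection's base value plus the payload bytes already queued in the current batch, instead of snapshotting the base when the frame is built:

#include <stdint.h>
#include <stdio.h>

/* Sequence for the next frame: base plus everything already queued */
static uint32_t seq_for_next_frame(uint32_t seq_to_tap,
				   const uint16_t *queued_len, size_t queued)
{
	uint32_t seq = seq_to_tap;
	size_t i;

	for (i = 0; i < queued; i++)
		seq += queued_len[i];

	return seq;
}

int main(void)
{
	uint16_t queued_len[] = { 1280, 1280, 512 };
	uint32_t seq_to_tap = 1000;	/* stand-in for conn->seq_to_tap */

	/* The fourth frame in the batch starts after the three queued ones */
	printf("next frame seq: %u\n",
	       seq_for_next_frame(seq_to_tap, queued_len, 3));
	return 0;
}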
Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> --- passt.1 | 33 +++++++++++++++++++++++++++++++++ 1 file changed, 33 insertions(+) diff --git a/passt.1 b/passt.1 index 1ad4276..bcbe6fd 100644 --- a/passt.1 +++ b/passt.1 @@ -926,6 +926,39 @@ If the sending window cannot be queried, it will always be announced as the current sending buffer size to guest or target namespace. This might affect throughput of TCP connections. +.SS Tuning for high throughput + +On Linux, by default, the maximum memory that can be set for receive and send +socket buffers is 208 KiB. Those limits are set by the +\fI/proc/sys/net/core/rmem_max\fR and \fI/proc/sys/net/core/wmem_max\fR files, +see \fBsocket\fR(7). + +As of Linux 6.5, while the TCP implementation can dynamically shrink buffers +depending on utilisation even above those limits, such a small limit will +reflect on the advertised TCP window at the beginning of a connection, and the +buffer size of the UNIX domain socket buffer used by \fBpasst\fR cannot exceed +these limits anyway. + +Further, as of Linux 6.5, using socket options \fBSO_RCVBUF\fR and +\fBSO_SNDBUF\fR will prevent TCP buffers to expand above the \fIrmem_max\fR and +\fIwmem_max\fR limits because the automatic adjustment provided by the TCP +implementation is then disabled. + +As a consequence, \fBpasst\fR and \fBpasta\fR probe these limits at start-up and +will not set TCP socket buffer sizes if they are lower than 2 MiB, because this +would affect the maximum size of TCP buffers for the whole duration of a +connection. + +Note that 208 KiB is, accounting for kernel overhead, enough to fit less than +three TCP packets at the default MSS. In applications where high throughput is +expected, it is therefore advisable to increase those limits to at least 2 MiB, +or even 16 MiB: + +.nf + sysctl -w net.core.rmem_max=$((16 << 20) + sysctl -w net.core.wmem_max=$((16 << 20) +.fi + .SH LIMITATIONS Currently, IGMP/MLD proxying (RFC 4605) and support for SCTP (RFC 4960) are not -- 2.39.2
On Sat, Sep 23, 2023 at 12:06:10AM +0200, Stefano Brivio wrote:
> Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>
> ---
>  passt.1 | 33 +++++++++++++++++++++++++++++++++
>  1 file changed, 33 insertions(+)
>
> diff --git a/passt.1 b/passt.1
> index 1ad4276..bcbe6fd 100644
> --- a/passt.1
> +++ b/passt.1
> @@ -926,6 +926,39 @@ If the sending window cannot be queried, it will always be announced as the
>  current sending buffer size to guest or target namespace. This might affect
>  throughput of TCP connections.
>
> +.SS Tuning for high throughput
> +
> +On Linux, by default, the maximum memory that can be set for receive and send
> +socket buffers is 208 KiB. Those limits are set by the
> +\fI/proc/sys/net/core/rmem_max\fR and \fI/proc/sys/net/core/wmem_max\fR files,
> +see \fBsocket\fR(7).
> +
> +As of Linux 6.5, while the TCP implementation can dynamically shrink buffers
> +depending on utilisation even above those limits, such a small limit will

"shrink buffers" and "even above those limits" don't seem to quite work
together.

> +reflect on the advertised TCP window at the beginning of a connection, and the

Hmmm.... while [rw]mem_max might limit that initial window size, I
wouldn't expect increasing the limits alone to increase that initial
window size: wouldn't that instead be affected by the TCP default buffer
size i.e. the middle value in net.ipv4.tcp_rmem?

> +buffer size of the UNIX domain socket buffer used by \fBpasst\fR cannot exceed
> +these limits anyway.
> +
> +Further, as of Linux 6.5, using socket options \fBSO_RCVBUF\fR and
> +\fBSO_SNDBUF\fR will prevent TCP buffers to expand above the \fIrmem_max\fR and
> +\fIwmem_max\fR limits because the automatic adjustment provided by the TCP
> +implementation is then disabled.
> +
> +As a consequence, \fBpasst\fR and \fBpasta\fR probe these limits at start-up and
> +will not set TCP socket buffer sizes if they are lower than 2 MiB, because this
> +would affect the maximum size of TCP buffers for the whole duration of a
> +connection.
> +
> +Note that 208 KiB is, accounting for kernel overhead, enough to fit less than
> +three TCP packets at the default MSS. In applications where high throughput is
> +expected, it is therefore advisable to increase those limits to at least 2 MiB,
> +or even 16 MiB:
> +
> +.nf
> + sysctl -w net.core.rmem_max=$((16 << 20)
> + sysctl -w net.core.wmem_max=$((16 << 20)
> +.fi

As noted in a previous mail, empirically, this doesn't necessarily seem
to work better for me. I'm wondering if we'd be better off never touching
RCVBUF and SNDBUF for TCP sockets, and letting the kernel do its adaptive
thing. We probably still want to expand the buffers as much as we can for
the Unix socket, though. And we likely still want expanded limits for the
tests so that iperf3 can use large buffers.

> +
>  .SH LIMITATIONS
>
>  Currently, IGMP/MLD proxying (RFC 4605) and support for SCTP (RFC 4960) are not

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
On Mon, 25 Sep 2023 14:57:40 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:On Sat, Sep 23, 2023 at 12:06:10AM +0200, Stefano Brivio wrote:Oops. I guess I should simply s/shrink/grow/ here.Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> --- passt.1 | 33 +++++++++++++++++++++++++++++++++ 1 file changed, 33 insertions(+) diff --git a/passt.1 b/passt.1 index 1ad4276..bcbe6fd 100644 --- a/passt.1 +++ b/passt.1 @@ -926,6 +926,39 @@ If the sending window cannot be queried, it will always be announced as the current sending buffer size to guest or target namespace. This might affect throughput of TCP connections. +.SS Tuning for high throughput + +On Linux, by default, the maximum memory that can be set for receive and send +socket buffers is 208 KiB. Those limits are set by the +\fI/proc/sys/net/core/rmem_max\fR and \fI/proc/sys/net/core/wmem_max\fR files, +see \fBsocket\fR(7). + +As of Linux 6.5, while the TCP implementation can dynamically shrink buffers +depending on utilisation even above those limits, such a small limit will"shrink buffers" and "even above those limits" don't seem to quite work together.If we don't use SO_RCVBUF, yes... but we currently do, and with that, we can get a much larger initial window (as we do now). On the other hand, maybe, as mentioned in my follow-up about 3/5, we should drop SO_RCVBUF for TCP sockets.+reflect on the advertised TCP window at the beginning of a connection, and theHmmm.... while [rw]mem_max might limit that initial window size, I wouldn't expect increasing the limits alone to increase that initial window size: wouldn't that instead be affected by the TCP default buffer size i.e. the middle value in net.ipv4.tcp_rmem?Right. Let's keep this patch for a later time then, and meanwhile check if we should drop SO_RCVBUF, SO_SNDBUF, or both, for TCP sockets. -- Stefano+buffer size of the UNIX domain socket buffer used by \fBpasst\fR cannot exceed +these limits anyway. + +Further, as of Linux 6.5, using socket options \fBSO_RCVBUF\fR and +\fBSO_SNDBUF\fR will prevent TCP buffers to expand above the \fIrmem_max\fR and +\fIwmem_max\fR limits because the automatic adjustment provided by the TCP +implementation is then disabled. + +As a consequence, \fBpasst\fR and \fBpasta\fR probe these limits at start-up and +will not set TCP socket buffer sizes if they are lower than 2 MiB, because this +would affect the maximum size of TCP buffers for the whole duration of a +connection. + +Note that 208 KiB is, accounting for kernel overhead, enough to fit less than +three TCP packets at the default MSS. In applications where high throughput is +expected, it is therefore advisable to increase those limits to at least 2 MiB, +or even 16 MiB: + +.nf + sysctl -w net.core.rmem_max=$((16 << 20) + sysctl -w net.core.wmem_max=$((16 << 20) +.fiAs noted in a previous mail, empirically, this doesn't necessarily seem to work better for me. I'm wondering if we'd be better off never touching RCFBUF and SNDBUF for TCP sockets, and letting the kernel do its adaptive thing. We probably still want to expand the buffers as much as we can for the Unix socket, though. And we likely still want expanded limits for the tests so that iperf3 can use large buffers
On Wed, Sep 27, 2023 at 07:06:16PM +0200, Stefano Brivio wrote:On Mon, 25 Sep 2023 14:57:40 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Or "resize" would work too.On Sat, Sep 23, 2023 at 12:06:10AM +0200, Stefano Brivio wrote:Oops. I guess I should simply s/shrink/grow/ here.Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> --- passt.1 | 33 +++++++++++++++++++++++++++++++++ 1 file changed, 33 insertions(+) diff --git a/passt.1 b/passt.1 index 1ad4276..bcbe6fd 100644 --- a/passt.1 +++ b/passt.1 @@ -926,6 +926,39 @@ If the sending window cannot be queried, it will always be announced as the current sending buffer size to guest or target namespace. This might affect throughput of TCP connections. +.SS Tuning for high throughput + +On Linux, by default, the maximum memory that can be set for receive and send +socket buffers is 208 KiB. Those limits are set by the +\fI/proc/sys/net/core/rmem_max\fR and \fI/proc/sys/net/core/wmem_max\fR files, +see \fBsocket\fR(7). + +As of Linux 6.5, while the TCP implementation can dynamically shrink buffers +depending on utilisation even above those limits, such a small limit will"shrink buffers" and "even above those limits" don't seem to quite work together.Good point.If we don't use SO_RCVBUF, yes... but we currently do, and with that, we can get a much larger initial window (as we do now).+reflect on the advertised TCP window at the beginning of a connection, and theHmmm.... while [rw]mem_max might limit that initial window size, I wouldn't expect increasing the limits alone to increase that initial window size: wouldn't that instead be affected by the TCP default buffer size i.e. the middle value in net.ipv4.tcp_rmem?On the other hand, maybe, as mentioned in my follow-up about 3/5, we should drop SO_RCVBUF for TCP sockets.Ok.Makes sense to me. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibsonRight. Let's keep this patch for a later time then, and meanwhile check if we should drop SO_RCVBUF, SO_SNDBUF, or both, for TCP sockets.+buffer size of the UNIX domain socket buffer used by \fBpasst\fR cannot exceed +these limits anyway. + +Further, as of Linux 6.5, using socket options \fBSO_RCVBUF\fR and +\fBSO_SNDBUF\fR will prevent TCP buffers to expand above the \fIrmem_max\fR and +\fIwmem_max\fR limits because the automatic adjustment provided by the TCP +implementation is then disabled. + +As a consequence, \fBpasst\fR and \fBpasta\fR probe these limits at start-up and +will not set TCP socket buffer sizes if they are lower than 2 MiB, because this +would affect the maximum size of TCP buffers for the whole duration of a +connection. + +Note that 208 KiB is, accounting for kernel overhead, enough to fit less than +three TCP packets at the default MSS. In applications where high throughput is +expected, it is therefore advisable to increase those limits to at least 2 MiB, +or even 16 MiB: + +.nf + sysctl -w net.core.rmem_max=$((16 << 20) + sysctl -w net.core.wmem_max=$((16 << 20) +.fiAs noted in a previous mail, empirically, this doesn't necessarily seem to work better for me. I'm wondering if we'd be better off never touching RCFBUF and SNDBUF for TCP sockets, and letting the kernel do its adaptive thing. We probably still want to expand the buffers as much as we can for the Unix socket, though. And we likely still want expanded limits for the tests so that iperf3 can use large buffers
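The behaviour being weighed in this exchange can be seen with a tiny standalone program (a generic demo, not passt code): as soon as SO_RCVBUF is set explicitly, the request is capped by net.core.rmem_max (then doubled to account for kernel bookkeeping) and receive-buffer auto-tuning is switched off for that socket, which is the reason for considering not setting it at all for TCP sockets:

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

int main(void)
{
	int s = socket(AF_INET, SOCK_STREAM, 0);
	int v = 0;
	socklen_t l = sizeof(v);

	if (s < 0)
		return 1;

	getsockopt(s, SOL_SOCKET, SO_RCVBUF, &v, &l);
	printf("default SO_RCVBUF: %d (kernel may still grow this)\n", v);

	v = 16 << 20;	/* ask for 16 MiB */
	setsockopt(s, SOL_SOCKET, SO_RCVBUF, &v, sizeof(v));

	l = sizeof(v);
	getsockopt(s, SOL_SOCKET, SO_RCVBUF, &v, &l);
	printf("SO_RCVBUF now: %d (capped by rmem_max, then doubled; "
	       "auto-tuning is off from here on)\n", v);

	close(s);
	return 0;
}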
On Sat, Sep 23, 2023 at 12:06:05AM +0200, Stefano Brivio wrote:
> The fundamental patch here is 3/5, which is a workaround for a rather
> surprising kernel behaviour we seem to be hitting. This all comes from
> the investigation around https://bugs.passt.top/show_bug.cgi?id=74.
>
> I can't hit stalls anymore and throughput looks finally good to me
> (~3.5gbps with 208 KiB rmem_max and wmem_max), but... please test.

Write side issue, testing results

I replied with a bunch of test information already, but that was all
related to the specifically read-side issue: I used 16MiB wmem_max
throughout, but limited the read side buffer either with rmem_max or
SO_RCVBUF.

I've now done some tests looking specifically for write side issues. I
basically reversed the setup, with rmem_max set to 4MiB throughout, but
wmem_max limited to 256kiB.

With no patches applied, I easily get a stall, although the exact
details are a little different from the read-side stall: rather than
being consistently 0 there are a few small bursts of traffic on both
sides.

With 2/5 applied, there doesn't appear to be much difference in
behaviour.

With 3/5 applied, I can no longer reproduce stalls, but throughput
isn't very good.

With 4/5 applied, throughput seems to improve notably (from ~300Mbps to
~2.5Gbps, though, not surprisingly, it's variable from second to
second).

Tentative conclusions:

 * The primary cause of the stalls appears to be the kernel bug
   identified, where the window isn't properly recalculated after
   MSG_TRUNC. 3/5 appears to successfully work around that bug. I think
   getting that merged is our top priority.

 * 2/5 makes logical sense to me, but I don't see a lot of evidence of
   it changing the behaviour here much. I think we hold it back for
   now, polish it a bit, maybe reconsider it as part of a broader
   rethink of the STALLED flag.

 * 4/5 doesn't appear to be linked to the stalls per se, but does
   appear to generally improve behaviour with limited wmem_max. I think
   we can improve the implementation a bit, then look at merging as the
   second priority.

Open questions:

 * Even with the fixes, why does very large rmem_max seem to cause
   wildly variable and not great throughput?

 * Why does explicitly limiting RCVBUF usually, but not always, cause
   very poor throughput but without stalling?

 * Given the above oddities, is there any value to us setting RCVBUF
   for TCP sockets, rather than just letting the kernel adapt it?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson