On Fri, Sep 29, 2023 at 05:19:50PM +0200, Stefano Brivio wrote:
> On Thu, 28 Sep 2023 11:58:45 +1000
> David Gibson <david(a)gibson.dropbear.id.au> wrote:
> 
> > On Wed, Sep 27, 2023 at 07:06:03PM +0200, Stefano Brivio wrote:
> > > On Mon, 25 Sep 2023 14:47:52 +1000
> > > David Gibson <david(a)gibson.dropbear.id.au> wrote:
> > > 
> > > > On Sat, Sep 23, 2023 at 12:06:09AM +0200, Stefano Brivio wrote:
> > > > > ...so that we'll retry sending them, instead of more-or-less
> > > > > silently dropping them. This happens quite frequently if our
> > > > > sending buffer on the UNIX domain socket is heavily constrained
> > > > > (for instance, by the 208 KiB default memory limit).
> > > > >
> > > > > It might be argued that dropping frames is part of the expected
> > > > > TCP flow: we don't dequeue those from the socket anyway, so
> > > > > we'll eventually retransmit them.
> > > > >
> > > > > But we don't need the receiver to tell us (by way of duplicate
> > > > > or missing ACKs) that we couldn't send them: we already know,
> > > > > as sendmsg() reports that. This seems to considerably increase
> > > > > throughput stability and throughput itself for TCP connections
> > > > > with default wmem_max values.
> > > > >
> > > > > Unfortunately, the 16 bits left as padding in the frame
> > > > > descriptors
> > > > 
> > > > I assume you're referring to the 'pad' fields in
> > > > tcp[46]_l2_buf_t, yes?
> > > 
> > > Right, that.
> > > 
> > > > For AVX2 we have substantially more space here. Couldn't we put
> > > > a conn (or seq) pointer in here, at the cost of a few bytes of
> > > > MSS for non-AVX2 and zero cost for AVX2 (which is probably the
> > > > majority case)?
> > > 
> > > Yes, true. On the other hand, having this parallel array only
> > > affects readability, I guess, whereas inserting pointers and
> > > lengths in tcp[46]_l2_buf_t actually decreases the usable MSS (not
> > > just on non-AVX2 x86, but also on other architectures). So I'd
> > > rather stick to this.
> > 
> > Yeah, I guess so. Actually... I did just think of one other option.
> > It avoids both any extra padding and a parallel array, but at the
> > cost of additional work when frames are dropped. We could use those
> > 16 bits of padding to store the TCP payload length. Then, when we
> > don't manage to send all our frames, we do another loop through and
> > add up how many stream bytes we actually sent, to update the seq
> > pointer.
> 
> Hmm, yes. It's slightly more memory efficient, but the complexity
> seems a bit overkill to me.

More importantly, I forgot the fact that by the time we're sending the
frames, we don't know what connection they're associated with any more.
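Just to spell out the bookkeeping I had in mind, it would have been
something like the sketch below (untested, not against the actual tree;
the frame descriptor layout and names are made up for illustration).
The missing piece is exactly the frame-to-connection association that
the final update needs:

/* Untested sketch: the 16 bits of padding in each queued frame
 * descriptor hold the TCP payload length, stored at queue time.
 */
#include <stddef.h>
#include <stdint.h>

struct l2_frame_desc {
        uint16_t plen;          /* TCP payload length of this frame */
        /* ...headers and payload buffers... */
};

/* After sendmsg() reports that only 'sent' of the queued frames went
 * out, add up the stream bytes that actually made it, and advance the
 * tap-side sequence pointer by that much only; the data of the unsent
 * frames isn't dequeued from the socket, so it gets picked up again.
 *
 * The catch: the sequence pointer lives in the connection, and at this
 * point nothing tells us which connection each frame belongs to.
 */
static void seq_update_after_partial_send(uint32_t *seq_to_tap,
                                          const struct l2_frame_desc *desc,
                                          size_t sent)
{
        uint32_t sent_bytes = 0;
        size_t i;

        for (i = 0; i < sent; i++)
                sent_bytes += desc[i].plen;

        *seq_to_tap += sent_bytes;
}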
[snip]

> > > @@ -2282,14 +2310,15 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> > > 
> > >  	/* Finally, queue to tap */
> > >  	plen = mss;
> > > +	seq = conn->seq_to_tap;
> > 
> > This will only be correct if tcp_l2_data_buf_flush() is *always*
> > called between tcp_data_from_sock() calls for the same socket. That
> > should be true for the normal course of things. However, couldn't
> > it happen that we get a normal socket EPOLLIN event for a particular
> > connection - calling tcp_data_from_sock() - but in the same epoll()
> > round we also get a tap ACK for the same connection, which causes
> > another call to tcp_data_from_sock() (with the change from patch
> > 2/5)? IIRC those would both happen before the deferred handling,
> > and therefore before the data_buf_flush().
> 
> Ah, yes, I actually wrote this before 2/5 and concluded it was okay
> :/ but with that change, it's not. Unless we drop that change from
> 2/5.

Even if we drop the change, it's a worryingly subtle constraint.

> > Not sure how to deal with that short of separate 'seq_queued' and
> > 'seq_sent' counters in the connection structure, which is a bit
> > unfortunate.
> 
> I wonder how bad it is if we call tcp_l2_data_buf_flush()
> unconditionally before calling tcp_data_from_sock() from
> tcp_tap_handler(). But again, maybe this is not needed at all, we
> should check that epoll detail from 2/5 first...
> 
> Another option to avoid this, other than this one, would be to use
> that external table to update sequence numbers *in the frames* as we
> send stuff out.

Not really sure what you're proposing there.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
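P.S.: to be a bit more concrete about the 'seq_queued' / 'seq_sent'
idea, I'm thinking of something along the lines of the sketch below.
It's untested, the names and helpers are invented, and it stands in for
the single seq_to_tap counter we have today; the point is just that a
second tcp_data_from_sock() pass in the same epoll() round would start
from seq_queued, whether or not a flush happened in between:

#include <stdint.h>

/* Hypothetical split of the tap-side sequence accounting. */
struct conn_seq {
        uint32_t seq_queued;    /* sequence after the last frame queued */
        uint32_t seq_sent;      /* sequence after the last frame sent */
};

/* Queueing data for the tap only advances seq_queued, so a second pass
 * for the same connection (say, EPOLLIN plus a tap-side ACK in the same
 * epoll() round) picks the next sequence correctly even if no flush
 * happened in between.
 */
static void tap_data_queued(struct conn_seq *s, uint16_t plen)
{
        s->seq_queued += plen;
}

/* The flush advances seq_sent by what sendmsg() actually took; if some
 * frames were dropped, winding seq_queued back to seq_sent means the
 * next pass queues that data again, which is the retry behaviour this
 * patch is after.
 */
static void tap_frames_flushed(struct conn_seq *s, uint32_t sent_bytes)
{
        s->seq_sent += sent_bytes;
        s->seq_queued = s->seq_sent;
}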