[PATCH v7 0/3] Support for SO_PEEK_OFF
v7: Only patch #3 updated.

Jon Maloy (3):
  tcp: move seq_to_tap update to when frame is queued
  tcp: leverage support of SO_PEEK_OFF socket option when available
  tcp: allow retransmit when peer receive window is zero

 tcp.c      | 164 +++++++++++++++++++++++++++++++++++++++-------------
 tcp_conn.h |   2 +
 2 files changed, 123 insertions(+), 43 deletions(-)

-- 
2.45.0
commit a469fc393fa1 ("tcp, tap: Don't increase tap-side sequence counter for dropped frames")
delayed update of conn->seq_to_tap until the moment the corresponding
frame has been successfully pushed out. This has the advantage that we
can immediately make a new attempt to transmit a frame after a failed
transmit, rather than waiting for the peer to later discover a gap and
trigger the fast retransmit mechanism to solve the problem.
This approach has turned out to cause a problem with spurious sequence
number updates during peer-initiated retransmits, and we have realized
it may not be the best way to solve the above issue.
We now restore the previous method, by updating the said field at the
moment a frame is added to the outqueue. To retain the advantage of
having a quick re-attempt based on local failure detection, we now scan
through the part of the outqueue that had to be dropped, and restore the
sequence counter for each affected connection to the most appropriate
value.
Signed-off-by: Jon Maloy
On Fri, May 24, 2024 at 01:26:54PM -0400, Jon Maloy wrote:
commit a469fc393fa1 ("tcp, tap: Don't increase tap-side sequence counter for dropped frames") delayed update of conn->seq_to_tap until the moment the corresponding frame has been successfully pushed out. This has the advantage that we can immediately make a new attempt to transmit a frame after a failed transmit, rather than waiting for the peer to later discover a gap and trigger the fast retransmit mechanism to solve the problem.
This approach has turned out to cause a problem with spurious sequence number updates during peer-initiated retransmits, and we have realized it may not be the best way to solve the above issue.
We now restore the previous method, by updating the said field at the moment a frame is added to the outqueue. To retain the advantage of having a quick re-attempt based on local failure detection, we now scan through the part of the outqueue that had to be dropped, and restore the sequence counter for each affected connection to the most appropriate value.
Signed-off-by: Jon Maloy
This still has the issues I pointed out on the last revision... [snip]
+/**
+ * tcp_revert_seq() - Revert affected conn->seq_to_tap after failed transmission
+ * @conns:	Array of connection pointers corresponding to queued frames
+ * @frames:	Two-dimensional array containing queued frames with sub-iovs
+ * @num_frames:	Number of entries in the two arrays to be compared
+ */
+static void tcp_revert_seq(struct tcp_tap_conn **conns, struct iovec (*frames)[TCP_NUM_IOVS],
+			   int num_frames)
+{
+	int i;
+
+	for (i = 0; i < num_frames; i++) {
+		struct tcp_tap_conn *conn = conns[i];
+		struct tcphdr *th = frames[i][TCP_IOV_PAYLOAD].iov_base;
+		uint32_t seq = ntohl(th->seq);
+
+		if (SEQ_LE(conn->seq_to_tap, seq))
+			continue;
+
+		conn->seq_to_tap = seq;
...one trivial - this would be clearer without the continue - ...
+	}
+}
+
 /**
  * tcp_payload_flush() - Send out buffers for segments with data
  * @c:		Execution context
  */
 static void tcp_payload_flush(const struct ctx *c)
 {
-	unsigned i;
 	size_t m;
 
 	m = tap_send_frames(c, &tcp6_l2_iov[0][0], TCP_NUM_IOVS,
 			    tcp6_payload_used);
-	for (i = 0; i < m; i++)
-		*tcp6_seq_update[i].seq += tcp6_seq_update[i].len;
+	if (m != tcp6_payload_used) {
+		tcp_revert_seq(tcp6_frame_conns, &tcp6_l2_iov[m],
+			       tcp6_payload_used - m);
.. and one fatal - you're calling this with non-matching entries from
frame_conns[] and l2_iov[].

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
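The indexing problem flagged in the review can be illustrated with a minimal sketch. It assumes simplified stand-in types (`struct conn`, `struct frame`) and hypothetical helper names (`seq_lt()`, `revert_seq()`, `revert_unsent()`) rather than the real passt structures. The point it tries to show: if the first `m` frames went out, both arrays need to be advanced by `m`, so that entry `i` of one still describes entry `i` of the other. It also folds the `continue` into a single condition, as suggested for the trivial issue.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real connection and frame structures */
struct conn {
	uint32_t seq_to_tap;	/* next sequence number to send to the tap side */
};

struct frame {
	uint32_t seq;		/* sequence number carried by the queued frame */
};

/* Wrap-safe "a < b" for 32-bit sequence numbers */
static int seq_lt(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/* Roll each connection back to the lowest sequence number among its
 * unsent frames. conns[i] must be the owner of frames[i].
 */
static void revert_seq(struct conn **conns, const struct frame *frames,
		       size_t num_frames)
{
	size_t i;

	assert(conns && frames);

	for (i = 0; i < num_frames; i++) {
		if (seq_lt(frames[i].seq, conns[i]->seq_to_tap))
			conns[i]->seq_to_tap = frames[i].seq;
	}
}

/* After sending only the first m of `used` frames, revert the rest.
 * Both arrays are advanced by m so that their entries stay matched.
 */
static void revert_unsent(struct conn **conns, const struct frame *frames,
			  size_t used, size_t m)
{
	if (m < used)
		revert_seq(conns + m, frames + m, used - m);
}
```

With four queued frames alternating between two connections and only the first two sent, `revert_unsent(conns, frames, 4, 2)` resets each connection to the sequence number of its own unsent frame. Passing the connection array without the `+ m` offset would compare frames against the wrong connections.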
From linux-6.9.0 the kernel will contain
commit 05ea491641d3 ("tcp: add support for SO_PEEK_OFF socket option").
This new feature makes it possible to call recvmsg(MSG_PEEK) and make
it start reading data from a given offset set by the SO_PEEK_OFF socket
option. This way, we can avoid repeated reading of already read bytes of
a received message, hence saving read cycles when forwarding TCP
messages in the host->namespace direction.
In this commit, we add functionality to leverage this feature when
available, while we fall back to the previous behavior when not.
Measurements with iperf3 show that throughput increases by 15-20
percent in the host->namespace direction when this feature is used.
Signed-off-by: Jon Maloy
On Fri, May 24, 2024 at 01:26:55PM -0400, Jon Maloy wrote:
From linux-6.9.0 the kernel will contain commit 05ea491641d3 ("tcp: add support for SO_PEEK_OFF socket option").
This new feature makes it possible to call recvmsg(MSG_PEEK) and make it start reading data from a given offset set by the SO_PEEK_OFF socket option. This way, we can avoid repeated reading of already read bytes of a received message, hence saving read cycles when forwarding TCP messages in the host->namespace direction.
In this commit, we add functionality to leverage this feature when available, while we fall back to the previous behavior when not.
Measurements with iperf3 show that throughput increases by 15-20 percent in the host->namespace direction when this feature is used.
Signed-off-by: Jon Maloy
---
 tcp.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 51 insertions(+), 8 deletions(-)

diff --git a/tcp.c b/tcp.c
index 146ab8f..01898f1 100644
--- a/tcp.c
+++ b/tcp.c
@@ -509,6 +509,9 @@ static struct iovec	tcp6_l2_iov		[TCP_FRAMES_MEM][TCP_NUM_IOVS];
 static struct iovec	tcp4_l2_flags_iov	[TCP_FRAMES_MEM][TCP_NUM_IOVS];
 static struct iovec	tcp6_l2_flags_iov	[TCP_FRAMES_MEM][TCP_NUM_IOVS];
 
+/* Does the kernel support TCP_PEEK_OFF? */
+static bool peek_offset_cap;
+
 /* sendmsg() to socket */
 static struct iovec	tcp_iov			[UIO_MAXIOV];
 
@@ -524,6 +527,20 @@ static_assert(ARRAY_SIZE(tc_hash) >= FLOW_MAX,
 int init_sock_pool4		[TCP_SOCK_POOL_SIZE];
 int init_sock_pool6		[TCP_SOCK_POOL_SIZE];
 
+/**
+ * tcp_set_peek_offset() - Set SO_PEEK_OFF offset on a socket if supported
+ * @s:		Socket to update
+ * @offset:	Offset in bytes
+ */
+static void tcp_set_peek_offset(int s, int offset)
+{
+	if (!peek_offset_cap)
+		return;
+
+	if (setsockopt(s, SOL_SOCKET, SO_PEEK_OFF, &offset, sizeof(offset)))
+		err("Failed to set SO_PEEK_OFF to %i in socket %i", offset, s);
I feel like we need to reset the connection if we ever reach here.
This means that SO_PEEK_OFF is now out of sync and we apparently can't
fix it. If we keep the connection alive, we will inevitably send
incorrect data across it, which seems pretty bad. Or, maybe we think
this is unlikely enough we could just die().

Otherwise, LGTM.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
A bug in kernel TCP may lead to a deadlock where a zero window is sent
from the peer, while it is unable to send out window updates even after
reads have freed up enough buffer space to permit a larger window.
In this situation, new window advertisements from the peer can only be
triggered by data packets arriving from this side.
However, such packets are never sent, because the zero-window condition
currently prevents this side from sending out any packets whatsoever
to the peer.
We notice that the above bug is triggered *only* after the peer has
dropped an arriving packet because of severe memory squeeze, and that we
hence always enter a retransmission situation when this occurs. This
also means that the peer goes against the RFC 9293 recommendation that a
previously advertised window should never shrink.
RFC 9293 seems to permit sending up to the right edge of the last
advertised non-zero window in such cases, so that is what we do to
resolve this situation. However, we use the above mechanism only for
timer-induced retransmits, while the fast-retransmit mechanism won't
be affected by this change.
It should be noted that although this solves the problem we have at
hand, it is a work-around, and not a genuine solution to the described
kernel bug.
Signed-off-by: Jon Maloy
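The idea of sending up to the right edge of the last non-zero window could be sketched as follows. This is an interpretation of the commit message only, since the patch itself is not quoted here: `struct win_state`, its fields, and the two helpers are hypothetical names, not the actual passt code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-connection state, not the actual passt fields */
struct win_state {
	uint32_t seq_ack;	/* highest sequence number acked by the peer */
	uint32_t wnd;		/* window currently advertised by the peer */
	uint32_t max_edge;	/* right edge of the last non-zero window */
};

/* Record a window advertisement; remember the right edge if non-zero */
static void win_update(struct win_state *w, uint32_t ack, uint32_t wnd)
{
	assert(w);

	w->seq_ack = ack;
	w->wnd = wnd;
	if (wnd)
		w->max_edge = ack + wnd;
}

/* Number of bytes that may be sent starting at seq_ack.
 * Normally this is the advertised window. On a timer-induced retransmit
 * with a zero window, it is whatever remains up to the remembered edge.
 */
static uint32_t win_sendable(const struct win_state *w, bool timer_retrans)
{
	int32_t room;

	if (w->wnd || !timer_retrans)
		return w->wnd;

	room = (int32_t)(w->max_edge - w->seq_ack);
	return room > 0 ? (uint32_t)room : 0;
}
```

For example, after an advertisement of 4000 bytes at ack 1000 followed by a zero window at ack 2000, ordinary sends and fast retransmits see no room, while a timer-induced retransmit may still send up to 3000 bytes, which is enough to provoke a fresh window advertisement from the peer.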