This is a draft series culminating in greatly simplifying the handling of the epoll event mask for TCP sockets. Doing that requires two major preliminaries: 1) We also need to alter event mask handling for the tap interface, to track and report when it's ready to accept more data. We need this to "unstick" connections where we have pending data on the socket, but weren't able to forward it because we ran out of tap interface buffer space. 2) We need to alter handling of ack sequence numbers to the tap device. Without this, the event mask changes expose a fragility where depending on the precise order things are called we could update the ack pointer without actually sending an ack to the tap interface. This is not ready to go yet. For one thing it tanks some of the throughput numbers for reasons I haven't yet diagnosed. It could also certainly do with another set of eyes looking critically over the event logic changes. In addition, the fragility of the ack-to-tap handling remains. The series fixes the specific problem, but it remains non-obviously unsafe to call tcp_update_seqack_from_tap() if a packet isn't going to be immediately and unconditionally sent to the guest. I've been trying to figure out how to make that more robust without introducing additional TCP_INFO calls, but my brain's seizing up a bit at this point. Patches 1..4/10 are preliminary cleanups which should be safe, though. Feel free to apply as many of those ones as you're happy with. David Gibson (10): tcp: Make some extra functions private tcp: Clean up tcpi_snd_wnd probing tcp: Simplify ifdef logic in tcp_update_seqack_wnd() tcp: Make tcp_update_seqack_wnd()s force_seq parameter explicitly boolean tcp: On socket EPOLLOUT, send new ACK to tap immediately tap: Re-introduce EPOLLET for tap connections tap: Keep track of whether there might be space in the tap buffers tcp: Keep track of connections blocked due to a full tap interface tcp: Move deferred handling functions later in tcp.c tcp: Simplify epoll event mask management tap.c | 58 ++++++++--- tap.h | 1 + tcp.c | 265 ++++++++++++++++++++++++++++--------------------- tcp.h | 13 +-- tcp_buf.c | 15 +-- tcp_buf.h | 6 +- tcp_conn.h | 1 + tcp_internal.h | 6 +- 8 files changed, 222 insertions(+), 143 deletions(-) -- 2.46.0
tcp_send_flag() and tcp_probe_peek_offset_cap() are not used outside tcp.c, and have no prototype in a header. Make them static. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- tcp.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tcp.c b/tcp.c index f9fe1b9a..14b48a84 100644 --- a/tcp.c +++ b/tcp.c @@ -1235,7 +1235,7 @@ int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn, * * Return: negative error code on connection reset, 0 otherwise */ -int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags) +static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags) { return tcp_buf_send_flag(c, conn, flags); } @@ -2477,7 +2477,7 @@ static void tcp_sock_refill_init(const struct ctx *c) * * Return: true if supported, false otherwise */ -bool tcp_probe_peek_offset_cap(sa_family_t af) +static bool tcp_probe_peek_offset_cap(sa_family_t af) { bool ret = false; int s, optv = 0; -- 2.46.0
When available, we want to retrieve our socket peer's advertised window and forward that to the guest. That information has been available from the kernel via the TCP_INFO getsockopt() since kernel commit 8f7baad7f035. Currently our probing for this is a bit odd. The HAS_SND_WND define determines if our headers include the tcp_snd_wnd field, but that doesn't necessarily mean the running kernel supports it. Currently we start by assuming it's _not_ available, but mark it as available if we ever see a non-zero value in the field. This is a bit hit and miss in two ways: * Zero is perfectly possible window the peer could report, so we can get false negatives * We're reading TCP_INFO into a local variable, which might not be zero initialised, so if the kernel _doesn't_ write it it could have non-zero garbage, giving us false positives. We can use a more direct way of probing for this: getsockopt() reports the length of the information retreived. So, check whether that's long enough to include the field. This lets us probe the availability of the field once and for all during initialisation. That in turn allows ctx to become a const pointer to tcp_prepare_flags() which cascades through many other functions. We also move the flag for the probe result from the ctx structure to a global, to match peek_offset_cap. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- tcp.c | 93 ++++++++++++++++++++++++++++++++++++-------------- tcp.h | 13 +++---- tcp_buf.c | 10 +++--- tcp_buf.h | 6 ++-- tcp_internal.h | 4 +-- 5 files changed, 82 insertions(+), 44 deletions(-) diff --git a/tcp.c b/tcp.c index 14b48a84..cba3f3bd 100644 --- a/tcp.c +++ b/tcp.c @@ -308,11 +308,6 @@ /* MSS rounding: see SET_MSS() */ #define MSS_DEFAULT 536 #define WINDOW_DEFAULT 14600 /* RFC 6928 */ -#ifdef HAS_SND_WND -# define KERNEL_REPORTS_SND_WND(c) ((c)->tcp.kernel_snd_wnd) -#else -# define KERNEL_REPORTS_SND_WND(c) (0 && (c)) -#endif #define ACK_INTERVAL 10 /* ms */ #define SYN_TIMEOUT 10 /* s */ @@ -370,6 +365,14 @@ char tcp_buf_discard [MAX_WINDOW]; /* Does the kernel support TCP_PEEK_OFF? */ bool peek_offset_cap; +#ifdef HAS_SND_WND +/* Does the kernel report sending window in TCP_INFO (kernel commit + * 8f7baad7f035) + */ +bool snd_wnd_cap; +#else +#define snd_wnd_cap (false) +#endif /* sendmsg() to socket */ static struct iovec tcp_iov [UIO_MAXIOV]; @@ -1052,7 +1055,7 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn, } #endif /* !HAS_BYTES_ACKED */ - if (!KERNEL_REPORTS_SND_WND(c)) { + if (!snd_wnd_cap) { tcp_get_sndbuf(conn); new_wnd_to_tap = MIN(SNDBUF_GET(conn), MAX_WINDOW); conn->wnd_to_tap = MIN(new_wnd_to_tap >> conn->ws_to_tap, @@ -1136,7 +1139,7 @@ static void tcp_update_seqack_from_tap(const struct ctx *c, * 0 if there is no flag to send * 1 otherwise */ -int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn, +int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn, int flags, struct tcphdr *th, char *data, size_t *optlen) { @@ -1153,11 +1156,6 @@ int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn, return -ECONNRESET; } -#ifdef HAS_SND_WND - if (!c->tcp.kernel_snd_wnd && tinfo.tcpi_snd_wnd) - c->tcp.kernel_snd_wnd = 1; -#endif - if (!(conn->flags & LOCAL)) tcp_rtt_dst_check(conn, &tinfo); @@ -1235,7 +1233,8 @@ int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn, * * Return: negative error code on connection reset, 0 otherwise */ -static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags) +static int tcp_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, + int flags) { return tcp_buf_send_flag(c, conn, flags); } @@ -1245,7 +1244,7 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags) * @c: Execution context * @conn: Connection pointer */ -void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn) +void tcp_rst_do(const struct ctx *c, struct tcp_tap_conn *conn) { if (conn->events == CLOSED) return; @@ -1463,7 +1462,7 @@ static void tcp_bind_outbound(const struct ctx *c, * @optlen: Bytes in options: caller MUST ensure available length * @now: Current timestamp */ -static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, +static void tcp_conn_from_tap(const struct ctx *c, sa_family_t af, const void *saddr, const void *daddr, const struct tcphdr *th, const char *opts, size_t optlen, const struct timespec *now) @@ -1628,7 +1627,7 @@ static int tcp_sock_consume(const struct tcp_tap_conn *conn, uint32_t ack_seq) * * #syscalls recvmsg */ -static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) +static int tcp_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn) { return tcp_buf_data_from_sock(c, conn); } @@ -1644,8 +1643,8 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) * * Return: count of consumed packets */ -static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn, - const struct pool *p, int idx) +static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, + const struct pool *p, int idx) { int i, iov_i, ack = 0, fin = 0, retr = 0, keep = -1, partial_send = 0; uint16_t max_ack_seq_wnd = conn->wnd_from_tap; @@ -1842,7 +1841,8 @@ out: * @opts: Pointer to start of options * @optlen: Bytes in options: caller MUST ensure available length */ -static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn, +static void tcp_conn_from_sock_finish(const struct ctx *c, + struct tcp_tap_conn *conn, const struct tcphdr *th, const char *opts, size_t optlen) { @@ -1885,7 +1885,7 @@ static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn, * * Return: count of consumed packets */ -int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af, +int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, const void *saddr, const void *daddr, const struct pool *p, int idx, const struct timespec *now) { @@ -2023,7 +2023,7 @@ reset: * @c: Execution context * @conn: Connection pointer */ -static void tcp_connect_finish(struct ctx *c, struct tcp_tap_conn *conn) +static void tcp_connect_finish(const struct ctx *c, struct tcp_tap_conn *conn) { socklen_t sl; int so; @@ -2049,8 +2049,8 @@ static void tcp_connect_finish(struct ctx *c, struct tcp_tap_conn *conn) * @sa: Peer socket address (from accept()) * @now: Current timestamp */ -static void tcp_tap_conn_from_sock(struct ctx *c, union flow *flow, int s, - const struct timespec *now) +static void tcp_tap_conn_from_sock(const struct ctx *c, union flow *flow, + int s, const struct timespec *now) { struct tcp_tap_conn *conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp); uint64_t hash; @@ -2081,7 +2081,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c, union flow *flow, int s, * @ref: epoll reference of listening socket * @now: Current timestamp */ -void tcp_listen_handler(struct ctx *c, union epoll_ref ref, +void tcp_listen_handler(const struct ctx *c, union epoll_ref ref, const struct timespec *now) { const struct flowside *ini; @@ -2146,7 +2146,7 @@ cancel: * * #syscalls timerfd_gettime arm:timerfd_gettime64 i686:timerfd_gettime64 */ -void tcp_timer_handler(struct ctx *c, union epoll_ref ref) +void tcp_timer_handler(const struct ctx *c, union epoll_ref ref) { struct itimerspec check_armed = { { 0 }, { 0 } }; struct tcp_tap_conn *conn = &FLOW(ref.flow)->tcp; @@ -2210,7 +2210,8 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref) * @ref: epoll reference * @events: epoll events bitmap */ -void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events) +void tcp_sock_handler(const struct ctx *c, union epoll_ref ref, + uint32_t events) { struct tcp_tap_conn *conn = conn_at_sidx(ref.flowside); @@ -2494,6 +2495,40 @@ static bool tcp_probe_peek_offset_cap(sa_family_t af) return ret; } +#ifdef HAS_SND_WND +/** + * tcp_probe_snd_wnd_cap() - Check if TCP_INFO reports tcpi_snd_wnd + * + * Return: true if supported, false otherwise + */ +static bool tcp_probe_snd_wnd_cap(void) +{ + struct tcp_info tinfo; + socklen_t sl = sizeof(tinfo); + int s; + + s = socket(AF_INET, SOCK_STREAM | SOCK_CLOEXEC, IPPROTO_TCP); + if (s < 0) { + warn_perror("Temporary TCP socket creation failed"); + return false; + } + + if (getsockopt(s, SOL_TCP, TCP_INFO, &tinfo, &sl)) { + warn_perror("Failed to get TCP_INFO on temporary socket"); + close(s); + return false; + } + + close(s); + + if (sl < (offsetof(struct tcp_info, tcpi_snd_wnd) + + sizeof(tinfo.tcpi_snd_wnd))) + return false; + + return true; +} +#endif /* HAS_SND_WND */ + /** * tcp_init() - Get initial sequence, hash secret, initialise per-socket data * @c: Execution context @@ -2527,6 +2562,12 @@ int tcp_init(struct ctx *c) (!c->ifi6 || tcp_probe_peek_offset_cap(AF_INET6)); debug("SO_PEEK_OFF%ssupported", peek_offset_cap ? " " : " not "); +#ifdef HAS_SND_WND + snd_wnd_cap = tcp_probe_snd_wnd_cap(); +#endif + debug("TCP_INFO tcpi_snd_wnd field%ssupported", + snd_wnd_cap ? " " : " not "); + return 0; } diff --git a/tcp.h b/tcp.h index e9ff0191..5585924f 100644 --- a/tcp.h +++ b/tcp.h @@ -10,11 +10,12 @@ struct ctx; -void tcp_timer_handler(struct ctx *c, union epoll_ref ref); -void tcp_listen_handler(struct ctx *c, union epoll_ref ref, +void tcp_timer_handler(const struct ctx *c, union epoll_ref ref); +void tcp_listen_handler(const struct ctx *c, union epoll_ref ref, const struct timespec *now); -void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events); -int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af, +void tcp_sock_handler(const struct ctx *c, union epoll_ref ref, + uint32_t events); +int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, const void *saddr, const void *daddr, const struct pool *p, int idx, const struct timespec *now); int tcp_sock_init(const struct ctx *c, sa_family_t af, const void *addr, @@ -58,16 +59,12 @@ union tcp_listen_epoll_ref { * @fwd_in: Port forwarding configuration for inbound packets * @fwd_out: Port forwarding configuration for outbound packets * @timer_run: Timestamp of most recent timer run - * @kernel_snd_wnd: Kernel reports sending window (with commit 8f7baad7f035) * @pipe_size: Size of pipes for spliced connections */ struct tcp_ctx { struct fwd_ports fwd_in; struct fwd_ports fwd_out; struct timespec timer_run; -#ifdef HAS_SND_WND - int kernel_snd_wnd; -#endif size_t pipe_size; }; diff --git a/tcp_buf.c b/tcp_buf.c index 1a398461..c886c92b 100644 --- a/tcp_buf.c +++ b/tcp_buf.c @@ -239,7 +239,7 @@ void tcp_flags_flush(const struct ctx *c) * @frames: Two-dimensional array containing queued frames with sub-iovs * @num_frames: Number of entries in the two arrays to be compared */ -static void tcp_revert_seq(struct ctx *c, struct tcp_tap_conn **conns, +static void tcp_revert_seq(const struct ctx *c, struct tcp_tap_conn **conns, struct iovec (*frames)[TCP_NUM_IOVS], int num_frames) { int i; @@ -264,7 +264,7 @@ static void tcp_revert_seq(struct ctx *c, struct tcp_tap_conn **conns, * tcp_payload_flush() - Send out buffers for segments with data * @c: Execution context */ -void tcp_payload_flush(struct ctx *c) +void tcp_payload_flush(const struct ctx *c) { size_t m; @@ -293,7 +293,7 @@ void tcp_payload_flush(struct ctx *c) * * Return: negative error code on connection reset, 0 otherwise */ -int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags) +int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags) { struct tcp_flags_t *payload; struct iovec *iov; @@ -361,7 +361,7 @@ int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags) * @no_csum: Don't compute IPv4 checksum, use the one from previous buffer * @seq: Sequence number to be sent */ -static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn, +static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn, ssize_t dlen, int no_csum, uint32_t seq) { struct iovec *iov; @@ -405,7 +405,7 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn, * * #syscalls recvmsg */ -int tcp_buf_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) +int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn) { uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap; int fill_bufs, send_bufs = 0, last_len, iov_rem = 0; diff --git a/tcp_buf.h b/tcp_buf.h index 3db4c56e..8d4b615a 100644 --- a/tcp_buf.h +++ b/tcp_buf.h @@ -9,8 +9,8 @@ void tcp_sock4_iov_init(const struct ctx *c); void tcp_sock6_iov_init(const struct ctx *c); void tcp_flags_flush(const struct ctx *c); -void tcp_payload_flush(struct ctx *c); -int tcp_buf_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn); -int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags); +void tcp_payload_flush(const struct ctx *c); +int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn); +int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags); #endif /*TCP_BUF_H */ diff --git a/tcp_internal.h b/tcp_internal.h index aa8bb64f..bd634be1 100644 --- a/tcp_internal.h +++ b/tcp_internal.h @@ -82,7 +82,7 @@ void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn, conn_event_do(c, conn, event); \ } while (0) -void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn); +void tcp_rst_do(const struct ctx *c, struct tcp_tap_conn *conn); #define tcp_rst(c, conn) \ do { \ flow_dbg((conn), "TCP reset at %s:%i", __func__, __LINE__); \ @@ -94,7 +94,7 @@ size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn, const uint16_t *check, uint32_t seq); int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn, int force_seq, struct tcp_info *tinfo); -int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn, int flags, +int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn, int flags, struct tcphdr *th, char *data, size_t *optlen); #endif /* TCP_INTERNAL_H */ -- 2.46.0
On Fri, 13 Sep 2024 14:32:06 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:When available, we want to retrieve our socket peer's advertised window and forward that to the guest. That information has been available from the kernel via the TCP_INFO getsockopt() since kernel commit 8f7baad7f035. Currently our probing for this is a bit odd. The HAS_SND_WND define determines if our headers include the tcp_snd_wnd field, but that doesn't necessarily mean the running kernel supports it. Currently we start by assuming it's _not_ available, but mark it as available if we ever see a non-zero value in the field. This is a bit hit and miss in two ways: * Zero is perfectly possible window the peer could report, so we can get false negativesKind of: one non-zero result was enough to set tcp.kernel_snd_wnd to one. The reason why I implemented it that way was to account for possible kernel backports of that option. On the other hand, any kernel backport would need to preserve the position of tcpi_snd_wnd, regardless of whether preceding fields are missing, so checking the size as you're doing also looks robust, and avoids these two issues altogether.* We're reading TCP_INFO into a local variable, which might not be zero initialised, so if the kernel _doesn't_ write it it could have non-zero garbage, giving us false positives. We can use a more direct way of probing for this: getsockopt() reports the length of the information retreived. So, check whether that's long enough to include the field. This lets us probe the availability of the field once and for all during initialisation. That in turn allows ctx to become a const pointer to tcp_prepare_flags() which cascades through many other functions. We also move the flag for the probe result from the ctx structure to a global, to match peek_offset_cap. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- tcp.c | 93 ++++++++++++++++++++++++++++++++++++-------------- tcp.h | 13 +++---- tcp_buf.c | 10 +++--- tcp_buf.h | 6 ++-- tcp_internal.h | 4 +-- 5 files changed, 82 insertions(+), 44 deletions(-) diff --git a/tcp.c b/tcp.c index 14b48a84..cba3f3bd 100644 --- a/tcp.c +++ b/tcp.c @@ -308,11 +308,6 @@ /* MSS rounding: see SET_MSS() */ #define MSS_DEFAULT 536 #define WINDOW_DEFAULT 14600 /* RFC 6928 */ -#ifdef HAS_SND_WND -# define KERNEL_REPORTS_SND_WND(c) ((c)->tcp.kernel_snd_wnd) -#else -# define KERNEL_REPORTS_SND_WND(c) (0 && (c)) -#endif #define ACK_INTERVAL 10 /* ms */ #define SYN_TIMEOUT 10 /* s */ @@ -370,6 +365,14 @@ char tcp_buf_discard [MAX_WINDOW]; /* Does the kernel support TCP_PEEK_OFF? */ bool peek_offset_cap; +#ifdef HAS_SND_WND +/* Does the kernel report sending window in TCP_INFO (kernel commit + * 8f7baad7f035) + */ +bool snd_wnd_cap; +#else +#define snd_wnd_cap (false) +#endif /* sendmsg() to socket */ static struct iovec tcp_iov [UIO_MAXIOV]; @@ -1052,7 +1055,7 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn, } #endif /* !HAS_BYTES_ACKED */ - if (!KERNEL_REPORTS_SND_WND(c)) { + if (!snd_wnd_cap) { tcp_get_sndbuf(conn); new_wnd_to_tap = MIN(SNDBUF_GET(conn), MAX_WINDOW); conn->wnd_to_tap = MIN(new_wnd_to_tap >> conn->ws_to_tap, @@ -1136,7 +1139,7 @@ static void tcp_update_seqack_from_tap(const struct ctx *c, * 0 if there is no flag to send * 1 otherwise */ -int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn, +int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn, int flags, struct tcphdr *th, char *data, size_t *optlen) { @@ -1153,11 +1156,6 @@ int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn, return -ECONNRESET; } -#ifdef HAS_SND_WND - if (!c->tcp.kernel_snd_wnd && tinfo.tcpi_snd_wnd) - c->tcp.kernel_snd_wnd = 1; -#endif - if (!(conn->flags & LOCAL)) tcp_rtt_dst_check(conn, &tinfo); @@ -1235,7 +1233,8 @@ int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn, * * Return: negative error code on connection reset, 0 otherwise */ -static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags) +static int tcp_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, + int flags) { return tcp_buf_send_flag(c, conn, flags); } @@ -1245,7 +1244,7 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags) * @c: Execution context * @conn: Connection pointer */ -void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn) +void tcp_rst_do(const struct ctx *c, struct tcp_tap_conn *conn) { if (conn->events == CLOSED) return; @@ -1463,7 +1462,7 @@ static void tcp_bind_outbound(const struct ctx *c, * @optlen: Bytes in options: caller MUST ensure available length * @now: Current timestamp */ -static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, +static void tcp_conn_from_tap(const struct ctx *c, sa_family_t af, const void *saddr, const void *daddr, const struct tcphdr *th, const char *opts, size_t optlen, const struct timespec *now) @@ -1628,7 +1627,7 @@ static int tcp_sock_consume(const struct tcp_tap_conn *conn, uint32_t ack_seq) * * #syscalls recvmsg */ -static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) +static int tcp_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn) { return tcp_buf_data_from_sock(c, conn); } @@ -1644,8 +1643,8 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) * * Return: count of consumed packets */ -static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn, - const struct pool *p, int idx) +static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, + const struct pool *p, int idx) { int i, iov_i, ack = 0, fin = 0, retr = 0, keep = -1, partial_send = 0; uint16_t max_ack_seq_wnd = conn->wnd_from_tap; @@ -1842,7 +1841,8 @@ out: * @opts: Pointer to start of options * @optlen: Bytes in options: caller MUST ensure available length */ -static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn, +static void tcp_conn_from_sock_finish(const struct ctx *c, + struct tcp_tap_conn *conn, const struct tcphdr *th, const char *opts, size_t optlen) { @@ -1885,7 +1885,7 @@ static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn, * * Return: count of consumed packets */ -int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af, +int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, const void *saddr, const void *daddr, const struct pool *p, int idx, const struct timespec *now) { @@ -2023,7 +2023,7 @@ reset: * @c: Execution context * @conn: Connection pointer */ -static void tcp_connect_finish(struct ctx *c, struct tcp_tap_conn *conn) +static void tcp_connect_finish(const struct ctx *c, struct tcp_tap_conn *conn) { socklen_t sl; int so; @@ -2049,8 +2049,8 @@ static void tcp_connect_finish(struct ctx *c, struct tcp_tap_conn *conn) * @sa: Peer socket address (from accept()) * @now: Current timestamp */ -static void tcp_tap_conn_from_sock(struct ctx *c, union flow *flow, int s, - const struct timespec *now) +static void tcp_tap_conn_from_sock(const struct ctx *c, union flow *flow, + int s, const struct timespec *now) { struct tcp_tap_conn *conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp); uint64_t hash; @@ -2081,7 +2081,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c, union flow *flow, int s, * @ref: epoll reference of listening socket * @now: Current timestamp */ -void tcp_listen_handler(struct ctx *c, union epoll_ref ref, +void tcp_listen_handler(const struct ctx *c, union epoll_ref ref, const struct timespec *now) { const struct flowside *ini; @@ -2146,7 +2146,7 @@ cancel: * * #syscalls timerfd_gettime arm:timerfd_gettime64 i686:timerfd_gettime64 */ -void tcp_timer_handler(struct ctx *c, union epoll_ref ref) +void tcp_timer_handler(const struct ctx *c, union epoll_ref ref) { struct itimerspec check_armed = { { 0 }, { 0 } }; struct tcp_tap_conn *conn = &FLOW(ref.flow)->tcp; @@ -2210,7 +2210,8 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref) * @ref: epoll reference * @events: epoll events bitmap */ -void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events) +void tcp_sock_handler(const struct ctx *c, union epoll_ref ref, + uint32_t events) { struct tcp_tap_conn *conn = conn_at_sidx(ref.flowside); @@ -2494,6 +2495,40 @@ static bool tcp_probe_peek_offset_cap(sa_family_t af) return ret; } +#ifdef HAS_SND_WND +/** + * tcp_probe_snd_wnd_cap() - Check if TCP_INFO reports tcpi_snd_wnd + * + * Return: true if supported, false otherwise + */ +static bool tcp_probe_snd_wnd_cap(void) +{ + struct tcp_info tinfo; + socklen_t sl = sizeof(tinfo); + int s; + + s = socket(AF_INET, SOCK_STREAM | SOCK_CLOEXEC, IPPROTO_TCP); + if (s < 0) { + warn_perror("Temporary TCP socket creation failed"); + return false; + } + + if (getsockopt(s, SOL_TCP, TCP_INFO, &tinfo, &sl)) { + warn_perror("Failed to get TCP_INFO on temporary socket"); + close(s); + return false; + } + + close(s); + + if (sl < (offsetof(struct tcp_info, tcpi_snd_wnd) + + sizeof(tinfo.tcpi_snd_wnd))) + return false; + + return true; +} +#endif /* HAS_SND_WND */ + /** * tcp_init() - Get initial sequence, hash secret, initialise per-socket data * @c: Execution context @@ -2527,6 +2562,12 @@ int tcp_init(struct ctx *c) (!c->ifi6 || tcp_probe_peek_offset_cap(AF_INET6)); debug("SO_PEEK_OFF%ssupported", peek_offset_cap ? " " : " not "); +#ifdef HAS_SND_WND + snd_wnd_cap = tcp_probe_snd_wnd_cap(); +#endif + debug("TCP_INFO tcpi_snd_wnd field%ssupported", + snd_wnd_cap ? " " : " not "); + return 0; } diff --git a/tcp.h b/tcp.h index e9ff0191..5585924f 100644 --- a/tcp.h +++ b/tcp.h @@ -10,11 +10,12 @@ struct ctx; -void tcp_timer_handler(struct ctx *c, union epoll_ref ref); -void tcp_listen_handler(struct ctx *c, union epoll_ref ref, +void tcp_timer_handler(const struct ctx *c, union epoll_ref ref); +void tcp_listen_handler(const struct ctx *c, union epoll_ref ref, const struct timespec *now); -void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events); -int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af, +void tcp_sock_handler(const struct ctx *c, union epoll_ref ref, + uint32_t events); +int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, const void *saddr, const void *daddr, const struct pool *p, int idx, const struct timespec *now); int tcp_sock_init(const struct ctx *c, sa_family_t af, const void *addr, @@ -58,16 +59,12 @@ union tcp_listen_epoll_ref { * @fwd_in: Port forwarding configuration for inbound packets * @fwd_out: Port forwarding configuration for outbound packets * @timer_run: Timestamp of most recent timer run - * @kernel_snd_wnd: Kernel reports sending window (with commit 8f7baad7f035) * @pipe_size: Size of pipes for spliced connections */ struct tcp_ctx { struct fwd_ports fwd_in; struct fwd_ports fwd_out; struct timespec timer_run; -#ifdef HAS_SND_WND - int kernel_snd_wnd; -#endif size_t pipe_size; }; diff --git a/tcp_buf.c b/tcp_buf.c index 1a398461..c886c92b 100644 --- a/tcp_buf.c +++ b/tcp_buf.c @@ -239,7 +239,7 @@ void tcp_flags_flush(const struct ctx *c) * @frames: Two-dimensional array containing queued frames with sub-iovs * @num_frames: Number of entries in the two arrays to be compared */ -static void tcp_revert_seq(struct ctx *c, struct tcp_tap_conn **conns, +static void tcp_revert_seq(const struct ctx *c, struct tcp_tap_conn **conns, struct iovec (*frames)[TCP_NUM_IOVS], int num_frames) { int i; @@ -264,7 +264,7 @@ static void tcp_revert_seq(struct ctx *c, struct tcp_tap_conn **conns, * tcp_payload_flush() - Send out buffers for segments with data * @c: Execution context */ -void tcp_payload_flush(struct ctx *c) +void tcp_payload_flush(const struct ctx *c) { size_t m; @@ -293,7 +293,7 @@ void tcp_payload_flush(struct ctx *c) * * Return: negative error code on connection reset, 0 otherwise */ -int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags) +int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags) { struct tcp_flags_t *payload; struct iovec *iov; @@ -361,7 +361,7 @@ int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags) * @no_csum: Don't compute IPv4 checksum, use the one from previous buffer * @seq: Sequence number to be sent */ -static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn, +static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn, ssize_t dlen, int no_csum, uint32_t seq) { struct iovec *iov; @@ -405,7 +405,7 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn, * * #syscalls recvmsg */ -int tcp_buf_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) +int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn) { uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap; int fill_bufs, send_bufs = 0, last_len, iov_rem = 0; diff --git a/tcp_buf.h b/tcp_buf.h index 3db4c56e..8d4b615a 100644 --- a/tcp_buf.h +++ b/tcp_buf.h @@ -9,8 +9,8 @@ void tcp_sock4_iov_init(const struct ctx *c); void tcp_sock6_iov_init(const struct ctx *c); void tcp_flags_flush(const struct ctx *c); -void tcp_payload_flush(struct ctx *c); -int tcp_buf_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn); -int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags); +void tcp_payload_flush(const struct ctx *c); +int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn); +int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags); #endif /*TCP_BUF_H */ diff --git a/tcp_internal.h b/tcp_internal.h index aa8bb64f..bd634be1 100644 --- a/tcp_internal.h +++ b/tcp_internal.h @@ -82,7 +82,7 @@ void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn, conn_event_do(c, conn, event); \ } while (0) -void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn); +void tcp_rst_do(const struct ctx *c, struct tcp_tap_conn *conn); #define tcp_rst(c, conn) \ do { \ flow_dbg((conn), "TCP reset at %s:%i", __func__, __LINE__); \ @@ -94,7 +94,7 @@ size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn, const uint16_t *check, uint32_t seq); int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn, int force_seq, struct tcp_info *tinfo); -int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn, int flags, +int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn, int flags, struct tcphdr *th, char *data, size_t *optlen); #endif /* TCP_INTERNAL_H */-- Stefano
On Tue, Sep 17, 2024 at 11:54:28PM +0200, Stefano Brivio wrote:On Fri, 13 Sep 2024 14:32:06 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Right, but that in turn means it could change at some later point rather than being correct from the beginning, which (slightly) complicates everything.When available, we want to retrieve our socket peer's advertised window and forward that to the guest. That information has been available from the kernel via the TCP_INFO getsockopt() since kernel commit 8f7baad7f035. Currently our probing for this is a bit odd. The HAS_SND_WND define determines if our headers include the tcp_snd_wnd field, but that doesn't necessarily mean the running kernel supports it. Currently we start by assuming it's _not_ available, but mark it as available if we ever see a non-zero value in the field. This is a bit hit and miss in two ways: * Zero is perfectly possible window the peer could report, so we can get false negativesKind of: one non-zero result was enough to set tcp.kernel_snd_wnd to one.The reason why I implemented it that way was to account for possible kernel backports of that option. On the other hand, any kernel backport would need to preserve the position of tcpi_snd_wnd, regardless of whether preceding fields are missing, so checking the size as you're doing also looks robust, and avoids these two issues altogether.Right. My understanding is that the ability to check the returned size is exactly why adding more fields to a getsockopt() is considered a backwards compatible change.-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson* We're reading TCP_INFO into a local variable, which might not be zero initialised, so if the kernel _doesn't_ write it it could have non-zero garbage, giving us false positives. We can use a more direct way of probing for this: getsockopt() reports the length of the information retreived. So, check whether that's long enough to include the field. This lets us probe the availability of the field once and for all during initialisation. That in turn allows ctx to become a const pointer to tcp_prepare_flags() which cascades through many other functions. We also move the flag for the probe result from the ctx structure to a global, to match peek_offset_cap. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- tcp.c | 93 ++++++++++++++++++++++++++++++++++++-------------- tcp.h | 13 +++---- tcp_buf.c | 10 +++--- tcp_buf.h | 6 ++-- tcp_internal.h | 4 +-- 5 files changed, 82 insertions(+), 44 deletions(-) diff --git a/tcp.c b/tcp.c index 14b48a84..cba3f3bd 100644 --- a/tcp.c +++ b/tcp.c @@ -308,11 +308,6 @@ /* MSS rounding: see SET_MSS() */ #define MSS_DEFAULT 536 #define WINDOW_DEFAULT 14600 /* RFC 6928 */ -#ifdef HAS_SND_WND -# define KERNEL_REPORTS_SND_WND(c) ((c)->tcp.kernel_snd_wnd) -#else -# define KERNEL_REPORTS_SND_WND(c) (0 && (c)) -#endif #define ACK_INTERVAL 10 /* ms */ #define SYN_TIMEOUT 10 /* s */ @@ -370,6 +365,14 @@ char tcp_buf_discard [MAX_WINDOW]; /* Does the kernel support TCP_PEEK_OFF? */ bool peek_offset_cap; +#ifdef HAS_SND_WND +/* Does the kernel report sending window in TCP_INFO (kernel commit + * 8f7baad7f035) + */ +bool snd_wnd_cap; +#else +#define snd_wnd_cap (false) +#endif /* sendmsg() to socket */ static struct iovec tcp_iov [UIO_MAXIOV]; @@ -1052,7 +1055,7 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn, } #endif /* !HAS_BYTES_ACKED */ - if (!KERNEL_REPORTS_SND_WND(c)) { + if (!snd_wnd_cap) { tcp_get_sndbuf(conn); new_wnd_to_tap = MIN(SNDBUF_GET(conn), MAX_WINDOW); conn->wnd_to_tap = MIN(new_wnd_to_tap >> conn->ws_to_tap, @@ -1136,7 +1139,7 @@ static void tcp_update_seqack_from_tap(const struct ctx *c, * 0 if there is no flag to send * 1 otherwise */ -int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn, +int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn, int flags, struct tcphdr *th, char *data, size_t *optlen) { @@ -1153,11 +1156,6 @@ int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn, return -ECONNRESET; } -#ifdef HAS_SND_WND - if (!c->tcp.kernel_snd_wnd && tinfo.tcpi_snd_wnd) - c->tcp.kernel_snd_wnd = 1; -#endif - if (!(conn->flags & LOCAL)) tcp_rtt_dst_check(conn, &tinfo); @@ -1235,7 +1233,8 @@ int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn, * * Return: negative error code on connection reset, 0 otherwise */ -static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags) +static int tcp_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, + int flags) { return tcp_buf_send_flag(c, conn, flags); } @@ -1245,7 +1244,7 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags) * @c: Execution context * @conn: Connection pointer */ -void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn) +void tcp_rst_do(const struct ctx *c, struct tcp_tap_conn *conn) { if (conn->events == CLOSED) return; @@ -1463,7 +1462,7 @@ static void tcp_bind_outbound(const struct ctx *c, * @optlen: Bytes in options: caller MUST ensure available length * @now: Current timestamp */ -static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, +static void tcp_conn_from_tap(const struct ctx *c, sa_family_t af, const void *saddr, const void *daddr, const struct tcphdr *th, const char *opts, size_t optlen, const struct timespec *now) @@ -1628,7 +1627,7 @@ static int tcp_sock_consume(const struct tcp_tap_conn *conn, uint32_t ack_seq) * * #syscalls recvmsg */ -static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) +static int tcp_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn) { return tcp_buf_data_from_sock(c, conn); } @@ -1644,8 +1643,8 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) * * Return: count of consumed packets */ -static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn, - const struct pool *p, int idx) +static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, + const struct pool *p, int idx) { int i, iov_i, ack = 0, fin = 0, retr = 0, keep = -1, partial_send = 0; uint16_t max_ack_seq_wnd = conn->wnd_from_tap; @@ -1842,7 +1841,8 @@ out: * @opts: Pointer to start of options * @optlen: Bytes in options: caller MUST ensure available length */ -static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn, +static void tcp_conn_from_sock_finish(const struct ctx *c, + struct tcp_tap_conn *conn, const struct tcphdr *th, const char *opts, size_t optlen) { @@ -1885,7 +1885,7 @@ static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn, * * Return: count of consumed packets */ -int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af, +int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, const void *saddr, const void *daddr, const struct pool *p, int idx, const struct timespec *now) { @@ -2023,7 +2023,7 @@ reset: * @c: Execution context * @conn: Connection pointer */ -static void tcp_connect_finish(struct ctx *c, struct tcp_tap_conn *conn) +static void tcp_connect_finish(const struct ctx *c, struct tcp_tap_conn *conn) { socklen_t sl; int so; @@ -2049,8 +2049,8 @@ static void tcp_connect_finish(struct ctx *c, struct tcp_tap_conn *conn) * @sa: Peer socket address (from accept()) * @now: Current timestamp */ -static void tcp_tap_conn_from_sock(struct ctx *c, union flow *flow, int s, - const struct timespec *now) +static void tcp_tap_conn_from_sock(const struct ctx *c, union flow *flow, + int s, const struct timespec *now) { struct tcp_tap_conn *conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp); uint64_t hash; @@ -2081,7 +2081,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c, union flow *flow, int s, * @ref: epoll reference of listening socket * @now: Current timestamp */ -void tcp_listen_handler(struct ctx *c, union epoll_ref ref, +void tcp_listen_handler(const struct ctx *c, union epoll_ref ref, const struct timespec *now) { const struct flowside *ini; @@ -2146,7 +2146,7 @@ cancel: * * #syscalls timerfd_gettime arm:timerfd_gettime64 i686:timerfd_gettime64 */ -void tcp_timer_handler(struct ctx *c, union epoll_ref ref) +void tcp_timer_handler(const struct ctx *c, union epoll_ref ref) { struct itimerspec check_armed = { { 0 }, { 0 } }; struct tcp_tap_conn *conn = &FLOW(ref.flow)->tcp; @@ -2210,7 +2210,8 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref) * @ref: epoll reference * @events: epoll events bitmap */ -void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events) +void tcp_sock_handler(const struct ctx *c, union epoll_ref ref, + uint32_t events) { struct tcp_tap_conn *conn = conn_at_sidx(ref.flowside); @@ -2494,6 +2495,40 @@ static bool tcp_probe_peek_offset_cap(sa_family_t af) return ret; } +#ifdef HAS_SND_WND +/** + * tcp_probe_snd_wnd_cap() - Check if TCP_INFO reports tcpi_snd_wnd + * + * Return: true if supported, false otherwise + */ +static bool tcp_probe_snd_wnd_cap(void) +{ + struct tcp_info tinfo; + socklen_t sl = sizeof(tinfo); + int s; + + s = socket(AF_INET, SOCK_STREAM | SOCK_CLOEXEC, IPPROTO_TCP); + if (s < 0) { + warn_perror("Temporary TCP socket creation failed"); + return false; + } + + if (getsockopt(s, SOL_TCP, TCP_INFO, &tinfo, &sl)) { + warn_perror("Failed to get TCP_INFO on temporary socket"); + close(s); + return false; + } + + close(s); + + if (sl < (offsetof(struct tcp_info, tcpi_snd_wnd) + + sizeof(tinfo.tcpi_snd_wnd))) + return false; + + return true; +} +#endif /* HAS_SND_WND */ + /** * tcp_init() - Get initial sequence, hash secret, initialise per-socket data * @c: Execution context @@ -2527,6 +2562,12 @@ int tcp_init(struct ctx *c) (!c->ifi6 || tcp_probe_peek_offset_cap(AF_INET6)); debug("SO_PEEK_OFF%ssupported", peek_offset_cap ? " " : " not "); +#ifdef HAS_SND_WND + snd_wnd_cap = tcp_probe_snd_wnd_cap(); +#endif + debug("TCP_INFO tcpi_snd_wnd field%ssupported", + snd_wnd_cap ? " " : " not "); + return 0; } diff --git a/tcp.h b/tcp.h index e9ff0191..5585924f 100644 --- a/tcp.h +++ b/tcp.h @@ -10,11 +10,12 @@ struct ctx; -void tcp_timer_handler(struct ctx *c, union epoll_ref ref); -void tcp_listen_handler(struct ctx *c, union epoll_ref ref, +void tcp_timer_handler(const struct ctx *c, union epoll_ref ref); +void tcp_listen_handler(const struct ctx *c, union epoll_ref ref, const struct timespec *now); -void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events); -int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af, +void tcp_sock_handler(const struct ctx *c, union epoll_ref ref, + uint32_t events); +int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, const void *saddr, const void *daddr, const struct pool *p, int idx, const struct timespec *now); int tcp_sock_init(const struct ctx *c, sa_family_t af, const void *addr, @@ -58,16 +59,12 @@ union tcp_listen_epoll_ref { * @fwd_in: Port forwarding configuration for inbound packets * @fwd_out: Port forwarding configuration for outbound packets * @timer_run: Timestamp of most recent timer run - * @kernel_snd_wnd: Kernel reports sending window (with commit 8f7baad7f035) * @pipe_size: Size of pipes for spliced connections */ struct tcp_ctx { struct fwd_ports fwd_in; struct fwd_ports fwd_out; struct timespec timer_run; -#ifdef HAS_SND_WND - int kernel_snd_wnd; -#endif size_t pipe_size; }; diff --git a/tcp_buf.c b/tcp_buf.c index 1a398461..c886c92b 100644 --- a/tcp_buf.c +++ b/tcp_buf.c @@ -239,7 +239,7 @@ void tcp_flags_flush(const struct ctx *c) * @frames: Two-dimensional array containing queued frames with sub-iovs * @num_frames: Number of entries in the two arrays to be compared */ -static void tcp_revert_seq(struct ctx *c, struct tcp_tap_conn **conns, +static void tcp_revert_seq(const struct ctx *c, struct tcp_tap_conn **conns, struct iovec (*frames)[TCP_NUM_IOVS], int num_frames) { int i; @@ -264,7 +264,7 @@ static void tcp_revert_seq(struct ctx *c, struct tcp_tap_conn **conns, * tcp_payload_flush() - Send out buffers for segments with data * @c: Execution context */ -void tcp_payload_flush(struct ctx *c) +void tcp_payload_flush(const struct ctx *c) { size_t m; @@ -293,7 +293,7 @@ void tcp_payload_flush(struct ctx *c) * * Return: negative error code on connection reset, 0 otherwise */ -int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags) +int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags) { struct tcp_flags_t *payload; struct iovec *iov; @@ -361,7 +361,7 @@ int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags) * @no_csum: Don't compute IPv4 checksum, use the one from previous buffer * @seq: Sequence number to be sent */ -static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn, +static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn, ssize_t dlen, int no_csum, uint32_t seq) { struct iovec *iov; @@ -405,7 +405,7 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn, * * #syscalls recvmsg */ -int tcp_buf_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) +int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn) { uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap; int fill_bufs, send_bufs = 0, last_len, iov_rem = 0; diff --git a/tcp_buf.h b/tcp_buf.h index 3db4c56e..8d4b615a 100644 --- a/tcp_buf.h +++ b/tcp_buf.h @@ -9,8 +9,8 @@ void tcp_sock4_iov_init(const struct ctx *c); void tcp_sock6_iov_init(const struct ctx *c); void tcp_flags_flush(const struct ctx *c); -void tcp_payload_flush(struct ctx *c); -int tcp_buf_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn); -int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags); +void tcp_payload_flush(const struct ctx *c); +int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn); +int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags); #endif /*TCP_BUF_H */ diff --git a/tcp_internal.h b/tcp_internal.h index aa8bb64f..bd634be1 100644 --- a/tcp_internal.h +++ b/tcp_internal.h @@ -82,7 +82,7 @@ void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn, conn_event_do(c, conn, event); \ } while (0) -void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn); +void tcp_rst_do(const struct ctx *c, struct tcp_tap_conn *conn); #define tcp_rst(c, conn) \ do { \ flow_dbg((conn), "TCP reset at %s:%i", __func__, __LINE__); \ @@ -94,7 +94,7 @@ size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn, const uint16_t *check, uint32_t seq); int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn, int force_seq, struct tcp_info *tinfo); -int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn, int flags, +int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn, int flags, struct tcphdr *th, char *data, size_t *optlen); #endif /* TCP_INTERNAL_H */
This function has a block conditional on !snd_wnd_cap shortly before an #ifdef HAS_SND_WND. But snd_wnd_cap implies HAS_SND_WND (if !HAS_SND_WND, snd_wnd_cap is statically false). Therefore, simplify this down to a single conditional with an else branch. While we're there, fix some improperly indented closing braces. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- tcp.c | 36 +++++++++++++++++------------------- 1 file changed, 17 insertions(+), 19 deletions(-) diff --git a/tcp.c b/tcp.c index cba3f3bd..6733e7e3 100644 --- a/tcp.c +++ b/tcp.c @@ -1061,27 +1061,25 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn, conn->wnd_to_tap = MIN(new_wnd_to_tap >> conn->ws_to_tap, USHRT_MAX); goto out; - } - - if (!tinfo) { - if (prev_wnd_to_tap > WINDOW_DEFAULT) { - goto out; -} - tinfo = &tinfo_new; - if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl)) { - goto out; -} - } - -#ifdef HAS_SND_WND - if ((conn->flags & LOCAL) || tcp_rtt_dst_low(conn)) { - new_wnd_to_tap = tinfo->tcpi_snd_wnd; } else { - tcp_get_sndbuf(conn); - new_wnd_to_tap = MIN((int)tinfo->tcpi_snd_wnd, - SNDBUF_GET(conn)); + if (!tinfo) { + if (prev_wnd_to_tap > WINDOW_DEFAULT) { + goto out; + } + tinfo = &tinfo_new; + if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl)) { + goto out; + } + } + + if ((conn->flags & LOCAL) || tcp_rtt_dst_low(conn)) { + new_wnd_to_tap = tinfo->tcpi_snd_wnd; + } else { + tcp_get_sndbuf(conn); + new_wnd_to_tap = MIN((int)tinfo->tcpi_snd_wnd, + SNDBUF_GET(conn)); + } } -#endif new_wnd_to_tap = MIN(new_wnd_to_tap, MAX_WINDOW); if (!(conn->events & ESTABLISHED)) -- 2.46.0
On Fri, 13 Sep 2024 14:32:07 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:This function has a block conditional on !snd_wnd_cap shortly before an #ifdef HAS_SND_WND. But snd_wnd_cap implies HAS_SND_WND (if !HAS_SND_WND, snd_wnd_cap is statically false). Therefore, simplify this down to a single conditional with an else branch. While we're there, fix some improperly indented closing braces. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- tcp.c | 36 +++++++++++++++++------------------- 1 file changed, 17 insertions(+), 19 deletions(-) diff --git a/tcp.c b/tcp.c index cba3f3bd..6733e7e3 100644 --- a/tcp.c +++ b/tcp.c @@ -1061,27 +1061,25 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn, conn->wnd_to_tap = MIN(new_wnd_to_tap >> conn->ws_to_tap, USHRT_MAX); goto out; - } - - if (!tinfo) { - if (prev_wnd_to_tap > WINDOW_DEFAULT) { - goto out; -} - tinfo = &tinfo_new; - if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl)) { - goto out; -} - } - -#ifdef HAS_SND_WND - if ((conn->flags & LOCAL) || tcp_rtt_dst_low(conn)) { - new_wnd_to_tap = tinfo->tcpi_snd_wnd; } else {I thought cppcheck would report else-after-goto, but it doesn't, just else-after-return. In any case, we could simplify further by avoid that else clause (and one level of indentation) in the whole block below. It would also look more natural to me: we deal with if (!snd_wnd_cap) as a special case and go to 'out' in that special case, then we resume with the regular path. I guess this is better than the original anyway and it's not a strong preference, so I can also apply this up to 4/10 as it is (or fix up on merge). The rest of the patches up to 4/10 look good to me.- tcp_get_sndbuf(conn); - new_wnd_to_tap = MIN((int)tinfo->tcpi_snd_wnd, - SNDBUF_GET(conn)); + if (!tinfo) { + if (prev_wnd_to_tap > WINDOW_DEFAULT) { + goto out; + } + tinfo = &tinfo_new; + if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl)) { + goto out; + } + } + + if ((conn->flags & LOCAL) || tcp_rtt_dst_low(conn)) { + new_wnd_to_tap = tinfo->tcpi_snd_wnd; + } else { + tcp_get_sndbuf(conn); + new_wnd_to_tap = MIN((int)tinfo->tcpi_snd_wnd, + SNDBUF_GET(conn)); + } } -#endif new_wnd_to_tap = MIN(new_wnd_to_tap, MAX_WINDOW); if (!(conn->events & ESTABLISHED))-- Stefano
On Tue, Sep 17, 2024 at 11:54:34PM +0200, Stefano Brivio wrote:On Fri, 13 Sep 2024 14:32:07 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Good idea, done.This function has a block conditional on !snd_wnd_cap shortly before an #ifdef HAS_SND_WND. But snd_wnd_cap implies HAS_SND_WND (if !HAS_SND_WND, snd_wnd_cap is statically false). Therefore, simplify this down to a single conditional with an else branch. While we're there, fix some improperly indented closing braces. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- tcp.c | 36 +++++++++++++++++------------------- 1 file changed, 17 insertions(+), 19 deletions(-) diff --git a/tcp.c b/tcp.c index cba3f3bd..6733e7e3 100644 --- a/tcp.c +++ b/tcp.c @@ -1061,27 +1061,25 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn, conn->wnd_to_tap = MIN(new_wnd_to_tap >> conn->ws_to_tap, USHRT_MAX); goto out; - } - - if (!tinfo) { - if (prev_wnd_to_tap > WINDOW_DEFAULT) { - goto out; -} - tinfo = &tinfo_new; - if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl)) { - goto out; -} - } - -#ifdef HAS_SND_WND - if ((conn->flags & LOCAL) || tcp_rtt_dst_low(conn)) { - new_wnd_to_tap = tinfo->tcpi_snd_wnd; } else {I thought cppcheck would report else-after-goto, but it doesn't, just else-after-return. In any case, we could simplify further by avoid that else clause (and one level of indentation) in the whole block below.It would also look more natural to me: we deal with if (!snd_wnd_cap) as a special case and go to 'out' in that special case, then we resume with the regular path. I guess this is better than the original anyway and it's not a strong preference, so I can also apply this up to 4/10 as it is (or fix up on merge). The rest of the patches up to 4/10 look good to me.Well, I've made the change now, so I might as well repost :).-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson- tcp_get_sndbuf(conn); - new_wnd_to_tap = MIN((int)tinfo->tcpi_snd_wnd, - SNDBUF_GET(conn)); + if (!tinfo) { + if (prev_wnd_to_tap > WINDOW_DEFAULT) { + goto out; + } + tinfo = &tinfo_new; + if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl)) { + goto out; + } + } + + if ((conn->flags & LOCAL) || tcp_rtt_dst_low(conn)) { + new_wnd_to_tap = tinfo->tcpi_snd_wnd; + } else { + tcp_get_sndbuf(conn); + new_wnd_to_tap = MIN((int)tinfo->tcpi_snd_wnd, + SNDBUF_GET(conn)); + } } -#endif new_wnd_to_tap = MIN(new_wnd_to_tap, MAX_WINDOW); if (!(conn->events & ESTABLISHED))
This parameter is already treated as a boolean internally. Make it a 'bool' type for clarity. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- tcp.c | 6 +++--- tcp_buf.c | 2 +- tcp_internal.h | 2 +- 3 files changed, 5 insertions(+), 5 deletions(-) diff --git a/tcp.c b/tcp.c index 6733e7e3..ee894c5c 100644 --- a/tcp.c +++ b/tcp.c @@ -1020,7 +1020,7 @@ size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn, * Return: 1 if sequence or window were updated, 0 otherwise */ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn, - int force_seq, struct tcp_info *tinfo) + bool force_seq, struct tcp_info *tinfo) { uint32_t prev_wnd_to_tap = conn->wnd_to_tap << conn->ws_to_tap; uint32_t prev_ack_to_tap = conn->seq_ack_to_tap; @@ -1157,7 +1157,7 @@ int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn, if (!(conn->flags & LOCAL)) tcp_rtt_dst_check(conn, &tinfo); - if (!tcp_update_seqack_wnd(c, conn, flags, &tinfo) && !flags) + if (!tcp_update_seqack_wnd(c, conn, !!flags, &tinfo) && !flags) return 0; *optlen = 0; @@ -2240,7 +2240,7 @@ void tcp_sock_handler(const struct ctx *c, union epoll_ref ref, tcp_data_from_sock(c, conn); if (events & EPOLLOUT) - tcp_update_seqack_wnd(c, conn, 0, NULL); + tcp_update_seqack_wnd(c, conn, false, NULL); return; } diff --git a/tcp_buf.c b/tcp_buf.c index c886c92b..83f91a37 100644 --- a/tcp_buf.c +++ b/tcp_buf.c @@ -511,7 +511,7 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn) last_len = sendlen - (send_bufs - 1) * mss; /* Likely, some new data was acked too. */ - tcp_update_seqack_wnd(c, conn, 0, NULL); + tcp_update_seqack_wnd(c, conn, false, NULL); /* Finally, queue to tap */ dlen = mss; diff --git a/tcp_internal.h b/tcp_internal.h index bd634be1..a450d850 100644 --- a/tcp_internal.h +++ b/tcp_internal.h @@ -93,7 +93,7 @@ size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn, struct iovec *iov, size_t dlen, const uint16_t *check, uint32_t seq); int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn, - int force_seq, struct tcp_info *tinfo); + bool force_seq, struct tcp_info *tinfo); int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn, int flags, struct tcphdr *th, char *data, size_t *optlen); -- 2.46.0
When the guest sends data on a TCP connection, it's possible that we run out of buffer space on the socket side. If that happens we'll advertise a zero window to the guest stopping it from sending. When the socket side buffer clears again, which is signalled by an EPOLLOUT, we recalculate the ack and window with tcp_update_seqack_wnd(). tcp_update_seqack_wnd() only calculates the new values, it doesn't actually send out the ack - that will typically happen some milliseconds later on the ACK_TO_TAP_DUE timer. AFAICT, there's not really any point delaying this ack though, we might as well send it and get the guest sending again ASAP. So, instead of calling tcp_update_seqack_wnd() call tcp_send_flag() to send an immediate ACK if needed. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- tcp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tcp.c b/tcp.c index ee894c5c..32e45e09 100644 --- a/tcp.c +++ b/tcp.c @@ -2240,7 +2240,7 @@ void tcp_sock_handler(const struct ctx *c, union epoll_ref ref, tcp_data_from_sock(c, conn); if (events & EPOLLOUT) - tcp_update_seqack_wnd(c, conn, false, NULL); + tcp_send_flag(c, conn, ACK_IF_NEEDED); return; } -- 2.46.0
Since 4684f603446b ("tap: Don't use EPOLLET on Qemu sockets") we've only used level-triggered events for the tap device. Prior to that we used it inconsistently which was confusing (though not incorrect AFAICT). We want to add support for EPOLLOUT events on the tap connection, and without EPOLLET that would require toggling EPOLLOUT on and off, which is awkward. So, re-introduce EPOLLET, but now use it uniformly for all tap modes. The main change this requires is making sure on EPOLLIN we loop until all there's no more data to process. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- tap.c | 34 +++++++++++++++++++++++----------- 1 file changed, 23 insertions(+), 11 deletions(-) diff --git a/tap.c b/tap.c index 41af6a6d..c1db2960 100644 --- a/tap.c +++ b/tap.c @@ -985,8 +985,10 @@ static void tap_sock_reset(struct ctx *c) * tap_passt_input() - Handler for new data on the socket to qemu * @c: Execution context * @now: Current timestamp + * + * Return: true if there may be additional data to read, false otherwise */ -static void tap_passt_input(struct ctx *c, const struct timespec *now) +static bool tap_passt_input(struct ctx *c, const struct timespec *now) { static const char *partial_frame; static ssize_t partial_len = 0; @@ -1013,7 +1015,7 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now) err_perror("Receive error on guest connection, reset"); tap_sock_reset(c); } - return; + return false; } p = pkt_buf; @@ -1025,7 +1027,7 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now) if (l2len < sizeof(struct ethhdr) || l2len > ETH_MAX_MTU) { err("Bad frame size from guest, resetting connection"); tap_sock_reset(c); - return; + return false; } if (l2len + sizeof(uint32_t) > (size_t)n) @@ -1045,6 +1047,8 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now) partial_frame = p; tap_handler(c, now); + + return true; } /** @@ -1061,16 +1065,20 @@ void tap_handler_passt(struct ctx *c, uint32_t events, return; } - if (events & EPOLLIN) - tap_passt_input(c, now); + if (events & EPOLLIN) { + while (tap_passt_input(c, now)) + ; + } } /** * tap_pasta_input() - Handler for new data on the socket to hypervisor * @c: Execution context * @now: Current timestamp + * + * Return: true if there may be additional data to read, false otherwise */ -static void tap_pasta_input(struct ctx *c, const struct timespec *now) +static bool tap_pasta_input(struct ctx *c, const struct timespec *now) { ssize_t n, len; @@ -1102,6 +1110,8 @@ static void tap_pasta_input(struct ctx *c, const struct timespec *now) } tap_handler(c, now); + + return len > 0; } /** @@ -1116,8 +1126,10 @@ void tap_handler_pasta(struct ctx *c, uint32_t events, if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR)) die("Disconnect event on /dev/net/tun device, exiting"); - if (events & EPOLLIN) - tap_pasta_input(c, now); + if (events & EPOLLIN) { + while (tap_pasta_input(c, now)) + ; + } } /** @@ -1251,7 +1263,7 @@ void tap_listen_handler(struct ctx *c, uint32_t events) trace("tap: failed to set SO_SNDBUF to %i", v); ref.fd = c->fd_tap; - ev.events = EPOLLIN | EPOLLRDHUP; + ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET; ev.data.u64 = ref.u64; epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev); } @@ -1307,7 +1319,7 @@ static void tap_sock_tun_init(struct ctx *c) pasta_ns_conf(c); ref.fd = c->fd_tap; - ev.events = EPOLLIN | EPOLLRDHUP; + ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET; ev.data.u64 = ref.u64; epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev); } @@ -1340,7 +1352,7 @@ void tap_sock_init(struct ctx *c) else ref.type = EPOLL_TYPE_TAP_PASTA; - ev.events = EPOLLIN | EPOLLRDHUP; + ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET; ev.data.u64 = ref.u64; epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev); return; -- 2.46.0
It's possible for the buffers of the tap interface (whatever style) to fill up, at which point we'll get some sort of short write. This can result in TCP connections with buffered data on the socket side we weren't able to forward. To more efficiently know when we can forward that data we need to know when the tap interface is no longer "full". To assist that, keep track of our best estimate of whether the tap device is full: set it when we get a short write, and clear it when we get an EPOLLOUT event on the tap fd, indicating we can write more data. Protocols (specifically TCP) can use this via a new tap_is_full() call. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- tap.c | 31 +++++++++++++++++++++++++++---- tap.h | 1 + 2 files changed, 28 insertions(+), 4 deletions(-) diff --git a/tap.c b/tap.c index c1db2960..3bdca9a1 100644 --- a/tap.c +++ b/tap.c @@ -63,6 +63,9 @@ static PACKET_POOL_NOINIT(pool_tap4, TAP_MSGS, pkt_buf); static PACKET_POOL_NOINIT(pool_tap6, TAP_MSGS, pkt_buf); +/* Filled buffers on the tap device */ +static bool tap_full_flag; + #define TAP_SEQS 128 /* Different L4 tuples in one batch */ #define FRAGMENT_MSG_RATE 10 /* # seconds between fragment warnings */ @@ -411,9 +414,11 @@ size_t tap_send_frames(const struct ctx *c, const struct iovec *iov, else m = tap_send_frames_passt(c, iov, bufs_per_frame, nframes); - if (m < nframes) + if (m < nframes) { + tap_full_flag = true; debug("tap: failed to send %zu frames of %zu", nframes - m, nframes); + } pcap_multiple(iov, bufs_per_frame, m, c->mode == MODE_PASST ? sizeof(uint32_t) : 0); @@ -1069,6 +1074,9 @@ void tap_handler_passt(struct ctx *c, uint32_t events, while (tap_passt_input(c, now)) ; } + + if (events & EPOLLOUT) + tap_full_flag = false; } /** @@ -1130,6 +1138,21 @@ void tap_handler_pasta(struct ctx *c, uint32_t events, while (tap_pasta_input(c, now)) ; } + + if (events & EPOLLOUT) + tap_full_flag = false; +} + +/** + * tap_is_full() - Can we write more to the tap device this cycle? + * + * Return: true if (last we knew) there's more space in the tap buffers, false + * otherwise + */ +/* cppcheck-suppress unusedFunction */ +bool tap_is_full(void) +{ + return tap_full_flag; } /** @@ -1263,7 +1286,7 @@ void tap_listen_handler(struct ctx *c, uint32_t events) trace("tap: failed to set SO_SNDBUF to %i", v); ref.fd = c->fd_tap; - ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET; + ev.events = EPOLLIN | EPOLLOUT | EPOLLRDHUP | EPOLLET; ev.data.u64 = ref.u64; epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev); } @@ -1319,7 +1342,7 @@ static void tap_sock_tun_init(struct ctx *c) pasta_ns_conf(c); ref.fd = c->fd_tap; - ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET; + ev.events = EPOLLIN | EPOLLOUT | EPOLLRDHUP | EPOLLET; ev.data.u64 = ref.u64; epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev); } @@ -1352,7 +1375,7 @@ void tap_sock_init(struct ctx *c) else ref.type = EPOLL_TYPE_TAP_PASTA; - ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET; + ev.events = EPOLLIN | EPOLLOUT | EPOLLRDHUP | EPOLLET; ev.data.u64 = ref.u64; epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev); return; diff --git a/tap.h b/tap.h index ec9e2ace..6094fbdd 100644 --- a/tap.h +++ b/tap.h @@ -67,6 +67,7 @@ void tap_handler_pasta(struct ctx *c, uint32_t events, const struct timespec *now); void tap_handler_passt(struct ctx *c, uint32_t events, const struct timespec *now); +bool tap_is_full(void); int tap_sock_unix_open(char *sock_path); void tap_sock_init(struct ctx *c); void tap_flush_pools(void); -- 2.46.0
When we receive new data on a TCP socket, we attempt to forward all of it to the tap interface immediately. However, if the tap buffers fill up, we might fail to transfer some of it. In that case we'll need to try again later. Currently that's handled by the EPOLLIN event being re-asserted in level mode at some point by complicated fiddling of the event masks. In preparation for a simpler way of triggering the retry, keep track of which connections are in this state with a new TAP_FULL flag. We also keep a count of the total number of connections currently in the TAP_FULL state. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- tcp.c | 12 +++++++++++- tcp_buf.c | 3 +++ tcp_conn.h | 1 + 3 files changed, 15 insertions(+), 1 deletion(-) diff --git a/tcp.c b/tcp.c index 32e45e09..4b478432 100644 --- a/tcp.c +++ b/tcp.c @@ -349,7 +349,7 @@ static const char *tcp_state_str[] __attribute((__unused__)) = { static const char *tcp_flag_str[] __attribute((__unused__)) = { "STALLED", "LOCAL", "ACTIVE_CLOSE", "ACK_TO_TAP_DUE", - "ACK_FROM_TAP_DUE", + "ACK_FROM_TAP_DUE", "TAP_FULL", }; /* Listening sockets, used for automatic port forwarding in pasta mode only */ @@ -381,6 +381,11 @@ static struct iovec tcp_iov [UIO_MAXIOV]; int init_sock_pool4 [TCP_SOCK_POOL_SIZE]; int init_sock_pool6 [TCP_SOCK_POOL_SIZE]; +/* Number of connections with pending socket data that couldn't be sent because + * we ran out of buffer space on the tap side + */ +unsigned num_tap_full; + /** * conn_at_sidx() - Get TCP connection specific flow at given sidx * @sidx: Flow and side to retrieve @@ -597,6 +602,11 @@ void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn, (flag == ~ACK_FROM_TAP_DUE && (conn->flags & ACK_TO_TAP_DUE)) || (flag == ~ACK_TO_TAP_DUE && (conn->flags & ACK_FROM_TAP_DUE))) tcp_timer_ctl(c, conn); + + if (flag == TAP_FULL) + num_tap_full++; + else if (flag == ~TAP_FULL) + num_tap_full--; } /** diff --git a/tcp_buf.c b/tcp_buf.c index 83f91a37..0ccb9e6b 100644 --- a/tcp_buf.c +++ b/tcp_buf.c @@ -250,6 +250,8 @@ static void tcp_revert_seq(const struct ctx *c, struct tcp_tap_conn **conns, uint32_t seq = ntohl(th->seq); uint32_t peek_offset; + conn_flag(c, conn, TAP_FULL); + if (SEQ_LE(conn->seq_to_tap, seq)) continue; @@ -526,6 +528,7 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn) seq += dlen; } + conn_flag(c, conn, ~TAP_FULL); conn_flag(c, conn, ACK_FROM_TAP_DUE); return 0; diff --git a/tcp_conn.h b/tcp_conn.h index 6ae05115..9677678c 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -77,6 +77,7 @@ struct tcp_tap_conn { #define ACTIVE_CLOSE BIT(2) #define ACK_TO_TAP_DUE BIT(3) #define ACK_FROM_TAP_DUE BIT(4) +#define TAP_FULL BIT(5) #define SNDBUF_BITS 24 unsigned int sndbuf :SNDBUF_BITS; -- 2.46.0
Future changes will want these functions to call things that are currently below them in the file, so move them later. Code motion only, no change to logic for now. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- tcp.c | 58 +++++++++++++++++++++++++++++----------------------------- 1 file changed, 29 insertions(+), 29 deletions(-) diff --git a/tcp.c b/tcp.c index 4b478432..78f546db 100644 --- a/tcp.c +++ b/tcp.c @@ -853,35 +853,6 @@ static int tcp_opt_get(const char *opts, size_t len, uint8_t type_find, return -1; } -/** - * tcp_flow_defer() - Deferred per-flow handling (clean up closed connections) - * @conn: Connection to handle - * - * Return: true if the connection is ready to free, false otherwise - */ -bool tcp_flow_defer(const struct tcp_tap_conn *conn) -{ - if (conn->events != CLOSED) - return false; - - close(conn->sock); - if (conn->timer != -1) - close(conn->timer); - - return true; -} - -/** - * tcp_defer_handler() - Handler for TCP deferred tasks - * @c: Execution context - */ -/* cppcheck-suppress [constParameterPointer, unmatchedSuppression] */ -void tcp_defer_handler(struct ctx *c) -{ - tcp_flags_flush(c); - tcp_payload_flush(c); -} - /** * tcp_fill_header() - Fill the TCP header fields for a given TCP segment. * @@ -2272,6 +2243,35 @@ void tcp_sock_handler(const struct ctx *c, union epoll_ref ref, } } +/** + * tcp_flow_defer() - Deferred per-flow handling (clean up closed connections) + * @conn: Connection to handle + * + * Return: true if the connection is ready to free, false otherwise + */ +bool tcp_flow_defer(const struct tcp_tap_conn *conn) +{ + if (conn->events != CLOSED) + return false; + + close(conn->sock); + if (conn->timer != -1) + close(conn->timer); + + return true; +} + +/** + * tcp_defer_handler() - Handler for TCP deferred tasks + * @c: Execution context + */ +/* cppcheck-suppress [constParameterPointer, unmatchedSuppression] */ +void tcp_defer_handler(struct ctx *c) +{ + tcp_flags_flush(c); + tcp_payload_flush(c); +} + /** * tcp_sock_init_af() - Initialise listening socket for a given af and port * @c: Execution context -- 2.46.0
tcp can use a number of different epoll event masks, and even switches between edge-triggered and level-triggered mode depending on a number of factors. This makes it very difficult to reason about exactly what events will be active when. Simplify this by instead always using event mask with all the events we ever care about (EPOLLIN, EPOLLOUT and EPOLLRDHUP). We have a number of situations where we can't immediately clear the event condition, so we must always use edge-triggered mode to avoid spinning on events. We do need to make some changes to properly handle edge-triggered mode. Specifically there are two cases where we might not be able to process all new data on an EPOLLIN, so we need to make sure we repoll the socket for queued data later. 1) If we can't forward all data to the guest, because the tap-side window is too small. To address this we repoll the socket for data when we receive an ack or window update from the guest side. 2) If we run out of buffer space in the tap device itself. We handle this by flagging connections in this state. We then recheck them at the end of each epoll cycle if the tap device is now reporting ready for more data. tcp_defer_handler() is called after the last round of epoll events is handled. Add logic here to revisit any connections which were previously blocked due to running out of buffer space on the tap device, and attempt to forward the data again. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- tap.c | 1 - tcp.c | 58 ++++++++++++++++++++++++++-------------------------------- 2 files changed, 26 insertions(+), 33 deletions(-) diff --git a/tap.c b/tap.c index 3bdca9a1..956db8e4 100644 --- a/tap.c +++ b/tap.c @@ -1149,7 +1149,6 @@ void tap_handler_pasta(struct ctx *c, uint32_t events, * Return: true if (last we knew) there's more space in the tap buffers, false * otherwise */ -/* cppcheck-suppress unusedFunction */ bool tap_is_full(void) { return tap_full_flag; diff --git a/tcp.c b/tcp.c index 78f546db..7e11f12b 100644 --- a/tcp.c +++ b/tcp.c @@ -423,34 +423,6 @@ int tcp_set_peek_offset(int s, int offset) return 0; } -/** - * tcp_conn_epoll_events() - epoll events mask for given connection state - * @events: Current connection events - * @conn_flags Connection flags - * - * Return: epoll events mask corresponding to implied connection state - */ -static uint32_t tcp_conn_epoll_events(uint8_t events, uint8_t conn_flags) -{ - if (!events) - return 0; - - if (events & ESTABLISHED) { - if (events & TAP_FIN_SENT) - return EPOLLET; - - if (conn_flags & STALLED) - return EPOLLIN | EPOLLOUT | EPOLLRDHUP | EPOLLET; - - return EPOLLIN | EPOLLRDHUP; - } - - if (events == TAP_SYN_RCVD) - return EPOLLOUT | EPOLLET | EPOLLRDHUP; - - return EPOLLET | EPOLLRDHUP; -} - /** * tcp_epoll_ctl() - Add/modify/delete epoll state from connection events * @c: Execution context @@ -463,7 +435,10 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn) int m = conn->in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD; union epoll_ref ref = { .type = EPOLL_TYPE_TCP, .fd = conn->sock, .flowside = FLOW_SIDX(conn, !TAPSIDE(conn)), }; - struct epoll_event ev = { .data.u64 = ref.u64 }; + struct epoll_event ev = { + .events = EPOLLIN | EPOLLOUT | EPOLLRDHUP | EPOLLET, + .data.u64 = ref.u64, + }; if (conn->events == CLOSED) { if (conn->in_epoll) @@ -473,8 +448,6 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn) return 0; } - ev.events = tcp_conn_epoll_events(conn->events, conn->flags); - if (epoll_ctl(c->epollfd, m, conn->sock, &ev)) return -errno; @@ -1746,9 +1719,13 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, tcp_rst(c, conn); return -1; } - tcp_data_from_sock(c, conn); } + /* The socket side might have queued data that we didn't send due to a + * full window. Now the window's updated, try again + */ + tcp_data_from_sock(c, conn); + if (!iov_i) goto out; @@ -2268,6 +2245,23 @@ bool tcp_flow_defer(const struct tcp_tap_conn *conn) /* cppcheck-suppress [constParameterPointer, unmatchedSuppression] */ void tcp_defer_handler(struct ctx *c) { + unsigned i; + + for (i = 0; i < FLOW_MAX; i++) { + union flow *flow = FLOW(i); + + if (!num_tap_full) + break; /* No need to continue */ + if (tap_is_full()) + break; /* No point in continuing */ + + if ((flow->f.type != FLOW_TCP) || + !(flow->tcp.flags & TAP_FULL)) + continue; + + tcp_data_from_sock(c, &flow->tcp); + } + tcp_flags_flush(c); tcp_payload_flush(c); } -- 2.46.0