As we discussed on email, this adds support for sending an RST in response to packets from the guest which don't match an existing flow and are neither SYN (requesting a new connection) nor themselves RST. This is a sligjhtly larger patch than I'd like, but I can't really see a way to simplify it without making fairly extensive reworks to share more code with paths for RST where there is a known connection. That would end up being more churn. This doesn't (IMO) correctly handle IPv6 flow labels. Fixing that raises several additional issues regarding flow labels, so I've decided to defer that for now. v2: * Assorted cosmetic fixups * Use correct IPv6 flow label for packets * This required two preliminary patches * tcp_rst_no_conn() is now static David Gibson (3): ip: Helpers to access IPv6 flow label tap: Consider IPv6 flow label when building packet sequences tcp: Send RST in response to guest packets that match no connection ip.h | 24 ++++++++++++++++++ tap.c | 25 ++++++++++--------- tap.h | 6 +++++ tcp.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++--- tcp.h | 2 +- 5 files changed, 118 insertions(+), 17 deletions(-) -- 2.48.1
The flow label is a 20-bit field in the IPv6 header. The length and alignment make it awkward to pass around as is. Obviously, it can be packed into a 32-bit integer though, and we do this in two places. We have some further upcoming places where we want to manipulate the flow label, so make some helpers for marshalling and unmarshalling it to an integer. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- ip.h | 25 +++++++++++++++++++++++++ tap.c | 4 +--- tcp.c | 4 +--- 3 files changed, 27 insertions(+), 6 deletions(-) diff --git a/ip.h b/ip.h index 858cc899..5edb7e77 100644 --- a/ip.h +++ b/ip.h @@ -91,6 +91,31 @@ struct ipv6_opt_hdr { */ } __attribute__((packed)); /* required for some archs */ +/** + * ip6_set_flow_lbl() - Set flow label in an IPv6 header + * @ip6h: Pointer to IPv6 header, updated + * @flow: Set @ip6h flow label to the low 20 bits of this integer + */ +static inline void ip6_set_flow_lbl(struct ipv6hdr *ip6h, uint32_t flow) +{ + ip6h->flow_lbl[0] = (flow >> 16) & 0xf; + ip6h->flow_lbl[1] = (flow >> 8) & 0xff; + ip6h->flow_lbl[2] = (flow >> 0) & 0xff; +} + +/** ip6_get_flow_lbl() - Get flow label from an IPv6 header + * @ip6h: Pointer to IPv6 header + * + * Return: flow label from @ip6h as an integer (<= 20 bits) + */ +/* cppcheck-suppress unusedFunction */ +static inline uint32_t ip6_get_flow_lbl(const struct ipv6hdr *ip6h) +{ + return (ip6h->flow_lbl[0] & 0xf) << 16 | + ip6h->flow_lbl[1] << 8 | + ip6h->flow_lbl[2]; +} + char *ipv6_l4hdr(const struct pool *p, int idx, size_t offset, uint8_t *proto, size_t *dlen); diff --git a/tap.c b/tap.c index 44b0fc0f..39082627 100644 --- a/tap.c +++ b/tap.c @@ -241,9 +241,7 @@ static void *tap_push_ip6h(struct ipv6hdr *ip6h, ip6h->hop_limit = 255; ip6h->saddr = *src; ip6h->daddr = *dst; - ip6h->flow_lbl[0] = (flow >> 16) & 0xf; - ip6h->flow_lbl[1] = (flow >> 8) & 0xff; - ip6h->flow_lbl[2] = (flow >> 0) & 0xff; + ip6_set_flow_lbl(ip6h, flow); return ip6h + 1; } diff --git a/tcp.c b/tcp.c index b3aa9a2c..74598039 100644 --- a/tcp.c +++ b/tcp.c @@ -963,9 +963,7 @@ void tcp_fill_headers(const struct tcp_tap_conn *conn, ip6h->version = 6; ip6h->nexthdr = IPPROTO_TCP; - ip6h->flow_lbl[0] = (conn->sock >> 16) & 0xf; - ip6h->flow_lbl[1] = (conn->sock >> 8) & 0xff; - ip6h->flow_lbl[2] = (conn->sock >> 0) & 0xff; + ip6_set_flow_lbl(ip6h, conn->sock); if (!no_tcp_csum) { psum = proto_ipv6_header_psum(l4len, IPPROTO_TCP, -- 2.48.1
To allow more batching, we group together related packets into "seqs" in the tap layer, before passing them to the L4 protocol layers. Currently we consider the IP protocol, both IP addresses and also the L4 ports when grouping things into seqs. We ignore the IPv6 flow label. We have some future cases where we want to consider the the flow label in the L4 code, which is awkward if we could be given a single batch with multiple labels. Add the flow label to tap6_l4_t and group by it as well as the other criteria. In future we could possibly use the flow label _instead_ of peeking into the L4 header for the ports, but we don't do so for now. The guest should use the same flow label for all packets in a low, but if it doesn't this change won't break anything, it just means we'll batch things a bit sub-optimally. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- ip.h | 1 - tap.c | 4 ++++ 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/ip.h b/ip.h index 5edb7e77..c82431e9 100644 --- a/ip.h +++ b/ip.h @@ -108,7 +108,6 @@ static inline void ip6_set_flow_lbl(struct ipv6hdr *ip6h, uint32_t flow) * * Return: flow label from @ip6h as an integer (<= 20 bits) */ -/* cppcheck-suppress unusedFunction */ static inline uint32_t ip6_get_flow_lbl(const struct ipv6hdr *ip6h) { return (ip6h->flow_lbl[0] & 0xf) << 16 | diff --git a/tap.c b/tap.c index 39082627..202abae9 100644 --- a/tap.c +++ b/tap.c @@ -489,6 +489,7 @@ static struct tap4_l4_t { * struct l4_seq6_t - Message sequence for one protocol handler call, IPv6 * @msgs: Count of messages in sequence * @protocol: Protocol number + * @flow_lbl: IPv6 flow label * @source: Source port * @dest: Destination port * @saddr: Source address @@ -497,6 +498,7 @@ static struct tap4_l4_t { */ static struct tap6_l4_t { uint8_t protocol; + uint32_t flow_lbl :20; uint16_t source; uint16_t dest; @@ -870,6 +872,7 @@ resume: ((seq)->protocol == (proto) && \ (seq)->source == (uh)->source && \ (seq)->dest == (uh)->dest && \ + (seq)->flow_lbl == ip6_get_flow_lbl(ip6h) && \ IN6_ARE_ADDR_EQUAL(&(seq)->saddr, saddr) && \ IN6_ARE_ADDR_EQUAL(&(seq)->daddr, daddr)) @@ -878,6 +881,7 @@ resume: (seq)->protocol = (proto); \ (seq)->source = (uh)->source; \ (seq)->dest = (uh)->dest; \ + (seq)->flow_lbl = ip6_get_flow_lbl(ip6h); \ (seq)->saddr = *saddr; \ (seq)->daddr = *daddr; \ } while (0) -- 2.48.1
Currently, if a non-SYN TCP packet arrives which doesn't match any existing connection, we simply ignore it. However RFC 9293, section 3.10.7.1 says we should respond with an RST to a non-SYN, non-RST packet that's for a CLOSED (i.e. non-existent) connection. This can arise in practice with migration, in cases where some error means we have to discard a connection. We destroy the connection with tcp_rst() in that case, but because the guest is stopped, we may not be able to deliver the RST packet on the tap interface immediately. This change ensures an RST will be sent if the guest tries to use the connection again. A similar situation can arise if a passt/pasta instance is killed or crashes, but is then replaced with another attached to the same guest. This can leave the guest with stale connections that the new passt instance isn't aware of. It's better to send an RST so the guest knows quickly these are broken, rather than letting them linger until they time out. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- tap.c | 17 +++++++------- tap.h | 6 +++++ tcp.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++- tcp.h | 2 +- 4 files changed, 88 insertions(+), 11 deletions(-) diff --git a/tap.c b/tap.c index 202abae9..86d051e3 100644 --- a/tap.c +++ b/tap.c @@ -122,7 +122,7 @@ const struct in6_addr *tap_ip6_daddr(const struct ctx *c, * * Return: pointer at which to write the packet's payload */ -static void *tap_push_l2h(const struct ctx *c, void *buf, uint16_t proto) +void *tap_push_l2h(const struct ctx *c, void *buf, uint16_t proto) { struct ethhdr *eh = (struct ethhdr *)buf; @@ -143,8 +143,8 @@ static void *tap_push_l2h(const struct ctx *c, void *buf, uint16_t proto) * * Return: pointer at which to write the packet's payload */ -static void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src, - struct in_addr dst, size_t l4len, uint8_t proto) +void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src, + struct in_addr dst, size_t l4len, uint8_t proto) { uint16_t l3len = l4len + sizeof(*ip4h); @@ -229,10 +229,9 @@ void tap_icmp4_send(const struct ctx *c, struct in_addr src, struct in_addr dst, * * Return: pointer at which to write the packet's payload */ -static void *tap_push_ip6h(struct ipv6hdr *ip6h, - const struct in6_addr *src, - const struct in6_addr *dst, - size_t l4len, uint8_t proto, uint32_t flow) +void *tap_push_ip6h(struct ipv6hdr *ip6h, + const struct in6_addr *src, const struct in6_addr *dst, + size_t l4len, uint8_t proto, uint32_t flow) { ip6h->payload_len = htons(l4len); ip6h->priority = 0; @@ -744,7 +743,7 @@ append: for (k = 0; k < p->count; ) k += tcp_tap_handler(c, PIF_TAP, AF_INET, &seq->saddr, &seq->daddr, - p, k, now); + 0, p, k, now); } else if (seq->protocol == IPPROTO_UDP) { if (c->no_udp) continue; @@ -927,7 +926,7 @@ append: for (k = 0; k < p->count; ) k += tcp_tap_handler(c, PIF_TAP, AF_INET6, &seq->saddr, &seq->daddr, - p, k, now); + seq->flow_lbl, p, k, now); } else if (seq->protocol == IPPROTO_UDP) { if (c->no_udp) continue; diff --git a/tap.h b/tap.h index a476a125..390ac127 100644 --- a/tap.h +++ b/tap.h @@ -42,6 +42,9 @@ static inline void tap_hdr_update(struct tap_hdr *thdr, size_t l2len) thdr->vnet_len = htonl(l2len); } +void *tap_push_l2h(const struct ctx *c, void *buf, uint16_t proto); +void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src, + struct in_addr dst, size_t l4len, uint8_t proto); void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport, struct in_addr dst, in_port_t dport, const void *in, size_t dlen); @@ -49,6 +52,9 @@ void tap_icmp4_send(const struct ctx *c, struct in_addr src, struct in_addr dst, const void *in, size_t l4len); const struct in6_addr *tap_ip6_daddr(const struct ctx *c, const struct in6_addr *src); +void *tap_push_ip6h(struct ipv6hdr *ip6h, + const struct in6_addr *src, const struct in6_addr *dst, + size_t l4len, uint8_t proto, uint32_t flow); void tap_udp6_send(const struct ctx *c, const struct in6_addr *src, in_port_t sport, const struct in6_addr *dst, in_port_t dport, diff --git a/tcp.c b/tcp.c index 74598039..fb04e2e7 100644 --- a/tcp.c +++ b/tcp.c @@ -1866,6 +1866,75 @@ static void tcp_conn_from_sock_finish(const struct ctx *c, tcp_data_from_sock(c, conn); } +/** + * tcp_rst_no_conn() - Send RST in response to a packet with no connection + * @c: Execution context + * @af: Address family, AF_INET or AF_INET6 + * @saddr: Source address of the packet we're responding to + * @daddr: Destination address of the packet we're responding to + * @flow_lbl: IPv6 flow label (ignored for IPv4) + * @th: TCP header of the packet we're responding to + * @l4len: Packet length, including TCP header + */ +static void tcp_rst_no_conn(const struct ctx *c, int af, + const void *saddr, const void *daddr, + uint32_t flow_lbl, + const struct tcphdr *th, size_t l4len) +{ + struct iov_tail payload = IOV_TAIL(NULL, 0, 0); + struct tcphdr *rsth; + char buf[USHRT_MAX]; + uint32_t psum = 0; + size_t rst_l2len; + + /* Don't respond to RSTs without a connection */ + if (th->rst) + return; + + if (af == AF_INET) { + struct iphdr *ip4h = tap_push_l2h(c, buf, ETH_P_IP); + const struct in_addr *rst_src = daddr; + const struct in_addr *rst_dst = saddr; + + rsth = tap_push_ip4h(ip4h, *rst_src, *rst_dst, + sizeof(*rsth), IPPROTO_TCP); + psum = proto_ipv4_header_psum(sizeof(*rsth), IPPROTO_TCP, + *rst_src, *rst_dst); + + } else { + struct ipv6hdr *ip6h = tap_push_l2h(c, buf, ETH_P_IPV6); + const struct in6_addr *rst_src = daddr; + const struct in6_addr *rst_dst = saddr; + + rsth = tap_push_ip6h(ip6h, rst_src, rst_dst, + sizeof(*rsth), IPPROTO_TCP, flow_lbl); + psum = proto_ipv6_header_psum(sizeof(*rsth), IPPROTO_TCP, + rst_src, rst_dst); + } + + memset(rsth, 0, sizeof(*rsth)); + + rsth->source = th->dest; + rsth->dest = th->source; + rsth->rst = 1; + rsth->doff = sizeof(*rsth) / 4UL; + + /* Sequence matching logic from RFC 9293 section 3.10.7.1 */ + if (th->ack) { + rsth->seq = th->ack_seq; + } else { + size_t dlen = l4len - th->doff * 4UL; + uint32_t ack = ntohl(th->seq) + dlen; + + rsth->ack_seq = htonl(ack); + rsth->ack = 1; + } + + tcp_update_csum(psum, rsth, &payload); + rst_l2len = ((char *)rsth - buf) + sizeof(*rsth); + tap_send_single(c, buf, rst_l2len); +} + /** * tcp_tap_handler() - Handle packets from tap and state transitions * @c: Execution context @@ -1873,6 +1942,7 @@ static void tcp_conn_from_sock_finish(const struct ctx *c, * @af: Address family, AF_INET or AF_INET6 * @saddr: Source address * @daddr: Destination address + * @flow_lbl: IPv6 flow label (ignored for IPv4) * @p: Pool of TCP packets, with TCP headers * @idx: Index of first packet in pool to process * @now: Current timestamp @@ -1880,7 +1950,7 @@ static void tcp_conn_from_sock_finish(const struct ctx *c, * Return: count of consumed packets */ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, - const void *saddr, const void *daddr, + const void *saddr, const void *daddr, uint32_t flow_lbl, const struct pool *p, int idx, const struct timespec *now) { struct tcp_tap_conn *conn; @@ -1913,6 +1983,8 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, if (opts && th->syn && !th->ack) tcp_conn_from_tap(c, af, saddr, daddr, th, opts, optlen, now); + else + tcp_rst_no_conn(c, af, saddr, daddr, flow_lbl, th, len); return 1; } diff --git a/tcp.h b/tcp.h index cf30744d..9142ecac 100644 --- a/tcp.h +++ b/tcp.h @@ -16,7 +16,7 @@ void tcp_listen_handler(const struct ctx *c, union epoll_ref ref, void tcp_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events); int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, - const void *saddr, const void *daddr, + const void *saddr, const void *daddr, uint32_t flow_lbl, const struct pool *p, int idx, const struct timespec *now); int tcp_sock_init(const struct ctx *c, const union inany_addr *addr, const char *ifname, in_port_t port); -- 2.48.1
On Wed, 5 Mar 2025 15:32:27 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:As we discussed on email, this adds support for sending an RST in response to packets from the guest which don't match an existing flow and are neither SYN (requesting a new connection) nor themselves RST. This is a sligjhtly larger patch than I'd like, but I can't really see a way to simplify it without making fairly extensive reworks to share more code with paths for RST where there is a known connection.Yeah, me neither...That would end up being more churn. This doesn't (IMO) correctly handle IPv6 flow labels. Fixing that raises several additional issues regarding flow labels, so I've decided to defer that for now. v2: * Assorted cosmetic fixups * Use correct IPv6 flow label for packets * This required two preliminary patches * tcp_rst_no_conn() is now staticApplied. -- Stefano