[PATCH 00/14] RFC: tcp: Don't use separate listening sockets for spliced and non-spliced connections
We can splice TCP connections in pasta mode if and only if they originate from localhost. Currently we separate the two cases by having separate listening sockets: one listens on the host address for non-spliceable connections, the other listens on the loopback address for spliceable connections. As well as requiring twice as many listening sockets, this has the drawback of meaning that passt and pasta behaviour differ subtley but surprisingly: by default passt inbound port forwards will listen on the unspecified address, but for pasta they will be changed to listen on the host's interface address. In fact we don't need to do this. We can defer the decision about whether to splice a connection until after we've accepted it, by testing the peer address to see if it is local. At least, in principle we can. This series deals with a number of complications on the way to accomplishing that. CAVEAT: The current draft increases the size of tcp_conn above its current 64 bytes, which is likely to push it into a second cache line on some machines. This is fixable, but doing so is fiddly, and I'm still working on it. David Gibson (14): style: Minor corrections to function comments tcp: Remove unused TCP_MAX_SOCKS constant tcp: Better helpers for converting between connection pointer and index tcp_splice: Helpers for converting from index to/from tcp_splice_conn tcp: Move connection state structures into a shared header tcp: Add connection union type tcp: Improved helpers to update connections after moving tcp: Unify spliced and non-spliced connection tables tcp: Unify tcp_defer_handler and tcp_splice_defer_handler() tcp: Partially unify tcp_timer() and tcp_splice_timer() tcp: Unify the IN_EPOLL flag tcp: Separate helpers to create ns listening sockets tcp: Unify part of spliced and non-spliced conn_from_sock path tcp: Use the same sockets to listen for spliced and non-spliced connections Makefile | 3 +- conf.c | 12 +- tap.c | 6 +- tcp.c | 661 +++++++++++++++++++++++---------------------------- tcp.h | 7 +- tcp_conn.h | 208 ++++++++++++++++ tcp_splice.c | 292 +++++++++-------------- tcp_splice.h | 4 +- 8 files changed, 625 insertions(+), 568 deletions(-) create mode 100644 tcp_conn.h -- 2.38.1
Some style issues and a typo.
Signed-off-by: David Gibson
Presumably it meant something in the past, but it's no longer used.
Signed-off-by: David Gibson
The macro CONN_OR_NULL() is used to look up connections by index with
bounds checking. Replace it with an inline function, which means:
- Better type checking
- No danger of multiple evaluation of an @index with side effects
Also add a helper to perform the reverse translation: from connection
pointer to index. Introduce a macro for this which will make later
cleanups easier and safer.
Signed-off-by: David Gibson
Like we already have for non-spliced connections, create a CONN_IDX()
macro for looking up the index of spliced connection structures. Change
the name of the array of spliced connections to be different from that for
non-spliced connections (even though they're in different modules). This
will make subsequent changes a bit safer.
Signed-off-by: David Gibson
Currently spliced and non-spliced connections use completely independent
tracking structures. We want to unify these, so as a preliminary step move
the definitions for both variants into a new tcp_conn.h header, shared by
tcp.c and tcp_splice.c.
This requires renaming some #defines with the same name but different
meanings between the two cases. In the process we correct some places that
are slightly out of sync between the comments and the code for various
event bit names.
Signed-off-by: David Gibson
Currently, the tables for spliced and non-spliced connections are entirely
separate, with different types in different arrays. We want to unify them.
As a first step, create a union type which can represent either a spliced
or non-spliced connection. For them to be distinguishable, the individual
types need to have a common header added, with a bit indicating which type
this structure is.
This comes at the cost of increasing the size of tcp_tap_conn to over one
(64 byte) cacheline. This isn't ideal, but it makes things simpler for now
and we'll re-optimize this later.
Signed-off-by: David Gibson
When we compact the connection tables (both spliced and non-spliced) we
need to move entries from one slot to another. That requires some updates
in the entries themselves. Add helpers to make all the necessary updates
for the spliced and non-spliced cases. This will simplify later cleanups.
Signed-off-by: David Gibson
Currently spliced and non-spliced connections are stored in completely
separate tables, so there are completely independent limits on the number
of spliced and non-spliced connections. This is a bit counter-intuitive.
More importantly, the fact that the tables are separate prevents us from
unifying some other logic between the two cases. So, merge these two
tables into one, using the 'c.spliced' common field to distinguish between
them when necessary.
For now we keep a common limit of 128k connections, whether they're spliced
or non-spliced, which means we save memory overall. If necessary we could
increase this to a 256k or higher total, which would cost memory but give
some more flexibility.
For now, the code paths which need to step through all extant connections
are still separate for the two cases, just skipping over entries which
aren't for them. We'll improve that in later patches.
Signed-off-by: David Gibson
These two functions each step through non-spliced and spliced connections
respectively and clean up entries for closed connections. To avoid
scanning the connection table twice, we merge these into a single function
which scans the unified table and performs the appropriate sort of cleanup
action on each one.
Signed-off-by: David Gibson
These two functions scan all the non-splced and spliced connections
respectively and perform timed updates on them. Avoid scanning the now
unified table twice, by having tcp_timer scan it once calling the
relevant per-connection function for each one.
Signed-off-by: David Gibson
There is very little common between the tcp_tap_conn and tcp_splice_conn
structures. However, both do have an IN_EPOLL flag which has the same
meaning in each case, though it's stored in a different location.
Simplify things slightly by moving this bit into the common header of the
two structures.
Signed-off-by: David Gibson
tcp_sock_init*() can create either sockets listening on the host, or in the pasta network namespace (with @ns==1). There are, however, a number of differences in how these two cases work in practice though. "ns" sockets are only used in pasta mode, and they always lead to spliced connections only. The functions are also only ever called in "ns" mode with a NULL address and interface name, and it doesn't really make sense for them to be called any other way. Later changes will introduce further differences in behaviour between these two cases, so it makes more sense to use separate functions for creating the ns listening sockets than the regular external/host listening sockets. --- conf.c | 6 +-- tcp.c | 130 ++++++++++++++++++++++++++++++++++++++------------------- tcp.h | 4 +- 3 files changed, 92 insertions(+), 48 deletions(-) diff --git a/conf.c b/conf.c index 3ad247e..2b39d18 100644 --- a/conf.c +++ b/conf.c @@ -209,7 +209,7 @@ static int conf_ports(const struct ctx *c, char optname, const char *optarg, for (i = 0; i < PORT_EPHEMERAL_MIN; i++) { if (optname == 't') - tcp_sock_init(c, 0, AF_UNSPEC, NULL, NULL, i); + tcp_sock_init(c, AF_UNSPEC, NULL, NULL, i); else if (optname == 'u') udp_sock_init(c, 0, AF_UNSPEC, NULL, NULL, i); } @@ -287,7 +287,7 @@ static int conf_ports(const struct ctx *c, char optname, const char *optarg, bitmap_set(fwd->map, i); if (optname == 't') - tcp_sock_init(c, 0, af, addr, ifname, i); + tcp_sock_init(c, af, addr, ifname, i); else if (optname == 'u') udp_sock_init(c, 0, af, addr, ifname, i); } @@ -333,7 +333,7 @@ static int conf_ports(const struct ctx *c, char optname, const char *optarg, fwd->delta[i] = mapped_range.first - orig_range.first; if (optname == 't') - tcp_sock_init(c, 0, af, addr, ifname, i); + tcp_sock_init(c, af, addr, ifname, i); else if (optname == 'u') udp_sock_init(c, 0, af, addr, ifname, i); } diff --git a/tcp.c b/tcp.c index c9bcbfb..47c025b 100644 --- a/tcp.c +++ b/tcp.c @@ -2986,15 +2986,15 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events, /** * tcp_sock_init4() - Initialise listening sockets for a given IPv4 port * @c: Execution context - * @ns: In pasta mode, if set, bind with loopback address in namespace * @addr: Pointer to address for binding, NULL if not configured * @ifname: Name of interface to bind to, NULL if not configured * @port: Port, host order */ -static void tcp_sock_init4(const struct ctx *c, int ns, const struct in_addr *addr, +static void tcp_sock_init4(const struct ctx *c, const struct in_addr *addr, const char *ifname, in_port_t port) { - union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.outbound = ns }; + in_port_t idx = port + c->tcp.fwd_in.delta[port]; + union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.index = idx }; bool spliced = false, tap = true; int s; @@ -3005,14 +3005,9 @@ static void tcp_sock_init4(const struct ctx *c, int ns, const struct in_addr *ad if (!addr) addr = &c->ip4.addr; - tap = !ns && !IN4_IS_ADDR_LOOPBACK(addr); + tap = !IN4_IS_ADDR_LOOPBACK(addr); } - if (ns) - tref.tcp.index = (in_port_t)(port + c->tcp.fwd_out.delta[port]); - else - tref.tcp.index = (in_port_t)(port + c->tcp.fwd_in.delta[port]); - if (tap) { s = sock_l4(c, AF_INET, IPPROTO_TCP, addr, ifname, port, tref.u32); @@ -3038,29 +3033,25 @@ static void tcp_sock_init4(const struct ctx *c, int ns, const struct in_addr *ad else s = -1; - if (c->tcp.fwd_out.mode == FWD_AUTO) { - if (ns) - tcp_sock_ns[port][V4] = s; - else - tcp_sock_init_lo[port][V4] = s; - } + if (c->tcp.fwd_out.mode == FWD_AUTO) + tcp_sock_init_lo[port][V4] = s; } } /** * tcp_sock_init6() - Initialise listening sockets for a given IPv6 port * @c: Execution context - * @ns: In pasta mode, if set, bind with loopback address in namespace * @addr: Pointer to address for binding, NULL if not configured * @ifname: Name of interface to bind to, NULL if not configured * @port: Port, host order */ -static void tcp_sock_init6(const struct ctx *c, int ns, +static void tcp_sock_init6(const struct ctx *c, const struct in6_addr *addr, const char *ifname, in_port_t port) { - union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.outbound = ns, - .tcp.v6 = 1 }; + in_port_t idx = port + c->tcp.fwd_in.delta[port]; + union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.v6 = 1, + .tcp.index = idx }; bool spliced = false, tap = true; int s; @@ -3072,14 +3063,9 @@ static void tcp_sock_init6(const struct ctx *c, int ns, if (!addr) addr = &c->ip6.addr; - tap = !ns && !IN6_IS_ADDR_LOOPBACK(addr); + tap = !IN6_IS_ADDR_LOOPBACK(addr); } - if (ns) - tref.tcp.index = (in_port_t)(port + c->tcp.fwd_out.delta[port]); - else - tref.tcp.index = (in_port_t)(port + c->tcp.fwd_in.delta[port]); - if (tap) { s = sock_l4(c, AF_INET6, IPPROTO_TCP, addr, ifname, port, tref.u32); @@ -3104,40 +3090,99 @@ static void tcp_sock_init6(const struct ctx *c, int ns, else s = -1; - if (c->tcp.fwd_out.mode == FWD_AUTO) { - if (ns) - tcp_sock_ns[port][V6] = s; - else - tcp_sock_init_lo[port][V6] = s; - } + if (c->tcp.fwd_out.mode == FWD_AUTO) + tcp_sock_init_lo[port][V6] = s; } } /** * tcp_sock_init() - Initialise listening sockets for a given port * @c: Execution context - * @ns: In pasta mode, if set, bind with loopback address in namespace * @af: Address family to select a specific IP version, or AF_UNSPEC * @addr: Pointer to address for binding, NULL if not configured * @ifname: Name of interface to bind to, NULL if not configured * @port: Port, host order */ -void tcp_sock_init(const struct ctx *c, int ns, sa_family_t af, - const void *addr, const char *ifname, in_port_t port) +void tcp_sock_init(const struct ctx *c, sa_family_t af, const void *addr, + const char *ifname, in_port_t port) { if ((af == AF_INET || af == AF_UNSPEC) && c->ifi4) - tcp_sock_init4(c, ns, addr, ifname, port); + tcp_sock_init4(c, addr, ifname, port); if ((af == AF_INET6 || af == AF_UNSPEC) && c->ifi6) - tcp_sock_init6(c, ns, addr, ifname, port); + tcp_sock_init6(c, addr, ifname, port); +} + +/** + * tcp_ns_sock_init4() - Init socket to listen for outbound IPv4 connections + * @c: Execution context + * @port: Port, host order + */ +static void tcp_ns_sock_init4(const struct ctx *c, in_port_t port) +{ + in_port_t idx = port + c->tcp.fwd_out.delta[port]; + union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.outbound = 1, + .tcp.splice = 1, .tcp.index = idx }; + struct in_addr loopback = { htonl(INADDR_LOOPBACK) }; + int s; + + assert(c->mode == MODE_PASTA); + + s = sock_l4(c, AF_INET, IPPROTO_TCP, &loopback, NULL, port, tref.u32); + if (s >= 0) + tcp_sock_set_bufsize(c, s); + else + s = -1; + + if (c->tcp.fwd_out.mode == FWD_AUTO) + tcp_sock_ns[port][V4] = s; } /** - * tcp_sock_init_ns() - Bind sockets in namespace for outbound connections + * tcp_ns_sock_init6() - Init socket to listen for outbound IPv6 connections + * @c: Execution context + * @port: Port, host order + */ +static void tcp_ns_sock_init6(const struct ctx *c, in_port_t port) +{ + in_port_t idx = port + c->tcp.fwd_out.delta[port]; + union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.outbound = 1, + .tcp.splice = 1, .tcp.v6 = 1, + .tcp.index = idx}; + int s; + + assert(c->mode == MODE_PASTA); + + s = sock_l4(c, AF_INET6, IPPROTO_TCP, &in6addr_loopback, NULL, port, + tref.u32); + if (s >= 0) + tcp_sock_set_bufsize(c, s); + else + s = -1; + + if (c->tcp.fwd_out.mode == FWD_AUTO) + tcp_sock_ns[port][V6] = s; +} + +/** + * tcp_ns_sock_init() - Init socket to listen for spliced outbound connections + * @c: Execution context + * @port: Port, host order + */ +void tcp_ns_sock_init(const struct ctx *c, in_port_t port) +{ + if (c->ifi4) + tcp_ns_sock_init4(c, port); + if (c->ifi6) + tcp_ns_sock_init6(c, port); +} + +/** + * tcp_ns_socks_init() - Bind sockets in namespace for outbound connections * @arg: Execution context * * Return: 0 */ -static int tcp_sock_init_ns(void *arg) +static int tcp_ns_socks_init(void *arg) { struct ctx *c = (struct ctx *)arg; unsigned port; @@ -3148,7 +3193,7 @@ static int tcp_sock_init_ns(void *arg) if (!bitmap_isset(c->tcp.fwd_out.map, port)) continue; - tcp_sock_init(c, 1, AF_UNSPEC, NULL, NULL, port); + tcp_ns_sock_init(c, port); } return 0; @@ -3278,7 +3323,7 @@ int tcp_init(struct ctx *c) if (c->mode == MODE_PASTA) { tcp_splice_init(c); - NS_CALL(tcp_sock_init_ns, c); + NS_CALL(tcp_ns_socks_init, c); refill_arg.ns = 1; NS_CALL(tcp_sock_refill, &refill_arg); @@ -3363,8 +3408,7 @@ static int tcp_port_rebind(void *arg) if ((a->c->ifi4 && tcp_sock_ns[port][V4] == -1) || (a->c->ifi6 && tcp_sock_ns[port][V6] == -1)) - tcp_sock_init(a->c, 1, AF_UNSPEC, NULL, NULL, - port); + tcp_ns_sock_init(a->c, port); } } else { for (port = 0; port < NUM_PORTS; port++) { @@ -3397,7 +3441,7 @@ static int tcp_port_rebind(void *arg) if ((a->c->ifi4 && tcp_sock_init_ext[port][V4] == -1) || (a->c->ifi6 && tcp_sock_init_ext[port][V6] == -1)) - tcp_sock_init(a->c, 0, AF_UNSPEC, NULL, NULL, + tcp_sock_init(a->c, AF_UNSPEC, NULL, NULL, port); } } diff --git a/tcp.h b/tcp.h index 49738ef..f4ed298 100644 --- a/tcp.h +++ b/tcp.h @@ -19,8 +19,8 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events, const struct timespec *now); int tcp_tap_handler(struct ctx *c, int af, const void *addr, const struct pool *p, const struct timespec *now); -void tcp_sock_init(const struct ctx *c, int ns, sa_family_t af, - const void *addr, const char *ifname, in_port_t port); +void tcp_sock_init(const struct ctx *c, sa_family_t af, const void *addr, + const char *ifname, in_port_t port); int tcp_init(struct ctx *c); void tcp_timer(struct ctx *c, const struct timespec *ts); void tcp_defer_handler(struct ctx *c); -- 2.38.1
In tcp_sock_handler() we split off to handle spliced sockets before
checking anything else. However the first steps of the "new connection"
path for each case are the same: allocate a connection entry and accept()
the connection.
Remove this duplication by making tcp_conn_from_sock() handle both spliced
and non-spliced cases, with help from more specific tcp_tap_conn_from_sock
and tcp_splice_conn_from_sock functions for the later stages which differ.
Signed-off-by: David Gibson
In pasta mode, tcp_sock_init[46]() create separate sockets to listen for
spliced connections (these are bound to localhost) and non-spliced
connections (these are bound to the host address). This introduces a
subtle behavioural difference between pasta and passt: by default, pasta
will listen only on a single host address, whereas passt will listen on
all addresses (0.0.0.0 or ::). This also prevents us using some additional
optimizations that only work with the unspecified (0.0.0.0 or ::) address.
However, it turns out we don't need to do this. We can splice a connection
if and only if it originates from the loopback address. Currently we
ensure this by having the "spliced" listening sockets listening only on
loopback. However, we can defer the decision about whether to splice a
connection until after accept(), by checking if the connection was made
from localhost.
Signed-off-by: David Gibson
On Mon, 14 Nov 2022 17:16:57 +1100
David Gibson
We can splice TCP connections in pasta mode if and only if they originate from localhost. Currently we separate the two cases by having separate listening sockets: one listens on the host address for non-spliceable connections, the other listens on the loopback address for spliceable connections.
I couldn't finish reviewing the whole series in detail yet, but I had a look at all the patches and, except for a couple of minor style issues (those are the ones I still need to finish checking), I couldn't see any flaw -- all the single patches look fine to me. -- Stefano
On Tue, Nov 15, 2022 at 02:22:41AM +0100, Stefano Brivio wrote:
On Mon, 14 Nov 2022 17:16:57 +1100 David Gibson
wrote: We can splice TCP connections in pasta mode if and only if they originate from localhost. Currently we separate the two cases by having separate listening sockets: one listens on the host address for non-spliceable connections, the other listens on the loopback address for spliceable connections.
I couldn't finish reviewing the whole series in detail yet, but I had a look at all the patches and, except for a couple of minor style issues (those are the ones I still need to finish checking), I couldn't see any flaw -- all the single patches look fine to me.
Ok. Probably not worth looking deeper at this stage. I'm pretty close to sending a new series which contains a slightly revised version of this, plus the rest of the stuff necessary to do dual stack TCP sockets. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
participants (2)
-
David Gibson
-
Stefano Brivio