[PATCH v2 00/10] RFC: Convert TCP connection table to generalisable flow table
This is a second draft of the first steps in implementing more general "connection" tracking, as described at: https://pad.passt.top/p/NewForwardingModel This series changes the TCP connection table into a more general flow table that can track other protocols as well (although none are implemented yet). Each flow uniformly keeps track of all the relevant addresses and ports, which will allow for more robust control of NAT and port forwarding. Caveats: * We significantly increase the size of a connection/flow entry - Can probably be mitigated, but I haven't investigated much yet * We perform a number of extra getsockname() calls to know some of the socket endpoints - Haven't yet measured how much performance impact that has - Can be mitigated in at least some cases, but again, haven't tried yet * Only TCP converted so far Changes since v1: * Terminology changes - "Endpoint" address/port instead of "correspondent" address/port - "flowside" instead of "demiflow" * Actually move the connection table to a new flow table structure in new files * Significant rearrangement of earlier patchs on top of that new table, to reduce churn David Gibson (10): flow, tcp: Generalise connection types flow, tcp: Move TCP connection table to unified flow table flow, tcp: Consolidate flow pointer<->index helpers flow: Make unified version of flow table compaction flow: Introduce struct flowside, space for uniform tracking of addresses tcp: Move guest side address tracking to flow/flowside tcp, flow: Perform TCP hash calculations based on flowside tcp: Re-use flowside_hash for initial sequence number generation tcp: Maintain host flowside for connections tcp_splice: Fill out flowside information for spliced connections Makefile | 14 +- flow.c | 111 ++++++++++++++++ flow.h | 115 +++++++++++++++++ flow_table.h | 45 +++++++ passt.h | 3 + siphash.c | 1 + tcp.c | 355 ++++++++++++++++++++++++--------------------------- tcp.h | 5 - tcp_conn.h | 54 ++------ tcp_splice.c | 78 ++++++----- tcp_splice.h | 3 +- 11 files changed, 505 insertions(+), 279 deletions(-) create mode 100644 flow.c create mode 100644 flow.h create mode 100644 flow_table.h -- 2.41.0
Currently TCP connections use a 1-bit selector, 'spliced', to determine the
rest of the contents of the structure. We want to generalise the TCP
connection table to other types of flows in other protocols. Make a start
on this by replacing the tcp_conn_common structure with a new flow_common
structure with an enum rather than a simple boolean indicating the type of
flow.
Signed-off-by: David Gibson
We want to generalise "connection" tracking to things other than true TCP
connections. Continue implenenting this by renaming the TCP connection
table to the "flow table" and moving it to flow.c. The definitions are
split between flow.h and flow_table.h - we need this separation to avoid
circular dependencies: the definitions in flow.h will be needed by many
headers using the flow mechanism, but flow_table.h needs all those protocol
specific headers in order to define the full flow table entry.
Signed-off-by: David Gibson
Both tcp.c and tcp_splice.c define CONN_IDX() variants to find the index
of their connection structures in the connection table, now become the
unified flow table. We can easily combine these into a common helper.
While we're there, add some trickery for some additional type safety.
They also define their own CONN() versions, which aren't so easily combined
since they need to return different types, but we can have them use a
common helper.
Signed-off-by: David Gibson
On Mon, 28 Aug 2023 15:41:39 +1000
David Gibson
Both tcp.c and tcp_splice.c define CONN_IDX() variants to find the index of their connection structures in the connection table, now become the unified flow table. We can easily combine these into a common helper. While we're there, add some trickery for some additional type safety.
They also define their own CONN() versions, which aren't so easily combined since they need to return different types, but we can have them use a common helper.
Signed-off-by: David Gibson
--- flow_table.h | 20 ++++++++++++++++++++ tcp.c | 49 ++++++++++++++++++++++++------------------------- tcp_conn.h | 2 +- tcp_splice.c | 17 ++++++++--------- 4 files changed, 53 insertions(+), 35 deletions(-) diff --git a/flow_table.h b/flow_table.h index c4c646b..dd4875e 100644 --- a/flow_table.h +++ b/flow_table.h @@ -22,4 +22,24 @@ union flow { /* Global Flow Table */ extern union flow flowtab[];
+ +/** flowk_idx - Index of flow from common structure
"flowk"
+ * @f: Common flow fields pointer + * + * Return: index of @f in the flow table + */ +static inline unsigned flow_idx(const struct flow_common *f) +{ + return (union flow *)f - flowtab; +} + +/** FLOW_IDX - Find the index of a flow + * @f_: Flow pointer, either union flow * or protocol specific + * + * Return: index of @f in the flow table + */ +#define FLOW_IDX(f_) (flow_idx(&(f_)->f)) + +#define FLOW(index) (&flowtab[(index)]) + #endif /* FLOW_TABLE_H */ diff --git a/tcp.c b/tcp.c index 7994197..7d2e89d 100644 --- a/tcp.c +++ b/tcp.c @@ -561,8 +561,7 @@ tcp6_l2_flags_buf[TCP_FRAMES_MEM];
static unsigned int tcp6_l2_flags_buf_used;
-#define CONN(index) (&flowtab[(index)].tcp) -#define CONN_IDX(conn) ((union flow *)(conn) - flowtab) +#define CONN(index) (&FLOW(index)->tcp)
Extra parentheses are not needed, but I've been wondering for a bit why you would use "&FLOW" here, as FLOW(x) already resolves to &flowtab[x]... perhaps (&(FLOW(index)->tcp)) is actually easier to read? -- Stefano
On Thu, Sep 07, 2023 at 03:01:21AM +0200, Stefano Brivio wrote:
On Mon, 28 Aug 2023 15:41:39 +1000 David Gibson
wrote: Both tcp.c and tcp_splice.c define CONN_IDX() variants to find the index of their connection structures in the connection table, now become the unified flow table. We can easily combine these into a common helper. While we're there, add some trickery for some additional type safety.
They also define their own CONN() versions, which aren't so easily combined since they need to return different types, but we can have them use a common helper.
Signed-off-by: David Gibson
--- flow_table.h | 20 ++++++++++++++++++++ tcp.c | 49 ++++++++++++++++++++++++------------------------- tcp_conn.h | 2 +- tcp_splice.c | 17 ++++++++--------- 4 files changed, 53 insertions(+), 35 deletions(-) diff --git a/flow_table.h b/flow_table.h index c4c646b..dd4875e 100644 --- a/flow_table.h +++ b/flow_table.h @@ -22,4 +22,24 @@ union flow { /* Global Flow Table */ extern union flow flowtab[];
+ +/** flowk_idx - Index of flow from common structure
"flowk"
Oops, fixed.
+ * @f: Common flow fields pointer + * + * Return: index of @f in the flow table + */ +static inline unsigned flow_idx(const struct flow_common *f) +{ + return (union flow *)f - flowtab; +} + +/** FLOW_IDX - Find the index of a flow + * @f_: Flow pointer, either union flow * or protocol specific + * + * Return: index of @f in the flow table + */ +#define FLOW_IDX(f_) (flow_idx(&(f_)->f)) + +#define FLOW(index) (&flowtab[(index)]) + #endif /* FLOW_TABLE_H */ diff --git a/tcp.c b/tcp.c index 7994197..7d2e89d 100644 --- a/tcp.c +++ b/tcp.c @@ -561,8 +561,7 @@ tcp6_l2_flags_buf[TCP_FRAMES_MEM];
static unsigned int tcp6_l2_flags_buf_used;
-#define CONN(index) (&flowtab[(index)].tcp) -#define CONN_IDX(conn) ((union flow *)(conn) - flowtab) +#define CONN(index) (&FLOW(index)->tcp)
Extra parentheses are not needed, but I've been wondering for a bit why you would use "&FLOW" here, as FLOW(x) already resolves to &flowtab[x]... perhaps (&(FLOW(index)->tcp)) is actually easier to read?
Makes sense to me, done. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
tcp_table_compact() will move entries in the connection/flow table to keep
it compact when other entries are removed. The moved entries need not have
the same type as the flow removed, so it needs to be able to handle moving
any type of flow. Therefore, move it to flow.c rather than being
purportedly TCP specific.
Signed-off-by: David Gibson
Handling of each protocol needs some degree of tracking of the addresses
and ports at the end of each connection or flow. Sometimes that's explicit
(as in the guest visible addresses for TCP connections), sometimes implicit
(the bound and connected addresses of sockets).
To allow more general abd robust handling, and more consistency across
protocols we want to uniformly track the address and port at each end of
the connection. Furthermore, because we allow port remapping, and we
sometimes need to apply NAT, the addresses and ports can be different as
seen by the guest/namespace and as by the host.
Introduce 'struct flowside' to keep track of the address and ports of a
flow from a single "side" (guest or host). Store two of these in the
common fields of a flow to track that information for both sides.
For now we just introduce the structure and fields themselves, along with
some simple helpers. Later patches will actually use these to store useful
information.
Signed-off-by: David Gibson
On Mon, 28 Aug 2023 15:41:41 +1000
David Gibson
Handling of each protocol needs some degree of tracking of the addresses and ports at the end of each connection or flow. Sometimes that's explicit (as in the guest visible addresses for TCP connections), sometimes implicit (the bound and connected addresses of sockets).
To allow more general abd robust handling, and more consistency across protocols we want to uniformly track the address and port at each end of the connection. Furthermore, because we allow port remapping, and we sometimes need to apply NAT, the addresses and ports can be different as seen by the guest/namespace and as by the host.
Introduce 'struct flowside' to keep track of the address and ports of a flow from a single "side" (guest or host). Store two of these in the common fields of a flow to track that information for both sides.
For now we just introduce the structure and fields themselves, along with some simple helpers. Later patches will actually use these to store useful information.
Signed-off-by: David Gibson
--- flow.c | 22 ++++++++++++++++++++++ flow.h | 48 ++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 70 insertions(+) diff --git a/flow.c b/flow.c index 12ca8db..a93cf8c 100644 --- a/flow.c +++ b/flow.c @@ -7,6 +7,7 @@
#include
#include +#include #include "util.h" #include "passt.h" @@ -24,6 +25,27 @@ const char *flow_type_str[] = { /* Global Flow Table */ union flow flowtab[FLOW_MAX];
+/** flowside_fmt - Format a flowside as a string + * @fs: flowside to format + * @buf: Buffer into which to store the formatted version + * @size: Size of @buf + * + * Return: pointer to formatted string describing @fs, or NULL on error + */ +/* cppcheck-suppress unusedFunction */ +const char *flowside_fmt(const struct flowside *fs, char *buf, size_t size) +{ + char ebuf[INET6_ADDRSTRLEN], fbuf[INET6_ADDRSTRLEN]; + + if (!inet_ntop(AF_INET6, &fs->eaddr, ebuf, sizeof(ebuf)) + || !inet_ntop(AF_INET6, &fs->faddr, fbuf, sizeof(fbuf)))
For consistency (also with flowside_complete() below): if (!inet_ntop(AF_INET6, &fs->eaddr, ebuf, sizeof(ebuf)) || !inet_ntop(AF_INET6, &fs->faddr, fbuf, sizeof(fbuf)))
+ return NULL; + + snprintf(buf, size, "[%s]:%hu <-> [%s]:%hu", fbuf, fs->fport, + ebuf, fs->eport); + return (const char *)buf; +} + /** * flow_table_compact() - Perform compaction on flow table * @c: Execution context diff --git a/flow.h b/flow.h index e212796..9891fcb 100644 --- a/flow.h +++ b/flow.h @@ -18,11 +18,59 @@ extern const char *flow_type_str[]; #define FLOW_TYPE(f) \ ((f)->type <= FLOW_MAX ? flow_type_str[(f)->type] : "?")
+/** + * struct flowside - Describes a logical packet flow as seen from one "side" + * @eaddr: Endpoint address (remote address from passt's PoV) + * @faddr: Forwarding address (local address from passt's PoV) + * @eport: Endpoint port + * @fport: Forwarding port + */ +struct flowside { + union inany_addr faddr; + union inany_addr eaddr; + in_port_t fport, eport;
I guess always valid, but uncommon (compared to in_port_t x; in_port_t y;)?
+}; + +/** flowside_from_af - Initialize a flowside from addresses + * @fs: flowside to initialize + * @af: Address family (AF_INET or AF_INET6) + * @faddr: Forwarding address (pointer to in_addr or in6_addr) + * @fport: Forwarding port + * @eaddr: Endpoint address (pointer to in_addr or in6_addr) + * @eport: Endpoint port + */ +static inline void flowside_from_af(struct flowside *fs, int af, + const void *faddr, in_port_t fport, + const void *eaddr, in_port_t eport) +{ + inany_from_af(&fs->faddr, af, faddr); + inany_from_af(&fs->eaddr, af, eaddr); + fs->fport = fport; + fs->eport = eport; +} + +/** flowside_complete - Check if flowside is fully initialized + * @fs: flowside to check + */ +static inline bool flowside_complete(const struct flowside *fs) +{ + return !IN6_IS_ADDR_UNSPECIFIED(&fs->faddr) && + !IN6_IS_ADDR_UNSPECIFIED(&fs->eaddr) && + fs->fport != 0 && fs->eport != 0;
Zero is reserved for TCP (which could be problematic anyway if we try to match things), but for UDP a "zero" port can actually be used (in the probably desuete sense of "no port"). Maybe we should use -1 instead?
+} + +#define FLOWSIDE_STRLEN (2*(INET6_ADDRSTRLEN+8) + 6)
For consistency: "(2 * INET6_ADDRSTRLEN + 8) + 6)". Limited to the usage I've seen in 6/10 (maybe I'm ignoring something): is it worth it to have flowside_fmt() as a function forming a string, rather than something calling debug() with what we want...? At the moment we have tap_packet_debug(), admittedly not very elegant but perhaps more terse than this. -- Stefano
On Thu, Sep 07, 2023 at 03:01:40AM +0200, Stefano Brivio wrote:
On Mon, 28 Aug 2023 15:41:41 +1000 David Gibson
wrote: Handling of each protocol needs some degree of tracking of the addresses and ports at the end of each connection or flow. Sometimes that's explicit (as in the guest visible addresses for TCP connections), sometimes implicit (the bound and connected addresses of sockets).
To allow more general abd robust handling, and more consistency across protocols we want to uniformly track the address and port at each end of the connection. Furthermore, because we allow port remapping, and we sometimes need to apply NAT, the addresses and ports can be different as seen by the guest/namespace and as by the host.
Introduce 'struct flowside' to keep track of the address and ports of a flow from a single "side" (guest or host). Store two of these in the common fields of a flow to track that information for both sides.
For now we just introduce the structure and fields themselves, along with some simple helpers. Later patches will actually use these to store useful information.
Signed-off-by: David Gibson
--- flow.c | 22 ++++++++++++++++++++++ flow.h | 48 ++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 70 insertions(+) diff --git a/flow.c b/flow.c index 12ca8db..a93cf8c 100644 --- a/flow.c +++ b/flow.c @@ -7,6 +7,7 @@
#include
#include +#include #include "util.h" #include "passt.h" @@ -24,6 +25,27 @@ const char *flow_type_str[] = { /* Global Flow Table */ union flow flowtab[FLOW_MAX];
+/** flowside_fmt - Format a flowside as a string + * @fs: flowside to format + * @buf: Buffer into which to store the formatted version + * @size: Size of @buf + * + * Return: pointer to formatted string describing @fs, or NULL on error + */ +/* cppcheck-suppress unusedFunction */ +const char *flowside_fmt(const struct flowside *fs, char *buf, size_t size) +{ + char ebuf[INET6_ADDRSTRLEN], fbuf[INET6_ADDRSTRLEN]; + + if (!inet_ntop(AF_INET6, &fs->eaddr, ebuf, sizeof(ebuf)) + || !inet_ntop(AF_INET6, &fs->faddr, fbuf, sizeof(fbuf)))
For consistency (also with flowside_complete() below):
if (!inet_ntop(AF_INET6, &fs->eaddr, ebuf, sizeof(ebuf)) || !inet_ntop(AF_INET6, &fs->faddr, fbuf, sizeof(fbuf)))
Done.
+ return NULL; + + snprintf(buf, size, "[%s]:%hu <-> [%s]:%hu", fbuf, fs->fport, + ebuf, fs->eport); + return (const char *)buf; +} + /** * flow_table_compact() - Perform compaction on flow table * @c: Execution context diff --git a/flow.h b/flow.h index e212796..9891fcb 100644 --- a/flow.h +++ b/flow.h @@ -18,11 +18,59 @@ extern const char *flow_type_str[]; #define FLOW_TYPE(f) \ ((f)->type <= FLOW_MAX ? flow_type_str[(f)->type] : "?")
+/** + * struct flowside - Describes a logical packet flow as seen from one "side" + * @eaddr: Endpoint address (remote address from passt's PoV) + * @faddr: Forwarding address (local address from passt's PoV) + * @eport: Endpoint port + * @fport: Forwarding port + */ +struct flowside { + union inany_addr faddr; + union inany_addr eaddr; + in_port_t fport, eport;
I guess always valid, but uncommon (compared to in_port_t x; in_port_t y;)?
Huh.. I didn't even think about that (and the fact that I did this for the ports, but not for the addresses). Changed it to the more conventional style.
+}; + +/** flowside_from_af - Initialize a flowside from addresses + * @fs: flowside to initialize + * @af: Address family (AF_INET or AF_INET6) + * @faddr: Forwarding address (pointer to in_addr or in6_addr) + * @fport: Forwarding port + * @eaddr: Endpoint address (pointer to in_addr or in6_addr) + * @eport: Endpoint port + */ +static inline void flowside_from_af(struct flowside *fs, int af, + const void *faddr, in_port_t fport, + const void *eaddr, in_port_t eport) +{ + inany_from_af(&fs->faddr, af, faddr); + inany_from_af(&fs->eaddr, af, eaddr); + fs->fport = fport; + fs->eport = eport; +} + +/** flowside_complete - Check if flowside is fully initialized + * @fs: flowside to check + */ +static inline bool flowside_complete(const struct flowside *fs) +{ + return !IN6_IS_ADDR_UNSPECIFIED(&fs->faddr) && + !IN6_IS_ADDR_UNSPECIFIED(&fs->eaddr) && + fs->fport != 0 && fs->eport != 0;
Zero is reserved for TCP (which could be problematic anyway if we try to match things),
The more practical consideration here is that a 0 port is used in the sockaddr passed to bind() to represent "pick a port for me". The point of this is to check that we have a "fully specified" flowside, that provides sufficient information to both bind() and connect() a socket with no ambiguity on the endpoints.
but for UDP a "zero" port can actually be used (in the probably desuete sense of "no port").
[Aside: I've never encountered the word desuete before] By "no port" here are you meaning for UDP traffic that expects no response? If that's so we probably neither need or want to create a flow for it anyway. In any case, even if port 0 can be used at the protocol level, I don't think it can be used at the socket level: I'm pretty sure the bind() behaviour of treating 0 as "pick for me" is true for UDP as well as TCP - it's functionality that's basically necessary, and I can't see any other way to specify it.
Maybe we should use -1 instead?
That doesn't really help - port 65535 is itself valid, so unless we widen these fields - which I don't really want to do - it's still ambiguous (in fact, worse, because AFAICT port 65535 could actually be used in practice). Even if we did widen, it's still a problem, because if we got the port from, for example, getsockname() on a not-yet-connected socket, that will give us 0 and the point of this function is to tell us that's not a fully specified endpoint.
+} + +#define FLOWSIDE_STRLEN (2*(INET6_ADDRSTRLEN+8) + 6)
For consistency: "(2 * INET6_ADDRSTRLEN + 8) + 6)".
Done.
Limited to the usage I've seen in 6/10 (maybe I'm ignoring something): is it worth it to have flowside_fmt() as a function forming a string, rather than something calling debug() with what we want...? At the moment we have tap_packet_debug(), admittedly not very elegant but perhaps more terse than this.
There's at least one more use coming in the remainder of the series, and I'd expect to see more as other protocols are added to the flow mechanics. I also think it could be a very useful helper when adding ad-hoc debugging. So, yes, I think it's worth it. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
On Thu, 7 Sep 2023 14:05:37 +1000
David Gibson
On Thu, Sep 07, 2023 at 03:01:40AM +0200, Stefano Brivio wrote:
On Mon, 28 Aug 2023 15:41:41 +1000 David Gibson
wrote: Handling of each protocol needs some degree of tracking of the addresses and ports at the end of each connection or flow. Sometimes that's explicit (as in the guest visible addresses for TCP connections), sometimes implicit (the bound and connected addresses of sockets).
To allow more general abd robust handling, and more consistency across protocols we want to uniformly track the address and port at each end of the connection. Furthermore, because we allow port remapping, and we sometimes need to apply NAT, the addresses and ports can be different as seen by the guest/namespace and as by the host.
Introduce 'struct flowside' to keep track of the address and ports of a flow from a single "side" (guest or host). Store two of these in the common fields of a flow to track that information for both sides.
For now we just introduce the structure and fields themselves, along with some simple helpers. Later patches will actually use these to store useful information.
Signed-off-by: David Gibson
--- flow.c | 22 ++++++++++++++++++++++ flow.h | 48 ++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 70 insertions(+) diff --git a/flow.c b/flow.c index 12ca8db..a93cf8c 100644 --- a/flow.c +++ b/flow.c @@ -7,6 +7,7 @@
#include
#include +#include #include "util.h" #include "passt.h" @@ -24,6 +25,27 @@ const char *flow_type_str[] = { /* Global Flow Table */ union flow flowtab[FLOW_MAX];
+/** flowside_fmt - Format a flowside as a string + * @fs: flowside to format + * @buf: Buffer into which to store the formatted version + * @size: Size of @buf + * + * Return: pointer to formatted string describing @fs, or NULL on error + */ +/* cppcheck-suppress unusedFunction */ +const char *flowside_fmt(const struct flowside *fs, char *buf, size_t size) +{ + char ebuf[INET6_ADDRSTRLEN], fbuf[INET6_ADDRSTRLEN]; + + if (!inet_ntop(AF_INET6, &fs->eaddr, ebuf, sizeof(ebuf)) + || !inet_ntop(AF_INET6, &fs->faddr, fbuf, sizeof(fbuf)))
For consistency (also with flowside_complete() below):
if (!inet_ntop(AF_INET6, &fs->eaddr, ebuf, sizeof(ebuf)) || !inet_ntop(AF_INET6, &fs->faddr, fbuf, sizeof(fbuf)))
Done.
+ return NULL; + + snprintf(buf, size, "[%s]:%hu <-> [%s]:%hu", fbuf, fs->fport, + ebuf, fs->eport); + return (const char *)buf; +} + /** * flow_table_compact() - Perform compaction on flow table * @c: Execution context diff --git a/flow.h b/flow.h index e212796..9891fcb 100644 --- a/flow.h +++ b/flow.h @@ -18,11 +18,59 @@ extern const char *flow_type_str[]; #define FLOW_TYPE(f) \ ((f)->type <= FLOW_MAX ? flow_type_str[(f)->type] : "?")
+/** + * struct flowside - Describes a logical packet flow as seen from one "side" + * @eaddr: Endpoint address (remote address from passt's PoV) + * @faddr: Forwarding address (local address from passt's PoV) + * @eport: Endpoint port + * @fport: Forwarding port + */ +struct flowside { + union inany_addr faddr; + union inany_addr eaddr; + in_port_t fport, eport;
I guess always valid, but uncommon (compared to in_port_t x; in_port_t y;)?
Huh.. I didn't even think about that (and the fact that I did this for the ports, but not for the addresses). Changed it to the more conventional style.
+}; + +/** flowside_from_af - Initialize a flowside from addresses + * @fs: flowside to initialize + * @af: Address family (AF_INET or AF_INET6) + * @faddr: Forwarding address (pointer to in_addr or in6_addr) + * @fport: Forwarding port + * @eaddr: Endpoint address (pointer to in_addr or in6_addr) + * @eport: Endpoint port + */ +static inline void flowside_from_af(struct flowside *fs, int af, + const void *faddr, in_port_t fport, + const void *eaddr, in_port_t eport) +{ + inany_from_af(&fs->faddr, af, faddr); + inany_from_af(&fs->eaddr, af, eaddr); + fs->fport = fport; + fs->eport = eport; +} + +/** flowside_complete - Check if flowside is fully initialized + * @fs: flowside to check + */ +static inline bool flowside_complete(const struct flowside *fs) +{ + return !IN6_IS_ADDR_UNSPECIFIED(&fs->faddr) && + !IN6_IS_ADDR_UNSPECIFIED(&fs->eaddr) && + fs->fport != 0 && fs->eport != 0;
Zero is reserved for TCP (which could be problematic anyway if we try to match things),
The more practical consideration here is that a 0 port is used in the sockaddr passed to bind() to represent "pick a port for me". The point of this is to check that we have a "fully specified" flowside, that provides sufficient information to both bind() and connect() a socket with no ambiguity on the endpoints.
but for UDP a "zero" port can actually be used (in the probably desuete sense of "no port").
[Aside: I've never encountered the word desuete before]
By "no port" here are you meaning for UDP traffic that expects no response? If that's so we probably neither need or want to create a flow for it anyway.
The fact that no response is expected is probably a practical consequence of this... but RFC 768, "Fields", really says: Source Port is an optional field, when meaningful, it indicates the port of the sending process, and may be assumed to be the port to which a reply should be addressed in the absence of any other information. If not used, a value of zero is inserted.
In any case, even if port 0 can be used at the protocol level, I don't think it can be used at the socket level: I'm pretty sure the bind() behaviour of treating 0 as "pick for me" is true for UDP as well as TCP - it's functionality that's basically necessary, and I can't see any other way to specify it.
Right, yes.
Maybe we should use -1 instead?
That doesn't really help - port 65535 is itself valid, so unless we widen these fields - which I don't really want to do - it's still ambiguous (in fact, worse, because AFAICT port 65535 could actually be used in practice).
Oops, sorry, I didn't consider that. Of course.
Even if we did widen, it's still a problem, because if we got the port from, for example, getsockname() on a not-yet-connected socket, that will give us 0 and the point of this function is to tell us that's not a fully specified endpoint.
Right... forget about this, I didn't consider that.
+} + +#define FLOWSIDE_STRLEN (2*(INET6_ADDRSTRLEN+8) + 6)
For consistency: "(2 * INET6_ADDRSTRLEN + 8) + 6)".
Done.
Limited to the usage I've seen in 6/10 (maybe I'm ignoring something): is it worth it to have flowside_fmt() as a function forming a string, rather than something calling debug() with what we want...? At the moment we have tap_packet_debug(), admittedly not very elegant but perhaps more terse than this.
There's at least one more use coming in the remainder of the series, and I'd expect to see more as other protocols are added to the flow mechanics. I also think it could be a very useful helper when adding ad-hoc debugging. So, yes, I think it's worth it.
Ah, okay. -- Stefano
tcp_tap_conn has several fields to track addresses and ports as seen by
the guest/namespace. However we now have general fields for this in the
common flowside fields of struct flow. Use those instead of protocol
specific fields.
So far we've only explicitly tracked the guest side forwarding address
in the TCP connection - the remote address from the guest's point of
view. The tap side endpoint address - the local address from the
guest's point of view - was assumed to always be one of the handful of
guest addresses we track as addr_seen (one each for IPv4, IPv6 global
and IPv6 link-local).
struct flowside expects both addresses, and we will want to use the
endpoint address in future. So, determine that address and store it as
part of the flowside.
Signed-off-by: David Gibson
Currently we match TCP packets received on the tap connection to a TCP
connection via a hash table based on the forwarding address and both
ports. We hope in future to allow for multiple guest side addresses,
which means we may need to distinguish based on the endpoint address
as well.
Extend the hash function to include this information. Since this now
exactly matches the contents of the guest flowside, we can base our hash
functions on that, rather than a group of individual parameters.
We also put some of the helpers in flow.h, because we hope to be able to
re-use the hashing logic for other cases in future as well.
Signed-off-by: David Gibson
We generate TCP initial sequence numbers, when we need them, from a hash
of the source and destination addresses and ports, plus a timestamp.
The contents of that hash are now exactly the same as the flowside_hash()
we use elsewhere. The values won't be identical because we order the
fields in the hash differently, but that doesn't matter for our purposes.
Signed-off-by: David Gibson
We now maintain a struct flowside describing each TCP connection as it
appears to the guest. We don't explicitly have the same information
for the connections as they appear to the host, however. Rather, that
information is implicit in the state of the host side socket. For
future generalisations of flow/connection tracking, we're going to
need to use this information more heavily, so properly populate the
other flowside in each flow table entry.
This does require an additional getsockname() call for each new connection.
We hope to optimise that away for at least some cases in future.
Signed-off-by: David Gibson
Every flow in the flow table now has space for the the addresses as seen by
both the host and guest side. We fill that information in for regular
"tap" TCP connections, but not for spliced connections.
Fill in that information for spliced connections too, so it's now uniformly
available for all flow types (that we've implemented so far).
Signed-off-by: David Gibson
On Mon, 28 Aug 2023 15:41:46 +1000
David Gibson
Every flow in the flow table now has space for the the addresses as seen by both the host and guest side. We fill that information in for regular "tap" TCP connections, but not for spliced connections.
Fill in that information for spliced connections too, so it's now uniformly available for all flow types (that we've implemented so far).
Signed-off-by: David Gibson
--- tcp.c | 46 +++++++++++++++++++--------------------------- tcp_splice.c | 40 ++++++++++++++++++++++++++-------------- tcp_splice.h | 3 +-- 3 files changed, 46 insertions(+), 43 deletions(-) diff --git a/tcp.c b/tcp.c index 297134f..7459fc2 100644 --- a/tcp.c +++ b/tcp.c @@ -2639,37 +2639,25 @@ static void tcp_snat_inbound(const struct ctx *c, union inany_addr *addr) * tcp_tap_conn_from_sock() - Initialize state for non-spliced connection * @c: Execution context * @ref: epoll reference of listening socket - * @conn: connection structure to initialize + * @conn: connection structure (with TAPSIDE(@conn) completed) * @s: Accepted socket - * @sa: Peer socket address (from accept()) * @now: Current timestamp - * - * Return: true if able to create a tap connection, false otherwise */ -static bool tcp_tap_conn_from_sock(struct ctx *c, +static void tcp_tap_conn_from_sock(struct ctx *c, union tcp_listen_epoll_ref ref, struct tcp_tap_conn *conn, int s, - struct sockaddr *sa, const struct timespec *now) { char fsstr[FLOWSIDE_STRLEN];
+ ASSERT(flowside_complete(SOCKSIDE(conn))); + conn->f.type = FLOW_TCP; conn->sock = s; conn->timer = -1; conn->ws_to_tap = conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED);
- if (flowside_getsockname(SOCKSIDE(conn), s) < 0) { - err("tcp: Failed to get local name, connection dropped"); - return false; - } - inany_from_sockaddr(&SOCKSIDE(conn)->eaddr, &SOCKSIDE(conn)->eport, sa); - - ASSERT(flowside_complete(SOCKSIDE(conn))); - debug("TCP: index %li, new connection from socket, %s", FLOW_IDX(conn), - flowside_fmt(SOCKSIDE(conn), fsstr, sizeof(fsstr))); - TAPSIDE(conn)->faddr = SOCKSIDE(conn)->eaddr; TAPSIDE(conn)->fport = SOCKSIDE(conn)->eport; tcp_snat_inbound(c, &TAPSIDE(conn)->faddr); @@ -2699,8 +2687,6 @@ static bool tcp_tap_conn_from_sock(struct ctx *c, conn_flag(c, conn, ACK_FROM_TAP_DUE);
tcp_get_sndbuf(conn); - - return true; }
/** @@ -2712,6 +2698,7 @@ static bool tcp_tap_conn_from_sock(struct ctx *c, void tcp_listen_handler(struct ctx *c, union epoll_ref ref, const struct timespec *now) { + char fsstr[FLOWSIDE_STRLEN]; struct sockaddr_storage sa; union flow *flow; socklen_t sl; @@ -2730,20 +2717,25 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref, if (s < 0) return;
- flow = flowtab + c->flow_count++; + flow = flowtab + c->flow_count;
- if (c->mode == MODE_PASTA && - tcp_splice_conn_from_sock(c, ref.tcp_listen, &flow->tcp_splice, - s, (struct sockaddr *)&sa)) + if (flowside_getsockname(&flow->f.side[0], s) < 0) { + err("tcp: Failed to get local name, connection dropped"); + close(s); return; + } + inany_from_sockaddr(&flow->f.side[0].eaddr, &flow->f.side[0].eport, + &sa); + c->flow_count++;
- if (tcp_tap_conn_from_sock(c, ref.tcp_listen, &flow->tcp, s, - (struct sockaddr *)&sa, now)) + debug("TCP: index %li, new connection from socket, %s", FLOW_IDX(flow), + flowside_fmt(&flow->f.side[0], fsstr, sizeof(fsstr))); + + if (c->mode == MODE_PASTA && + tcp_splice_conn_from_sock(c, ref.tcp_listen, &flow->tcp_splice, s)) return;
- /* Failed to create the connection */ - close(s); - c->flow_count--; + tcp_tap_conn_from_sock(c, ref.tcp_listen, &flow->tcp, s, now); }
/** diff --git a/tcp_splice.c b/tcp_splice.c index 676e7e8..018d095 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -73,6 +73,9 @@ static int ns_sock_pool6 [TCP_SOCK_POOL_SIZE]; /* Pool of pre-opened pipes */ static int splice_pipe_pool [TCP_SPLICE_PIPE_POOL_SIZE][2][2];
+#define ASIDE(conn) (&(conn)->f.side[0]) +#define BSIDE(conn) (&(conn)->f.side[1]) + #define CONN_V6(x) (x->flags & SPLICE_V6) #define CONN_V4(x) (!CONN_V6(x)) #define CONN_HAS(conn, set) ((conn->events & (set)) == (set)) @@ -310,7 +313,16 @@ void tcp_splice_destroy(struct ctx *c, union flow *flow) static int tcp_splice_connect_finish(const struct ctx *c, struct tcp_splice_conn *conn) { - int i; + char fsstr[FLOWSIDE_STRLEN]; + int i, rc; + + rc = flowside_getsockname(BSIDE(conn), conn->b); + if (rc) + return rc; + + ASSERT(flowside_complete(BSIDE(conn))); + debug("TCP (splice): index %li, connection forwarded, %s", FLOW_IDX(conn), + flowside_fmt(BSIDE(conn), fsstr, sizeof(fsstr)));
conn->pipe_a_b[0] = conn->pipe_b_a[0] = -1; conn->pipe_a_b[1] = conn->pipe_b_a[1] = -1; @@ -386,10 +398,13 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn, if (CONN_V6(conn)) { sa = (struct sockaddr *)&addr6; sl = sizeof(addr6); + inany_from_af(&BSIDE(conn)->eaddr, AF_INET6, &addr6.sin6_addr); } else { sa = (struct sockaddr *)&addr4; sl = sizeof(addr4); + inany_from_af(&BSIDE(conn)->eaddr, AF_INET, &addr4.sin_addr); } + BSIDE(conn)->eport = port;
if (connect(conn->b, sa, sl)) { if (errno != EINPROGRESS) { @@ -480,33 +495,30 @@ static void tcp_splice_dir(struct tcp_splice_conn *conn, int ref_sock, * tcp_splice_conn_from_sock() - Attempt to init state for a spliced connection * @c: Execution context * @ref: epoll reference of listening socket - * @conn: connection structure to initialize + * @conn: connection structure (with ASIDE(@conn) completed) * @s: Accepted socket - * @sa: Peer address of connection * * Return: true if able to create a spliced connection, false otherwise * #syscalls:pasta setsockopt */ bool tcp_splice_conn_from_sock(struct ctx *c, union tcp_listen_epoll_ref ref, - struct tcp_splice_conn *conn, int s, - const struct sockaddr *sa) + struct tcp_splice_conn *conn, int s) { - const struct in_addr *a4; - union inany_addr aany; - in_port_t port; + const struct in_addr *e4 = inany_v4(&ASIDE(conn)->eaddr); + const struct in_addr *f4 = inany_v4(&ASIDE(conn)->faddr);
ASSERT(c->mode == MODE_PASTA); + ASSERT(flowside_complete(ASIDE(conn)));
- inany_from_sockaddr(&aany, &port, sa); - a4 = inany_v4(&aany); - - if (a4) { - if (!IN4_IS_ADDR_LOOPBACK(a4)) + if (e4) { + if (!IN4_IS_ADDR_LOOPBACK(e4)) return false; + ASSERT(f4 && IN4_IS_ADDR_LOOPBACK(f4));
I can't follow this: the test you're replacing is actually (still) a test used by tcp_listen_handler() unless I'm missing something. Returning false here should simply mean we can't use a spliced connection, not that something is wrong.
conn->flags = 0; } else { - if (!IN6_IS_ADDR_LOOPBACK(&aany.a6)) + if (!IN6_IS_ADDR_LOOPBACK(&ASIDE(conn)->eaddr.a6)) return false; + ASSERT(IN6_IS_ADDR_LOOPBACK(&ASIDE(conn)->faddr.a6));
...same here. Everything else in the series looks good to me! It looks simpler (so far) than I thought it would be. -- Stefano
On Thu, Sep 07, 2023 at 03:02:53AM +0200, Stefano Brivio wrote:
On Mon, 28 Aug 2023 15:41:46 +1000 David Gibson
wrote: Every flow in the flow table now has space for the the addresses as seen by both the host and guest side. We fill that information in for regular "tap" TCP connections, but not for spliced connections.
Fill in that information for spliced connections too, so it's now uniformly available for all flow types (that we've implemented so far).
Signed-off-by: David Gibson
--- tcp.c | 46 +++++++++++++++++++--------------------------- tcp_splice.c | 40 ++++++++++++++++++++++++++-------------- tcp_splice.h | 3 +-- 3 files changed, 46 insertions(+), 43 deletions(-) diff --git a/tcp.c b/tcp.c index 297134f..7459fc2 100644 --- a/tcp.c +++ b/tcp.c @@ -2639,37 +2639,25 @@ static void tcp_snat_inbound(const struct ctx *c, union inany_addr *addr) * tcp_tap_conn_from_sock() - Initialize state for non-spliced connection * @c: Execution context * @ref: epoll reference of listening socket - * @conn: connection structure to initialize + * @conn: connection structure (with TAPSIDE(@conn) completed) * @s: Accepted socket - * @sa: Peer socket address (from accept()) * @now: Current timestamp - * - * Return: true if able to create a tap connection, false otherwise */ -static bool tcp_tap_conn_from_sock(struct ctx *c, +static void tcp_tap_conn_from_sock(struct ctx *c, union tcp_listen_epoll_ref ref, struct tcp_tap_conn *conn, int s, - struct sockaddr *sa, const struct timespec *now) { char fsstr[FLOWSIDE_STRLEN];
+ ASSERT(flowside_complete(SOCKSIDE(conn))); + conn->f.type = FLOW_TCP; conn->sock = s; conn->timer = -1; conn->ws_to_tap = conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED);
- if (flowside_getsockname(SOCKSIDE(conn), s) < 0) { - err("tcp: Failed to get local name, connection dropped"); - return false; - } - inany_from_sockaddr(&SOCKSIDE(conn)->eaddr, &SOCKSIDE(conn)->eport, sa); - - ASSERT(flowside_complete(SOCKSIDE(conn))); - debug("TCP: index %li, new connection from socket, %s", FLOW_IDX(conn), - flowside_fmt(SOCKSIDE(conn), fsstr, sizeof(fsstr))); - TAPSIDE(conn)->faddr = SOCKSIDE(conn)->eaddr; TAPSIDE(conn)->fport = SOCKSIDE(conn)->eport; tcp_snat_inbound(c, &TAPSIDE(conn)->faddr); @@ -2699,8 +2687,6 @@ static bool tcp_tap_conn_from_sock(struct ctx *c, conn_flag(c, conn, ACK_FROM_TAP_DUE);
tcp_get_sndbuf(conn); - - return true; }
/** @@ -2712,6 +2698,7 @@ static bool tcp_tap_conn_from_sock(struct ctx *c, void tcp_listen_handler(struct ctx *c, union epoll_ref ref, const struct timespec *now) { + char fsstr[FLOWSIDE_STRLEN]; struct sockaddr_storage sa; union flow *flow; socklen_t sl; @@ -2730,20 +2717,25 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref, if (s < 0) return;
- flow = flowtab + c->flow_count++; + flow = flowtab + c->flow_count;
- if (c->mode == MODE_PASTA && - tcp_splice_conn_from_sock(c, ref.tcp_listen, &flow->tcp_splice, - s, (struct sockaddr *)&sa)) + if (flowside_getsockname(&flow->f.side[0], s) < 0) { + err("tcp: Failed to get local name, connection dropped"); + close(s); return; + } + inany_from_sockaddr(&flow->f.side[0].eaddr, &flow->f.side[0].eport, + &sa); + c->flow_count++;
- if (tcp_tap_conn_from_sock(c, ref.tcp_listen, &flow->tcp, s, - (struct sockaddr *)&sa, now)) + debug("TCP: index %li, new connection from socket, %s", FLOW_IDX(flow), + flowside_fmt(&flow->f.side[0], fsstr, sizeof(fsstr))); + + if (c->mode == MODE_PASTA && + tcp_splice_conn_from_sock(c, ref.tcp_listen, &flow->tcp_splice, s)) return;
- /* Failed to create the connection */ - close(s); - c->flow_count--; + tcp_tap_conn_from_sock(c, ref.tcp_listen, &flow->tcp, s, now); }
/** diff --git a/tcp_splice.c b/tcp_splice.c index 676e7e8..018d095 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -73,6 +73,9 @@ static int ns_sock_pool6 [TCP_SOCK_POOL_SIZE]; /* Pool of pre-opened pipes */ static int splice_pipe_pool [TCP_SPLICE_PIPE_POOL_SIZE][2][2];
+#define ASIDE(conn) (&(conn)->f.side[0]) +#define BSIDE(conn) (&(conn)->f.side[1]) + #define CONN_V6(x) (x->flags & SPLICE_V6) #define CONN_V4(x) (!CONN_V6(x)) #define CONN_HAS(conn, set) ((conn->events & (set)) == (set)) @@ -310,7 +313,16 @@ void tcp_splice_destroy(struct ctx *c, union flow *flow) static int tcp_splice_connect_finish(const struct ctx *c, struct tcp_splice_conn *conn) { - int i; + char fsstr[FLOWSIDE_STRLEN]; + int i, rc; + + rc = flowside_getsockname(BSIDE(conn), conn->b); + if (rc) + return rc; + + ASSERT(flowside_complete(BSIDE(conn))); + debug("TCP (splice): index %li, connection forwarded, %s", FLOW_IDX(conn), + flowside_fmt(BSIDE(conn), fsstr, sizeof(fsstr)));
conn->pipe_a_b[0] = conn->pipe_b_a[0] = -1; conn->pipe_a_b[1] = conn->pipe_b_a[1] = -1; @@ -386,10 +398,13 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn, if (CONN_V6(conn)) { sa = (struct sockaddr *)&addr6; sl = sizeof(addr6); + inany_from_af(&BSIDE(conn)->eaddr, AF_INET6, &addr6.sin6_addr); } else { sa = (struct sockaddr *)&addr4; sl = sizeof(addr4); + inany_from_af(&BSIDE(conn)->eaddr, AF_INET, &addr4.sin_addr); } + BSIDE(conn)->eport = port;
if (connect(conn->b, sa, sl)) { if (errno != EINPROGRESS) { @@ -480,33 +495,30 @@ static void tcp_splice_dir(struct tcp_splice_conn *conn, int ref_sock, * tcp_splice_conn_from_sock() - Attempt to init state for a spliced connection * @c: Execution context * @ref: epoll reference of listening socket - * @conn: connection structure to initialize + * @conn: connection structure (with ASIDE(@conn) completed) * @s: Accepted socket - * @sa: Peer address of connection * * Return: true if able to create a spliced connection, false otherwise * #syscalls:pasta setsockopt */ bool tcp_splice_conn_from_sock(struct ctx *c, union tcp_listen_epoll_ref ref, - struct tcp_splice_conn *conn, int s, - const struct sockaddr *sa) + struct tcp_splice_conn *conn, int s) { - const struct in_addr *a4; - union inany_addr aany; - in_port_t port; + const struct in_addr *e4 = inany_v4(&ASIDE(conn)->eaddr); + const struct in_addr *f4 = inany_v4(&ASIDE(conn)->faddr);
ASSERT(c->mode == MODE_PASTA); + ASSERT(flowside_complete(ASIDE(conn)));
- inany_from_sockaddr(&aany, &port, sa); - a4 = inany_v4(&aany); - - if (a4) { - if (!IN4_IS_ADDR_LOOPBACK(a4)) + if (e4) { + if (!IN4_IS_ADDR_LOOPBACK(e4)) return false; + ASSERT(f4 && IN4_IS_ADDR_LOOPBACK(f4));
I can't follow this: the test you're replacing is actually (still) a test used by tcp_listen_handler() unless I'm missing something.
I'm not replacing a test - I'm just rewriting the same test, then adding an assert.
Returning false here should simply mean we can't use a spliced connection, not that something is wrong.
Yes. The 'if' checks if this is a spliceable connection - we have a loopback endpoint (remote to pasta) address. The assert is then checking that the forwarding address (local to pasta) is also loopback - i.e. that we don't have traffic that's going from 127.0.0.1 to a non-loopback address which would be nonsense. I guess this does indicate that the kernel has given us something weird, rather than an internal bug, so maybe it should be log an error and drop the connection rather than asserting.
conn->flags = 0; } else { - if (!IN6_IS_ADDR_LOOPBACK(&aany.a6)) + if (!IN6_IS_ADDR_LOOPBACK(&ASIDE(conn)->eaddr.a6)) return false; + ASSERT(IN6_IS_ADDR_LOOPBACK(&ASIDE(conn)->faddr.a6));
...same here.
Everything else in the series looks good to me! It looks simpler (so far) than I thought it would be.
-- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
On Thu, 7 Sep 2023 14:14:04 +1000
David Gibson
On Thu, Sep 07, 2023 at 03:02:53AM +0200, Stefano Brivio wrote:
On Mon, 28 Aug 2023 15:41:46 +1000 David Gibson
wrote: Every flow in the flow table now has space for the the addresses as seen by both the host and guest side. We fill that information in for regular "tap" TCP connections, but not for spliced connections.
Fill in that information for spliced connections too, so it's now uniformly available for all flow types (that we've implemented so far).
Signed-off-by: David Gibson
--- tcp.c | 46 +++++++++++++++++++--------------------------- tcp_splice.c | 40 ++++++++++++++++++++++++++-------------- tcp_splice.h | 3 +-- 3 files changed, 46 insertions(+), 43 deletions(-) diff --git a/tcp.c b/tcp.c index 297134f..7459fc2 100644 --- a/tcp.c +++ b/tcp.c @@ -2639,37 +2639,25 @@ static void tcp_snat_inbound(const struct ctx *c, union inany_addr *addr) * tcp_tap_conn_from_sock() - Initialize state for non-spliced connection * @c: Execution context * @ref: epoll reference of listening socket - * @conn: connection structure to initialize + * @conn: connection structure (with TAPSIDE(@conn) completed) * @s: Accepted socket - * @sa: Peer socket address (from accept()) * @now: Current timestamp - * - * Return: true if able to create a tap connection, false otherwise */ -static bool tcp_tap_conn_from_sock(struct ctx *c, +static void tcp_tap_conn_from_sock(struct ctx *c, union tcp_listen_epoll_ref ref, struct tcp_tap_conn *conn, int s, - struct sockaddr *sa, const struct timespec *now) { char fsstr[FLOWSIDE_STRLEN];
+ ASSERT(flowside_complete(SOCKSIDE(conn))); + conn->f.type = FLOW_TCP; conn->sock = s; conn->timer = -1; conn->ws_to_tap = conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED);
- if (flowside_getsockname(SOCKSIDE(conn), s) < 0) { - err("tcp: Failed to get local name, connection dropped"); - return false; - } - inany_from_sockaddr(&SOCKSIDE(conn)->eaddr, &SOCKSIDE(conn)->eport, sa); - - ASSERT(flowside_complete(SOCKSIDE(conn))); - debug("TCP: index %li, new connection from socket, %s", FLOW_IDX(conn), - flowside_fmt(SOCKSIDE(conn), fsstr, sizeof(fsstr))); - TAPSIDE(conn)->faddr = SOCKSIDE(conn)->eaddr; TAPSIDE(conn)->fport = SOCKSIDE(conn)->eport; tcp_snat_inbound(c, &TAPSIDE(conn)->faddr); @@ -2699,8 +2687,6 @@ static bool tcp_tap_conn_from_sock(struct ctx *c, conn_flag(c, conn, ACK_FROM_TAP_DUE);
tcp_get_sndbuf(conn); - - return true; }
/** @@ -2712,6 +2698,7 @@ static bool tcp_tap_conn_from_sock(struct ctx *c, void tcp_listen_handler(struct ctx *c, union epoll_ref ref, const struct timespec *now) { + char fsstr[FLOWSIDE_STRLEN]; struct sockaddr_storage sa; union flow *flow; socklen_t sl; @@ -2730,20 +2717,25 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref, if (s < 0) return;
- flow = flowtab + c->flow_count++; + flow = flowtab + c->flow_count;
- if (c->mode == MODE_PASTA && - tcp_splice_conn_from_sock(c, ref.tcp_listen, &flow->tcp_splice, - s, (struct sockaddr *)&sa)) + if (flowside_getsockname(&flow->f.side[0], s) < 0) { + err("tcp: Failed to get local name, connection dropped"); + close(s); return; + } + inany_from_sockaddr(&flow->f.side[0].eaddr, &flow->f.side[0].eport, + &sa); + c->flow_count++;
- if (tcp_tap_conn_from_sock(c, ref.tcp_listen, &flow->tcp, s, - (struct sockaddr *)&sa, now)) + debug("TCP: index %li, new connection from socket, %s", FLOW_IDX(flow), + flowside_fmt(&flow->f.side[0], fsstr, sizeof(fsstr))); + + if (c->mode == MODE_PASTA && + tcp_splice_conn_from_sock(c, ref.tcp_listen, &flow->tcp_splice, s)) return;
- /* Failed to create the connection */ - close(s); - c->flow_count--; + tcp_tap_conn_from_sock(c, ref.tcp_listen, &flow->tcp, s, now); }
/** diff --git a/tcp_splice.c b/tcp_splice.c index 676e7e8..018d095 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -73,6 +73,9 @@ static int ns_sock_pool6 [TCP_SOCK_POOL_SIZE]; /* Pool of pre-opened pipes */ static int splice_pipe_pool [TCP_SPLICE_PIPE_POOL_SIZE][2][2];
+#define ASIDE(conn) (&(conn)->f.side[0]) +#define BSIDE(conn) (&(conn)->f.side[1]) + #define CONN_V6(x) (x->flags & SPLICE_V6) #define CONN_V4(x) (!CONN_V6(x)) #define CONN_HAS(conn, set) ((conn->events & (set)) == (set)) @@ -310,7 +313,16 @@ void tcp_splice_destroy(struct ctx *c, union flow *flow) static int tcp_splice_connect_finish(const struct ctx *c, struct tcp_splice_conn *conn) { - int i; + char fsstr[FLOWSIDE_STRLEN]; + int i, rc; + + rc = flowside_getsockname(BSIDE(conn), conn->b); + if (rc) + return rc; + + ASSERT(flowside_complete(BSIDE(conn))); + debug("TCP (splice): index %li, connection forwarded, %s", FLOW_IDX(conn), + flowside_fmt(BSIDE(conn), fsstr, sizeof(fsstr)));
conn->pipe_a_b[0] = conn->pipe_b_a[0] = -1; conn->pipe_a_b[1] = conn->pipe_b_a[1] = -1; @@ -386,10 +398,13 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn, if (CONN_V6(conn)) { sa = (struct sockaddr *)&addr6; sl = sizeof(addr6); + inany_from_af(&BSIDE(conn)->eaddr, AF_INET6, &addr6.sin6_addr); } else { sa = (struct sockaddr *)&addr4; sl = sizeof(addr4); + inany_from_af(&BSIDE(conn)->eaddr, AF_INET, &addr4.sin_addr); } + BSIDE(conn)->eport = port;
if (connect(conn->b, sa, sl)) { if (errno != EINPROGRESS) { @@ -480,33 +495,30 @@ static void tcp_splice_dir(struct tcp_splice_conn *conn, int ref_sock, * tcp_splice_conn_from_sock() - Attempt to init state for a spliced connection * @c: Execution context * @ref: epoll reference of listening socket - * @conn: connection structure to initialize + * @conn: connection structure (with ASIDE(@conn) completed) * @s: Accepted socket - * @sa: Peer address of connection * * Return: true if able to create a spliced connection, false otherwise * #syscalls:pasta setsockopt */ bool tcp_splice_conn_from_sock(struct ctx *c, union tcp_listen_epoll_ref ref, - struct tcp_splice_conn *conn, int s, - const struct sockaddr *sa) + struct tcp_splice_conn *conn, int s) { - const struct in_addr *a4; - union inany_addr aany; - in_port_t port; + const struct in_addr *e4 = inany_v4(&ASIDE(conn)->eaddr); + const struct in_addr *f4 = inany_v4(&ASIDE(conn)->faddr);
ASSERT(c->mode == MODE_PASTA); + ASSERT(flowside_complete(ASIDE(conn)));
- inany_from_sockaddr(&aany, &port, sa); - a4 = inany_v4(&aany); - - if (a4) { - if (!IN4_IS_ADDR_LOOPBACK(a4)) + if (e4) { + if (!IN4_IS_ADDR_LOOPBACK(e4)) return false; + ASSERT(f4 && IN4_IS_ADDR_LOOPBACK(f4));
I can't follow this: the test you're replacing is actually (still) a test used by tcp_listen_handler() unless I'm missing something.
I'm not replacing a test - I'm just rewriting the same test, then adding an assert.
Gah, sorry, I misread!
Returning false here should simply mean we can't use a spliced connection, not that something is wrong.
Yes. The 'if' checks if this is a spliceable connection - we have a loopback endpoint (remote to pasta) address. The assert is then checking that the forwarding address (local to pasta) is also loopback - i.e. that we don't have traffic that's going from 127.0.0.1 to a non-loopback address which would be nonsense.
I guess this does indicate that the kernel has given us something weird, rather than an internal bug, so maybe it should be log an error and drop the connection rather than asserting.
Sure, I see now. -- Stefano
participants (2)
-
David Gibson
-
Stefano Brivio