[PATCH v3 00/16] Introduce unified flow table, first steps
Here's my latest revision of some of the basics of the flow table. So far it's basically just a renaming of the existing TCP connection table, along with some associated helpers. It's used for some new logging infrastructure, but otherwise doesn't really function any differently. However, this subset of the flow table work no longer bloats flow/connection entries over a single cache line. That removes the most prominent drawback of earlier revisions, meaning I think this series is ready for merge now. Doing so will mean the later series making more substantive changes to the flow behaviour are simpler. Tested on top of the patch updating shell prompt escape handling, but should be independent of it. Changes since v2: * Added a patch to only use C11 static_assert(), not C23 static_assert() (needed for next change) [1/16] * Better handling of the bounds on valid values of enum flow_type [2/16] * No longer introduce an additional C23 style static_assert() [8/16] * Add fix for overly long guestfish commands (needed for next change) [13/16] * Added a patch supporting names for pifs [14/16] * Added patches with some further TCP reworks in preparation for the general flow table [15-16/16] Changes since v1: * Removed a inaccurate stale comment * Added doc comment to FLOW() macro * Added new patches cleaning up signedness of 'side' variables * Added new patches introducing "sidx"s (flow+side indices) David Gibson (16): treewide: Add messages to static_assert() calls flow, tcp: Generalise connection types flow, tcp: Move TCP connection table to unified flow table flow, tcp: Consolidate flow pointer<->index helpers util: MAX_FROM_BITS() should be unsigned flow: Make unified version of flow table compaction flow, tcp: Add logging helpers for connection related messages flow: Introduce 'sidx' type to represent one side of one flow tcp: Remove unneccessary bounds check in tcp_timer_handler() flow,tcp: Generalise TCP epoll_ref to generic flows tcp_splice: Use unsigned to represent side flow,tcp: Use epoll_ref type including flow and side test: Avoid hitting guestfish command length limits pif: Add helpers to get the name of a pif tcp: "TCP" hash secret doesn't need to be TCP specific tcp: Don't defer hash table removal Makefile | 15 +- flow.c | 86 +++++++++++ flow.h | 74 ++++++++++ flow_table.h | 86 +++++++++++ inany.h | 6 +- passt.c | 40 ++++++ passt.h | 15 +- pif.c | 21 +++ pif.h | 19 +++ tcp.c | 288 ++++++++++++++++--------------------- tcp.h | 7 - tcp_conn.h | 46 ++---- tcp_splice.c | 128 +++++++---------- tcp_splice.h | 2 +- test/prepare-distro-img.sh | 2 +- util.h | 2 +- 16 files changed, 540 insertions(+), 297 deletions(-) create mode 100644 flow.c create mode 100644 flow.h create mode 100644 flow_table.h create mode 100644 pif.c -- 2.43.0
A while ago, we updated passt to require C11, for several reasons, but one
was to be able to use static_assert() for build time checks. The C11
version of static_assert() requires a message to print in case of failure
as well as the test condition it self. It was extended in C23 to make the
message optional, but we don't want to require C23 at this time.
Unfortunately we missed that in some of the static_assert()s we already
added which use the C23 form without a message. clang-tidy has a warning
for this, but for some reason it's not seeming to trigger in the current
cases (but did for some I'm working on adding).
Signed-off-by: David Gibson
Currently TCP connections use a 1-bit selector, 'spliced', to determine the
rest of the contents of the structure. We want to generalise the TCP
connection table to other types of flows in other protocols. Make a start
on this by replacing the tcp_conn_common structure with a new flow_common
structure with an enum rather than a simple boolean indicating the type of
flow.
Signed-off-by: David Gibson
We want to generalise "connection" tracking to things other than true TCP
connections. Continue implenenting this by renaming the TCP connection
table to the "flow table" and moving it to flow.c. The definitions are
split between flow.h and flow_table.h - we need this separation to avoid
circular dependencies: the definitions in flow.h will be needed by many
headers using the flow mechanism, but flow_table.h needs all those protocol
specific headers in order to define the full flow table entry.
Signed-off-by: David Gibson
Both tcp.c and tcp_splice.c define CONN_IDX() variants to find the index
of their connection structures in the connection table, now become the
unified flow table. We can easily combine these into a common helper.
While we're there, add some trickery for some additional type safety.
They also define their own CONN() versions, which aren't so easily combined
since they need to return different types, but we can have them use a
common helper.
In the process, we standardise on always using an unsigned type to store
the connection / flow index, which makes more sense. tcp.c's conn_at_idx()
remains for now, but we change its parameter to unsigned to match. That in
turn means we can remove a check for negative values from it.
Signed-off-by: David Gibson
MAX_FROM_BITS() computes the maximum value representable in a number of
bits. The expression for that is an unsigned value, but we explicitly cast
it to a signed int. It looks like this is because one of the main users is
for FD_REF_MAX, which is used to bound fd values, typically stored as a
signed int.
The value MAX_FROM_BITS() is calculating is naturally non-negative, though,
so it makes more sense for it to be unsigned, and to move the case to the
definition of FD_REF_MAX.
Signed-off-by: David Gibson
tcp_table_compact() will move entries in the connection/flow table to keep
it compact when other entries are removed. The moved entries need not have
the same type as the flow removed, so it needs to be able to handle moving
any type of flow. Therefore, move it to flow.c rather than being
purportedly TCP specific.
Signed-off-by: David Gibson
Most of the messages logged by the TCP code (be they errors, debug or
trace messages) are related to a specific connection / flow. We're fairly
consistent about prefixing these with the type of connection and the
connection / flow index. However there are a few places where we put the
index later in the message or omit it entirely. The template with the
prefix is also a little bulky to carry around for every message,
particularly for spliced connections.
To help keep this consistent, introduce some helpers to log messages
linked to a specific flow. It takes the flow as a parameter and adds a
uniform prefix to each message. This makes things slightly neater now, but
more importantly will help keep formatting consistent as we add more things
to the flow table.
Signed-off-by: David Gibson
In a number of places, we use indices into the flow table to identify a
specific flow. We also have cases where we need to identify a particular
side of a particular flow, and we expect those to become more common as
we generalise the flow table to cover more things.
To assist with that, introduces flow_sidx_t, an index type which identifies
a specific side of a specific flow in the table.
Signed-off-by: David Gibson
On Thu, 30 Nov 2023 13:02:14 +1100
David Gibson
In a number of places, we use indices into the flow table to identify a specific flow. We also have cases where we need to identify a particular side of a particular flow, and we expect those to become more common as we generalise the flow table to cover more things.
To assist with that, introduces flow_sidx_t, an index type which identifies a specific side of a specific flow in the table.
Signed-off-by: David Gibson
--- flow.h | 14 ++++++++++++++ flow_table.h | 36 ++++++++++++++++++++++++++++++++++++ 2 files changed, 50 insertions(+) diff --git a/flow.h b/flow.h index c820a15..4f12831 100644 --- a/flow.h +++ b/flow.h @@ -39,6 +39,20 @@ struct flow_common { #define FLOW_TABLE_PRESSURE 30 /* % of FLOW_MAX */ #define FLOW_FILE_PRESSURE 30 /* % of c->nofile */
+/** + * struct flow_sidx - ID for one side of a specific flow + * @side: Side referenced (0 or 1) + * @flow: Index of flow referenced + */ +typedef struct flow_sidx { + int side :1; + unsigned flow :FLOW_INDEX_BITS; +} flow_sidx_t; +static_assert(sizeof(flow_sidx_t) <= sizeof(uint32_t), + "flow_sidx_t must fit within 32 bits"); + +#define FLOW_SIDX_NONE ((flow_sidx_t){ .flow = FLOW_MAX }) + union flow;
void flow_table_compact(struct ctx *c, union flow *hole); diff --git a/flow_table.h b/flow_table.h index 5e897bd..3c68d4a 100644 --- a/flow_table.h +++ b/flow_table.h @@ -47,4 +47,40 @@ static inline unsigned flow_idx(const struct flow_common *f) */ #define FLOW(idx) (&flowtab[(idx)])
+/** flow_at_sidx - Flow entry for a given sidx + * @sidx: Flow & side index + * + * Return: pointer to the corresponding flow entry, or NULL + */ +static inline union flow *flow_at_sidx(flow_sidx_t sidx) +{ + if (sidx.flow >= FLOW_MAX) + return NULL; + return FLOW(sidx.flow); +} + +/** flow_sidx_t - Index of one side of a flow from common structure + * @f: Common flow fields pointer + * @side: Which side to refer to (0 or 1) + * + * Return: index of @f and @side in the flow table + */ +static inline flow_sidx_t flow_sidx(const struct flow_common *f, + int side) +{ + ASSERT(side == !!side);
This gives me a false cppcheck (2.10) positive: flow_table.h:71:2: style: The comparison 'side == !!side' is always true because 'side' and '!!side' represent the same value. [knownConditionTrueFalse] ASSERT(side == !!side); ^ I guess cppcheck commit 5d1fdf795829 ("Fix issue 7904: Handle double nots in isSameExpression (#1305)") should be extended to also skip checking double nots in equality comparisons: diff --git a/lib/astutils.cpp b/lib/astutils.cpp index 2c01884c9..8e88ec817 100644 --- a/lib/astutils.cpp +++ b/lib/astutils.cpp @@ -1424,10 +1424,10 @@ bool isSameExpression(bool cpp, bool macro, const Token *tok1, const Token *tok2 tok2 = tok2->astOperand2(); } // Skip double not - if (Token::simpleMatch(tok1, "!") && Token::simpleMatch(tok1->astOperand1(), "!") && !Token::simpleMatch(tok1->astParent(), "=")) { + if (Token::simpleMatch(tok1, "!") && Token::simpleMatch(tok1->astOperand1(), "!") && !Token::simpleMatch(tok1->astParent(), "=") && !Token::simpleMatch(tok1->astParent(), "==")) { return isSameExpression(cpp, macro, tok1->astOperand1()->astOperand1(), tok2, library, pure, followVar, errors); } - if (Token::simpleMatch(tok2, "!") && Token::simpleMatch(tok2->astOperand1(), "!") && !Token::simpleMatch(tok2->astParent(), "=")) { + if (Token::simpleMatch(tok2, "!") && Token::simpleMatch(tok2->astOperand1(), "!") && !Token::simpleMatch(tok2->astParent(), "=") && !Token::simpleMatch(tok2->astParent(), "==")) { return isSameExpression(cpp, macro, tok1, tok2->astOperand1()->astOperand1(), library, pure, followVar, errors); } const bool tok_str_eq = tok1->str() == tok2->str(); ...anyway, I would just add an inline suppression for the moment, unless you have better ideas. -- Stefano
In tcp_timer_handler() we use conn_at_idx() to interpret the flow index
from the epoll reference. However, this will never be NULL - we always
put a valid index into the epoll_ref. Simplify slightly based on this.
Signed-off-by: David Gibson
TCP uses three different epoll object types: one for connected sockets, one
for timers and one for listening sockets. Listening sockets really need
information that's specific to TCP, so need their own epoll_ref field.
Timers and connected sockets, however, only need the connection (flow)
they're associated with. As we expand the use of the flow table, we expect
that to be true for more epoll fds. So, rename the "TCP" epoll_ref field
to be a "flow" epoll_ref field that can be used both for TCP and for other
future cases.
Signed-off-by: David Gibson
Currently, we use 'int' values to represent the "side" of a connection,
which must always be 0 or 1. This turns out to be dangerous.
In some cases we're going to want to put the side into a 1-bit bitfield.
However, if that bitfield has type 'int', when we copy it out to a regular
'int' variable, it will be sign-extended and so have values 0 and -1,
instead of 0 and 1.
To avoid this, always use unsigned variables for the side.
Signed-off-by: David Gibson
Currently TCP uses the 'flow' epoll_ref field for both connected
sockets and timers, which consists of just the index of the relevant
flow (connection).
This is just fine for timers, for while it obviously works, it's
subtly incomplete for sockets on spliced connections. In that case we
want to know which side of the connection the event is occurring on as
well as which connection. At present, we deduce that information by
looking at the actual fd, and comparing it to the fds of the sockets
on each side.
When we use the flow table for more things, we expect more cases where
something will need to know a specific side of a specific flow for an
event, but nothing more.
Therefore add a new 'flowside' epoll_ref field, with exactly that
information. We use it for TCP connected sockets. This allows us to
directly know the side for spliced connections. For "tap"
connections, it's pretty meaningless, since the side is always the
socket side. It still makes logical sense though, and it may become
important for future flow table work.
Signed-off-by: David Gibson
In test/prepare-distro-img.sh we use guestfish to tweak our distro guest
images to be suitable. Part of this is using a 'copy-in' directive to copy
in the source files for passt itself. Currently we copy in all the files
with a single 'copy-in', since it allows listing multiple files. However
it turns out that the number of arguments it can accept is fairly limited
and our current list of files is already very close to that limit.
Instead, expand our list of files to one copy-in per file, avoiding that
limitation. This isn't much slower, because all the commands still run in
a single invocation of guestfish itself.
Signed-off-by: David Gibson
Future debugging will want to identify a specific passt interface. We make
a distinction in these helpers between the name of the *type* of pif, and
name of the pif itself. For the time being these are always the same
thing, since we have at most instance of each type of pif. However, that
might change in future.
Signed-off-by: David Gibson
On Thu, 30 Nov 2023 13:02:20 +1100
David Gibson
Future debugging will want to identify a specific passt interface. We make a distinction in these helpers between the name of the *type* of pif, and name of the pif itself. For the time being these are always the same thing, since we have at most instance of each type of pif. However, that might change in future.
Signed-off-by: David Gibson
--- Makefile | 3 ++- pif.c | 21 +++++++++++++++++++++ pif.h | 19 +++++++++++++++++++ 3 files changed, 42 insertions(+), 1 deletion(-) create mode 100644 pif.c diff --git a/Makefile b/Makefile index d14b029..af4fa87 100644 --- a/Makefile +++ b/Makefile @@ -46,7 +46,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c icmp.c \ igmp.c isolation.c lineread.c log.c mld.c ndp.c netlink.c packet.c \ - passt.c pasta.c pcap.c port_fwd.c tap.c tcp.c tcp_splice.c udp.c util.c + passt.c pasta.c pcap.c pif.c port_fwd.c tap.c tcp.c tcp_splice.c udp.c \ + util.c QRAP_SRCS = qrap.c SRCS = $(PASST_SRCS) $(QRAP_SRCS)
diff --git a/pif.c b/pif.c new file mode 100644 index 0000000..ebf01cc --- /dev/null +++ b/pif.c @@ -0,0 +1,21 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later + * Copyright Red Hat + * Author: David Gibson
+ * + * Passt/pasta interface types and IDs + */ + +#include +#include + +#include "util.h" +#include "pif.h" + +const char *pif_type_str[] = { + [PIF_NONE] = "<none>", + [PIF_HOST] = "HOST", + [PIF_TAP] = "TAP", + [PIF_SPLICE] = "SPLICE",
Nit, not worth respinning in my opinion: those are just English nouns, not acronyms, so they could simply be printed as "host", "tap", and "splice". -- Stefano
On Thu, Nov 30, 2023 at 01:45:25PM +0100, Stefano Brivio wrote:
On Thu, 30 Nov 2023 13:02:20 +1100 David Gibson
wrote: Future debugging will want to identify a specific passt interface. We make a distinction in these helpers between the name of the *type* of pif, and name of the pif itself. For the time being these are always the same thing, since we have at most instance of each type of pif. However, that might change in future.
Signed-off-by: David Gibson
--- Makefile | 3 ++- pif.c | 21 +++++++++++++++++++++ pif.h | 19 +++++++++++++++++++ 3 files changed, 42 insertions(+), 1 deletion(-) create mode 100644 pif.c diff --git a/Makefile b/Makefile index d14b029..af4fa87 100644 --- a/Makefile +++ b/Makefile @@ -46,7 +46,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c icmp.c \ igmp.c isolation.c lineread.c log.c mld.c ndp.c netlink.c packet.c \ - passt.c pasta.c pcap.c port_fwd.c tap.c tcp.c tcp_splice.c udp.c util.c + passt.c pasta.c pcap.c pif.c port_fwd.c tap.c tcp.c tcp_splice.c udp.c \ + util.c QRAP_SRCS = qrap.c SRCS = $(PASST_SRCS) $(QRAP_SRCS)
diff --git a/pif.c b/pif.c new file mode 100644 index 0000000..ebf01cc --- /dev/null +++ b/pif.c @@ -0,0 +1,21 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later + * Copyright Red Hat + * Author: David Gibson
+ * + * Passt/pasta interface types and IDs + */ + +#include +#include + +#include "util.h" +#include "pif.h" + +const char *pif_type_str[] = { + [PIF_NONE] = "<none>", + [PIF_HOST] = "HOST", + [PIF_TAP] = "TAP", + [PIF_SPLICE] = "SPLICE", Nit, not worth respinning in my opinion: those are just English nouns, not acronyms, so they could simply be printed as "host", "tap", and "splice".
I'd been thinking of pif names in caps to make them stand out from other things, but it doesn't really matter that much. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
The TCP state structure includes a 128-bit hash_secret which we use for
SipHash calculations to mitigate attacks on the TCP hash table and initial
sequence number.
We have plans to use SipHash in places that aren't TCP related, and there's
no particular reason they'd need their own secret. So move the hash_secret
to the general context structure.
Signed-off-by: David Gibson
When a TCP connection is closed, we mark it by setting events to CLOSED,
then some time later we do final cleanups: closing sockets, removing from
the hash table and so forth.
This does mean that when making a hash lookup we need to exclude any
apparent matches that are CLOSED, since they represent a stale connection.
This can happen in practice if one connection closes and a new one with the
same endpoints is started shortly afterward.
Checking for CLOSED is quite specific to TCP however, and won't work when
we extend the hash table to more general flows. So, alter the code to
immediately remove the connection from the hash table when CLOSED, although
we still defer closing sockets and other cleanup.
Signed-off-by: David Gibson
On Thu, 30 Nov 2023 13:02:22 +1100
David Gibson
When a TCP connection is closed, we mark it by setting events to CLOSED, then some time later we do final cleanups: closing sockets, removing from the hash table and so forth.
This does mean that when making a hash lookup we need to exclude any apparent matches that are CLOSED, since they represent a stale connection. This can happen in practice if one connection closes and a new one with the same endpoints is started shortly afterward.
Checking for CLOSED is quite specific to TCP however, and won't work when we extend the hash table to more general flows. So, alter the code to immediately remove the connection from the hash table when CLOSED, although we still defer closing sockets and other cleanup.
Signed-off-by: David Gibson
--- tcp.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/tcp.c b/tcp.c index 74d06bf..17c7cba 100644 --- a/tcp.c +++ b/tcp.c @@ -781,6 +781,9 @@ static void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn, tcp_timer_ctl(c, conn); }
+static void tcp_hash_remove(const struct ctx *c, + const struct tcp_tap_conn *conn); + /** * conn_event_do() - Set and log connection events, update epoll state * @c: Execution context @@ -825,7 +828,9 @@ static void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn, flow_dbg(conn, "%s", num == -1 ? "CLOSED" : tcp_event_str[num]);
- if ((event == TAP_FIN_RCVD) && !(conn->events & SOCK_FIN_RCVD)) + if (event == CLOSED) + tcp_hash_remove(c, conn); + else if ((event == TAP_FIN_RCVD) && !(conn->events & SOCK_FIN_RCVD)) conn_flag(c, conn, ACTIVE_CLOSE); else tcp_epoll_ctl(c, conn); @@ -1150,7 +1155,7 @@ static int tcp_hash_match(const struct tcp_tap_conn *conn, const union inany_addr *faddr, in_port_t eport, in_port_t fport) { - if (conn->events != CLOSED && inany_equals(&conn->faddr, faddr) && + if (inany_equals(&conn->faddr, faddr) && conn->eport == eport && conn->fport == fport) return 1;
@@ -1308,7 +1313,6 @@ static void tcp_conn_destroy(struct ctx *c, union flow *flow) if (conn->timer != -1) close(conn->timer);
- tcp_hash_remove(c, conn); flow_table_compact(c, flow);
I was pretty sure, due to the way I originally implemented this, that removing an entry from the hash table without compacting the table afterwards, with an event possibly coming between the two, would present some inconsistency while we're handling that event. But looking at it now, I don't see any issue with this. I just wanted to raise it in case you're aware of (but didn't think about) some concern in this sense. By the way, the reason why I deferred tcp_hash_remove() back then was to save cycles between epoll events and get higher CRR rates, but I think the effect is negligible anyway. -- Stefano
On Thu, Nov 30, 2023 at 01:45:32PM +0100, Stefano Brivio wrote:
On Thu, 30 Nov 2023 13:02:22 +1100 David Gibson
wrote: When a TCP connection is closed, we mark it by setting events to CLOSED, then some time later we do final cleanups: closing sockets, removing from the hash table and so forth.
This does mean that when making a hash lookup we need to exclude any apparent matches that are CLOSED, since they represent a stale connection. This can happen in practice if one connection closes and a new one with the same endpoints is started shortly afterward.
Checking for CLOSED is quite specific to TCP however, and won't work when we extend the hash table to more general flows. So, alter the code to immediately remove the connection from the hash table when CLOSED, although we still defer closing sockets and other cleanup.
Signed-off-by: David Gibson
--- tcp.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/tcp.c b/tcp.c index 74d06bf..17c7cba 100644 --- a/tcp.c +++ b/tcp.c @@ -781,6 +781,9 @@ static void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn, tcp_timer_ctl(c, conn); }
+static void tcp_hash_remove(const struct ctx *c, + const struct tcp_tap_conn *conn); + /** * conn_event_do() - Set and log connection events, update epoll state * @c: Execution context @@ -825,7 +828,9 @@ static void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn, flow_dbg(conn, "%s", num == -1 ? "CLOSED" : tcp_event_str[num]);
- if ((event == TAP_FIN_RCVD) && !(conn->events & SOCK_FIN_RCVD)) + if (event == CLOSED) + tcp_hash_remove(c, conn); + else if ((event == TAP_FIN_RCVD) && !(conn->events & SOCK_FIN_RCVD)) conn_flag(c, conn, ACTIVE_CLOSE); else tcp_epoll_ctl(c, conn); @@ -1150,7 +1155,7 @@ static int tcp_hash_match(const struct tcp_tap_conn *conn, const union inany_addr *faddr, in_port_t eport, in_port_t fport) { - if (conn->events != CLOSED && inany_equals(&conn->faddr, faddr) && + if (inany_equals(&conn->faddr, faddr) && conn->eport == eport && conn->fport == fport) return 1;
@@ -1308,7 +1313,6 @@ static void tcp_conn_destroy(struct ctx *c, union flow *flow) if (conn->timer != -1) close(conn->timer);
- tcp_hash_remove(c, conn); flow_table_compact(c, flow);
I was pretty sure, due to the way I originally implemented this, that removing an entry from the hash table without compacting the table afterwards, with an event possibly coming between the two, would present some inconsistency while we're handling that event.
But looking at it now, I don't see any issue with this. I just wanted to raise it in case you're aware of (but didn't think about) some concern in this sense.
I think it's ok. The thing is that compacting the connection table itself is largely independent of the hash table, whose buckets are separately indexed. A hash remove shuffles things around in the hash buckets, but doesn't change where connections sit in the connection table. A connection table compaction changes the indices in the connection table, which requires updating things in the hash buckets, but not moving things around in the buckets - exactly the same entries are in every hash bucket, it's just that one of them has a new "name" now.
By the way, the reason why I deferred tcp_hash_remove() back then was to save cycles between epoll events and get higher CRR rates, but I think the effect is negligible anyway.
Right.. to process a FIN and the next SYN at once, I guess? I figured this might make a difference, but probably not much. There's no syscall here, and batching doesn't reduce the total amount of work in this case. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
On Fri, 1 Dec 2023 11:07:50 +1100
David Gibson
On Thu, Nov 30, 2023 at 01:45:32PM +0100, Stefano Brivio wrote:
On Thu, 30 Nov 2023 13:02:22 +1100 David Gibson
wrote: When a TCP connection is closed, we mark it by setting events to CLOSED, then some time later we do final cleanups: closing sockets, removing from the hash table and so forth.
This does mean that when making a hash lookup we need to exclude any apparent matches that are CLOSED, since they represent a stale connection. This can happen in practice if one connection closes and a new one with the same endpoints is started shortly afterward.
Checking for CLOSED is quite specific to TCP however, and won't work when we extend the hash table to more general flows. So, alter the code to immediately remove the connection from the hash table when CLOSED, although we still defer closing sockets and other cleanup.
Signed-off-by: David Gibson
--- tcp.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/tcp.c b/tcp.c index 74d06bf..17c7cba 100644 --- a/tcp.c +++ b/tcp.c @@ -781,6 +781,9 @@ static void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn, tcp_timer_ctl(c, conn); }
+static void tcp_hash_remove(const struct ctx *c, + const struct tcp_tap_conn *conn); + /** * conn_event_do() - Set and log connection events, update epoll state * @c: Execution context @@ -825,7 +828,9 @@ static void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn, flow_dbg(conn, "%s", num == -1 ? "CLOSED" : tcp_event_str[num]);
- if ((event == TAP_FIN_RCVD) && !(conn->events & SOCK_FIN_RCVD)) + if (event == CLOSED) + tcp_hash_remove(c, conn); + else if ((event == TAP_FIN_RCVD) && !(conn->events & SOCK_FIN_RCVD)) conn_flag(c, conn, ACTIVE_CLOSE); else tcp_epoll_ctl(c, conn); @@ -1150,7 +1155,7 @@ static int tcp_hash_match(const struct tcp_tap_conn *conn, const union inany_addr *faddr, in_port_t eport, in_port_t fport) { - if (conn->events != CLOSED && inany_equals(&conn->faddr, faddr) && + if (inany_equals(&conn->faddr, faddr) && conn->eport == eport && conn->fport == fport) return 1;
@@ -1308,7 +1313,6 @@ static void tcp_conn_destroy(struct ctx *c, union flow *flow) if (conn->timer != -1) close(conn->timer);
- tcp_hash_remove(c, conn); flow_table_compact(c, flow);
I was pretty sure, due to the way I originally implemented this, that removing an entry from the hash table without compacting the table afterwards, with an event possibly coming between the two, would present some inconsistency while we're handling that event.
But looking at it now, I don't see any issue with this. I just wanted to raise it in case you're aware of (but didn't think about) some concern in this sense.
I think it's ok. The thing is that compacting the connection table itself is largely independent of the hash table, whose buckets are separately indexed. A hash remove shuffles things around in the hash buckets, but doesn't change where connections sit in the connection table. A connection table compaction changes the indices in the connection table, which requires updating things in the hash buckets, but not moving things around in the buckets - exactly the same entries are in every hash bucket, it's just that one of them has a new "name" now.
By the way, the reason why I deferred tcp_hash_remove() back then was to save cycles between epoll events and get higher CRR rates, but I think the effect is negligible anyway.
Right.. to process a FIN and the next SYN at once, I guess?
That's one example, yes, but in any case it was an optimisation for...
I figured this might make a difference, but probably not much. There's no syscall here, and batching doesn't reduce the total amount of work in this case.
supposedly better data locality, with batching. But never micro-benchmarked, and surely negligible anyway. -- Stefano
On Sat, Dec 02, 2023 at 05:34:58AM +0100, Stefano Brivio wrote:
On Fri, 1 Dec 2023 11:07:50 +1100 David Gibson
wrote: On Thu, Nov 30, 2023 at 01:45:32PM +0100, Stefano Brivio wrote:
On Thu, 30 Nov 2023 13:02:22 +1100 David Gibson
wrote: When a TCP connection is closed, we mark it by setting events to CLOSED, then some time later we do final cleanups: closing sockets, removing from the hash table and so forth.
This does mean that when making a hash lookup we need to exclude any apparent matches that are CLOSED, since they represent a stale connection. This can happen in practice if one connection closes and a new one with the same endpoints is started shortly afterward.
Checking for CLOSED is quite specific to TCP however, and won't work when we extend the hash table to more general flows. So, alter the code to immediately remove the connection from the hash table when CLOSED, although we still defer closing sockets and other cleanup.
Signed-off-by: David Gibson
--- tcp.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/tcp.c b/tcp.c index 74d06bf..17c7cba 100644 --- a/tcp.c +++ b/tcp.c @@ -781,6 +781,9 @@ static void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn, tcp_timer_ctl(c, conn); }
+static void tcp_hash_remove(const struct ctx *c, + const struct tcp_tap_conn *conn); + /** * conn_event_do() - Set and log connection events, update epoll state * @c: Execution context @@ -825,7 +828,9 @@ static void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn, flow_dbg(conn, "%s", num == -1 ? "CLOSED" : tcp_event_str[num]);
- if ((event == TAP_FIN_RCVD) && !(conn->events & SOCK_FIN_RCVD)) + if (event == CLOSED) + tcp_hash_remove(c, conn); + else if ((event == TAP_FIN_RCVD) && !(conn->events & SOCK_FIN_RCVD)) conn_flag(c, conn, ACTIVE_CLOSE); else tcp_epoll_ctl(c, conn); @@ -1150,7 +1155,7 @@ static int tcp_hash_match(const struct tcp_tap_conn *conn, const union inany_addr *faddr, in_port_t eport, in_port_t fport) { - if (conn->events != CLOSED && inany_equals(&conn->faddr, faddr) && + if (inany_equals(&conn->faddr, faddr) && conn->eport == eport && conn->fport == fport) return 1;
@@ -1308,7 +1313,6 @@ static void tcp_conn_destroy(struct ctx *c, union flow *flow) if (conn->timer != -1) close(conn->timer);
- tcp_hash_remove(c, conn); flow_table_compact(c, flow);
I was pretty sure, due to the way I originally implemented this, that removing an entry from the hash table without compacting the table afterwards, with an event possibly coming between the two, would present some inconsistency while we're handling that event.
But looking at it now, I don't see any issue with this. I just wanted to raise it in case you're aware of (but didn't think about) some concern in this sense.
I think it's ok. The thing is that compacting the connection table itself is largely independent of the hash table, whose buckets are separately indexed. A hash remove shuffles things around in the hash buckets, but doesn't change where connections sit in the connection table. A connection table compaction changes the indices in the connection table, which requires updating things in the hash buckets, but not moving things around in the buckets - exactly the same entries are in every hash bucket, it's just that one of them has a new "name" now.
By the way, the reason why I deferred tcp_hash_remove() back then was to save cycles between epoll events and get higher CRR rates, but I think the effect is negligible anyway.
Right.. to process a FIN and the next SYN at once, I guess?
That's one example, yes, but in any case it was an optimisation for...
I figured this might make a difference, but probably not much. There's no syscall here, and batching doesn't reduce the total amount of work in this case.
supposedly better data locality, with batching. But never micro-benchmarked, and surely negligible anyway.
Hmm.. except hash tables are by construction non-local, so I wouldn't really expect batching unrelated hash entries to do that. Even if the connection table entries themselves are close, which is more likely, they take a cacheline each on the most common platform, so that's not likely to win anything. In fact if anything, I'd expect better locality with the non-deferred approach - we're triggering the hash remove when we've already been working on that connection, so it should be cache hot. I guess batching does win some icache locality. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
On Thu, 30 Nov 2023 13:02:06 +1100
David Gibson
Here's my latest revision of some of the basics of the flow table. So far it's basically just a renaming of the existing TCP connection table, along with some associated helpers. It's used for some new logging infrastructure, but otherwise doesn't really function any differently.
However, this subset of the flow table work no longer bloats flow/connection entries over a single cache line. That removes the most prominent drawback of earlier revisions, meaning I think this series is ready for merge now. Doing so will mean the later series making more substantive changes to the flow behaviour are simpler.
Tested on top of the patch updating shell prompt escape handling, but should be independent of it.
Changes since v2: * Added a patch to only use C11 static_assert(), not C23 static_assert() (needed for next change) [1/16] * Better handling of the bounds on valid values of enum flow_type [2/16] * No longer introduce an additional C23 style static_assert() [8/16] * Add fix for overly long guestfish commands (needed for next change) [13/16] * Added a patch supporting names for pifs [14/16] * Added patches with some further TCP reworks in preparation for the general flow table [15-16/16] Changes since v1: * Removed a inaccurate stale comment * Added doc comment to FLOW() macro * Added new patches cleaning up signedness of 'side' variables * Added new patches introducing "sidx"s (flow+side indices)
Applied. -- Stefano
participants (2)
-
David Gibson
-
Stefano Brivio