[PATCH v2 00/11] Introduce unified flow table, first steps
Here's my latest revision of some of the basics of the flow table. So far it's basically just a renaming of the existing TCP connection table, along with some associated helpers. It's used for some new logging infrastructure, but otherwise doesn't really function any differently. However, this subset of the flow table work no longer bloats flow/connection entries over a single cache line. That removes the most prominent drawback of earlier revisions, meaning I think this series is ready for merge now. Doing so will mean the later series making more substantive changes to the flow behaviour are simpler. Tested on top of the patch updating shell prompt escape handling, but should be independent of it. Changes since v1: * Removed a inaccurate stale comment * Added doc comment to FLOW() macro * Added new patches cleaning up signedness of 'side' variables * Added new patches introducing "sidx"s (flow+side indices) David Gibson (11): flow, tcp: Generalise connection types flow, tcp: Move TCP connection table to unified flow table flow, tcp: Consolidate flow pointer<->index helpers util: MAX_FROM_BITS() should be unsigned flow: Make unified version of flow table compaction flow, tcp: Add logging helpers for connection related messages flow: Introduce 'sidx' type to represent one side of one flow tcp: Remove unneccessary bounds check in tcp_timer_handler() flow,tcp: Generalise TCP epoll_ref to generic flows tcp_splice: Use unsigned to represent side flow,tcp: Use epoll_ref type including flow and side Makefile | 14 +-- flow.c | 84 ++++++++++++++++++ flow.h | 73 ++++++++++++++++ flow_table.h | 86 ++++++++++++++++++ passt.h | 13 ++- tcp.c | 243 ++++++++++++++++++++++++--------------------------- tcp.h | 5 -- tcp_conn.h | 46 +++------- tcp_splice.c | 128 ++++++++++++--------------- tcp_splice.h | 2 +- util.h | 2 +- 11 files changed, 440 insertions(+), 256 deletions(-) create mode 100644 flow.c create mode 100644 flow.h create mode 100644 flow_table.h -- 2.43.0
Currently TCP connections use a 1-bit selector, 'spliced', to determine the
rest of the contents of the structure. We want to generalise the TCP
connection table to other types of flows in other protocols. Make a start
on this by replacing the tcp_conn_common structure with a new flow_common
structure with an enum rather than a simple boolean indicating the type of
flow.
Signed-off-by: David Gibson
We want to generalise "connection" tracking to things other than true TCP
connections. Continue implenenting this by renaming the TCP connection
table to the "flow table" and moving it to flow.c. The definitions are
split between flow.h and flow_table.h - we need this separation to avoid
circular dependencies: the definitions in flow.h will be needed by many
headers using the flow mechanism, but flow_table.h needs all those protocol
specific headers in order to define the full flow table entry.
Signed-off-by: David Gibson
Both tcp.c and tcp_splice.c define CONN_IDX() variants to find the index
of their connection structures in the connection table, now become the
unified flow table. We can easily combine these into a common helper.
While we're there, add some trickery for some additional type safety.
They also define their own CONN() versions, which aren't so easily combined
since they need to return different types, but we can have them use a
common helper.
In the process, we standardise on always using an unsigned type to store
the connection / flow index, which makes more sense. tcp.c's conn_at_idx()
remains for now, but we change its parameter to unsigned to match. That in
turn means we can remove a check for negative values from it.
Signed-off-by: David Gibson
MAX_FROM_BITS() computes the maximum value representable in a number of
bits. The expression for that is an unsigned value, but we explicitly cast
it to a signed int. It looks like this is because one of the main users is
for FD_REF_MAX, which is used to bound fd values, typically stored as a
signed int.
The value MAX_FROM_BITS() is calculating is naturally non-negative, though,
so it makes more sense for it to be unsigned, and to move the case to the
definition of FD_REF_MAX.
Signed-off-by: David Gibson
tcp_table_compact() will move entries in the connection/flow table to keep
it compact when other entries are removed. The moved entries need not have
the same type as the flow removed, so it needs to be able to handle moving
any type of flow. Therefore, move it to flow.c rather than being
purportedly TCP specific.
Signed-off-by: David Gibson
Most of the messages logged by the TCP code (be they errors, debug or
trace messages) are related to a specific connection / flow. We're fairly
consistent about prefixing these with the type of connection and the
connection / flow index. However there are a few places where we put the
index later in the message or omit it entirely. The template with the
prefix is also a little bulky to carry around for every message,
particularly for spliced connections.
To help keep this consistent, introduce some helpers to log messages
linked to a specific flow. It takes the flow as a parameter and adds a
uniform prefix to each message. This makes things slightly neater now, but
more importantly will help keep formatting consistent as we add more things
to the flow table.
Signed-off-by: David Gibson
In a number of places, we use indices into the flow table to identify a
specific flow. We also have cases where we need to identify a particular
side of a particular flow, and we expect those to become more common as
we generalise the flow table to cover more things.
To assist with that, introduces flow_sidx_t, an index type which identifies
a specific side of a specific flow in the table.
Signed-off-by: David Gibson
On Mon, 27 Nov 2023 10:33:44 +1100
David Gibson
In a number of places, we use indices into the flow table to identify a specific flow. We also have cases where we need to identify a particular side of a particular flow, and we expect those to become more common as we generalise the flow table to cover more things.
To assist with that, introduces flow_sidx_t, an index type which identifies a specific side of a specific flow in the table.
Signed-off-by: David Gibson
--- flow.h | 13 +++++++++++++ flow_table.h | 36 ++++++++++++++++++++++++++++++++++++ 2 files changed, 49 insertions(+) diff --git a/flow.h b/flow.h index b6da516..3c90bbd 100644 --- a/flow.h +++ b/flow.h @@ -39,6 +39,19 @@ struct flow_common { #define FLOW_TABLE_PRESSURE 30 /* % of FLOW_MAX */ #define FLOW_FILE_PRESSURE 30 /* % of c->nofile */
+/** + * struct flow_sidx - ID for one side of a specific flow + * @side: Side referenced (0 or 1) + * @flow: Index of flow referenced + */ +typedef struct flow_sidx {
Implying my usual argument :) ...is there any advantage over using this simply as a struct?
+ int side :1; + unsigned flow :FLOW_INDEX_BITS; +} flow_sidx_t;
-- Stefano
On Wed, Nov 29, 2023 at 03:32:32PM +0100, Stefano Brivio wrote:
On Mon, 27 Nov 2023 10:33:44 +1100 David Gibson
wrote: In a number of places, we use indices into the flow table to identify a specific flow. We also have cases where we need to identify a particular side of a particular flow, and we expect those to become more common as we generalise the flow table to cover more things.
To assist with that, introduces flow_sidx_t, an index type which identifies a specific side of a specific flow in the table.
Signed-off-by: David Gibson
--- flow.h | 13 +++++++++++++ flow_table.h | 36 ++++++++++++++++++++++++++++++++++++ 2 files changed, 49 insertions(+) diff --git a/flow.h b/flow.h index b6da516..3c90bbd 100644 --- a/flow.h +++ b/flow.h @@ -39,6 +39,19 @@ struct flow_common { #define FLOW_TABLE_PRESSURE 30 /* % of FLOW_MAX */ #define FLOW_FILE_PRESSURE 30 /* % of c->nofile */
+/** + * struct flow_sidx - ID for one side of a specific flow + * @side: Side referenced (0 or 1) + * @flow: Index of flow referenced + */ +typedef struct flow_sidx {
Implying my usual argument :) ...is there any advantage over using this simply as a struct?
So, usually I too would prefer to use a struct as a struct, without a typedef. The reason I'm doing differently here, is that I want to emphasise that for many purposes this can be treated like an index, in particular that it's small and trivially copyable. In particular it should be passed by value, passing by reference would be silly. That's kind of the opposite of what one tends to be conveying by reminding users that they're working with a struct.
+ int side :1; + unsigned flow :FLOW_INDEX_BITS; +} flow_sidx_t;
-- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
On Thu, 30 Nov 2023 11:37:40 +1100
David Gibson
On Wed, Nov 29, 2023 at 03:32:32PM +0100, Stefano Brivio wrote:
On Mon, 27 Nov 2023 10:33:44 +1100 David Gibson
wrote: In a number of places, we use indices into the flow table to identify a specific flow. We also have cases where we need to identify a particular side of a particular flow, and we expect those to become more common as we generalise the flow table to cover more things.
To assist with that, introduces flow_sidx_t, an index type which identifies a specific side of a specific flow in the table.
Signed-off-by: David Gibson
--- flow.h | 13 +++++++++++++ flow_table.h | 36 ++++++++++++++++++++++++++++++++++++ 2 files changed, 49 insertions(+) diff --git a/flow.h b/flow.h index b6da516..3c90bbd 100644 --- a/flow.h +++ b/flow.h @@ -39,6 +39,19 @@ struct flow_common { #define FLOW_TABLE_PRESSURE 30 /* % of FLOW_MAX */ #define FLOW_FILE_PRESSURE 30 /* % of c->nofile */
+/** + * struct flow_sidx - ID for one side of a specific flow + * @side: Side referenced (0 or 1) + * @flow: Index of flow referenced + */ +typedef struct flow_sidx {
Implying my usual argument :) ...is there any advantage over using this simply as a struct?
So, usually I too would prefer to use a struct as a struct, without a typedef. The reason I'm doing differently here, is that I want to emphasise that for many purposes this can be treated like an index, in particular that it's small and trivially copyable. In particular it should be passed by value, passing by reference would be silly.
Hmm, that was exactly my "not hiding" point though. The day somebody adds here: char mood[RLIMIT_STACK_VAL + 1]; /* list of side emojis */ the typedef makes it still apparently okay to pass by value. If it's a struct, one surely has to check first.
That's kind of the opposite of what one tends to be conveying by reminding users that they're working with a struct.
I see, but it's probably a matter of taste (passing structs by value doesn't personally make me nervous). -- Stefano
On Thu, Nov 30, 2023 at 10:21:16AM +0100, Stefano Brivio wrote:
On Thu, 30 Nov 2023 11:37:40 +1100 David Gibson
wrote: On Wed, Nov 29, 2023 at 03:32:32PM +0100, Stefano Brivio wrote:
On Mon, 27 Nov 2023 10:33:44 +1100 David Gibson
wrote: In a number of places, we use indices into the flow table to identify a specific flow. We also have cases where we need to identify a particular side of a particular flow, and we expect those to become more common as we generalise the flow table to cover more things.
To assist with that, introduces flow_sidx_t, an index type which identifies a specific side of a specific flow in the table.
Signed-off-by: David Gibson
--- flow.h | 13 +++++++++++++ flow_table.h | 36 ++++++++++++++++++++++++++++++++++++ 2 files changed, 49 insertions(+) diff --git a/flow.h b/flow.h index b6da516..3c90bbd 100644 --- a/flow.h +++ b/flow.h @@ -39,6 +39,19 @@ struct flow_common { #define FLOW_TABLE_PRESSURE 30 /* % of FLOW_MAX */ #define FLOW_FILE_PRESSURE 30 /* % of c->nofile */
+/** + * struct flow_sidx - ID for one side of a specific flow + * @side: Side referenced (0 or 1) + * @flow: Index of flow referenced + */ +typedef struct flow_sidx {
Implying my usual argument :) ...is there any advantage over using this simply as a struct?
So, usually I too would prefer to use a struct as a struct, without a typedef. The reason I'm doing differently here, is that I want to emphasise that for many purposes this can be treated like an index, in particular that it's small and trivially copyable. In particular it should be passed by value, passing by reference would be silly.
Hmm, that was exactly my "not hiding" point though. The day somebody adds here:
char mood[RLIMIT_STACK_VAL + 1]; /* list of side emojis */
the typedef makes it still apparently okay to pass by value. If it's a struct, one surely has to check first.
Right.. and that's kind of the point. If someone adds that, this struct is no longer doing the job intended for it, and a wider redesign is needed. That's why there's also the static_assert() verifying that flow_sidx_t fits in a u32.
That's kind of the opposite of what one tends to be conveying by reminding users that they're working with a struct.
I see, but it's probably a matter of taste (passing structs by value doesn't personally make me nervous).
-- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
In tcp_timer_handler() we use conn_at_idx() to interpret the flow index
from the epoll reference. However, this will never be NULL - we always
put a valid index into the epoll_ref. Simplify slightly based on this.
Signed-off-by: David Gibson
On Mon, 27 Nov 2023 10:33:45 +1100
David Gibson
In tcp_timer_handler() we use conn_at_idx() to interpret the flow index from the epoll reference. However, this will never be NULL - we always put a valid index into the epoll_ref. Simplify slightly based on this.
Sorry, I missed this during review of v1. I have mixed feeling about this, and I don't think 11/11 changes anything in this regard: we have to trust the kernel, as there's no benefit to security in not doing so. At the same time, should we ever get an out-of-bounds index from the epoll event, we can fail gracefully here. Slightly worse, however: if we get a timer event for a connection that's already closed, we'll happily go and try to manipulate its timer (with or without the !conn check). All in all, I think the check is minimally useful, and we should have something more robust than that. So if this patch helps keeping things simple later in the series, I don't see an issue with that, but perhaps we should add back a more sensible set of checks later. The next patches all look good to me. -- Stefano
On Wed, Nov 29, 2023 at 03:32:39PM +0100, Stefano Brivio wrote:
On Mon, 27 Nov 2023 10:33:45 +1100 David Gibson
wrote: In tcp_timer_handler() we use conn_at_idx() to interpret the flow index from the epoll reference. However, this will never be NULL - we always put a valid index into the epoll_ref. Simplify slightly based on this.
Sorry, I missed this during review of v1.
I have mixed feeling about this, and I don't think 11/11 changes anything in this regard: we have to trust the kernel, as there's no benefit to security in not doing so.
At the same time, should we ever get an out-of-bounds index from the epoll event, we can fail gracefully here. Slightly worse, however: if we get a timer event for a connection that's already closed, we'll happily go and try to manipulate its timer (with or without the !conn check).
So, as you note this only verifies that the index is theoretically possible. It doesn't check that it's a valid index in the current size of the connection table, it doesn't check if the connection is already closed and it can't check if it's a stale index for a different connection than the one originally intended, because the table has been compacted. Given all those potential failure modes, I don't see a lot of value in checking if the index is wildly out of bounds, which would require a kernel bug rather more extreme than those other possibilies.
All in all, I think the check is minimally useful, and we should have something more robust than that. So if this patch helps keeping things simple later in the series, I don't see an issue with that, but perhaps we should add back a more sensible set of checks later.
Perhaps, yes.
The next patches all look good to me.
-- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
TCP uses three different epoll object types: one for connected sockets, one
for timers and one for listening sockets. Listening sockets really need
information that's specific to TCP, so need their own epoll_ref field.
Timers and connected sockets, however, only need the connection (flow)
they're associated with. As we expand the use of the flow table, we expect
that to be true for more epoll fds. So, rename the "TCP" epoll_ref field
to be a "flow" epoll_ref field that can be used both for TCP and for other
future cases.
Signed-off-by: David Gibson
Currently, we use 'int' values to represent the "side" of a connection,
which must always be 0 or 1. This turns out to be dangerous.
In some cases we're going to want to put the side into a 1-bit bitfield.
However, if that bitfield has type 'int', when we copy it out to a regular
'int' variable, it will be sign-extended and so have values 0 and -1,
instead of 0 and 1.
To avoid this, always use unsigned variables for the side.
Signed-off-by: David Gibson
Currently TCP uses the 'flow' epoll_ref field for both connected
sockets and timers, which consists of just the index of the relevant
flow (connection).
This is just fine for timers, for while it obviously works, it's
subtly incomplete for sockets on spliced connections. In that case we
want to know which side of the connection the event is occurring on as
well as which connection. At present, we deduce that information by
looking at the actual fd, and comparing it to the fds of the sockets
on each side.
When we use the flow table for more things, we expect more cases where
something will need to know a specific side of a specific flow for an
event, but nothing more.
Therefore add a new 'flowside' epoll_ref field, with exactly that
information. We use it for TCP connected sockets. This allows us to
directly know the side for spliced connections. For "tap"
connections, it's pretty meaningless, since the side is always the
socket side. It still makes logical sense though, and it may become
important for future flow table work.
Signed-off-by: David Gibson
participants (2)
-
David Gibson
-
Stefano Brivio