I now have an in-progress draft of a unified hash table to go with the
unified flow table.  This turns out to be easier if we first make some
preliminary changes to the structure of the TCP hash table.  So, here
are those.

Changes since v1:
 * Use while loops instead of some equivalent, but hard to read, for
   loops for the hash probing.
 * Switch from probing forwards through hash buckets to probing
   backwards.  This makes the code closer to the version in Knuth it's
   based on, and thus easier to see if we've made a mistake in
   adaptation.
 * Improve the helpers for modular arithmetic in use
 * Correct an error where we had things exactly the wrong way around
   when finding entries to move during removal.
 * Add a patch fixing a conceptual / documentation problem in some
   adjacent code

David Gibson (4):
  tcp: Fix conceptually incorrect byte-order switch in tcp_tap_handler()
  tcp: Switch hash table to linear probing instead of chaining
  tcp: Implement hash table with indices rather than pointers
  tcp: Don't account for hash table size in tcp_hash()

 flow.h     |  11 +++++
 tcp.c      | 143 ++++++++++++++++++++++++++++----------------------
 tcp_conn.h |   2 -
 util.h     |  28 +++++++++++
 4 files changed, 114 insertions(+), 70 deletions(-)

-- 
2.43.0
tcp_hash_lookup() expects the port numbers in host order, but the TCP
header, of course, has them in network order, so we need to switch them.
However we call htons() (host to network) instead of ntohs() (network to
host).  This works because those do the same thing in practice (they only
wouldn't on very strange theoretical platforms which are neither big nor
little endian).  But, having this the "wrong" way around is misleading,
so switch it around.

Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>
---
 tcp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tcp.c b/tcp.c
index f506cfd..bf58938 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2532,7 +2532,7 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, int af,
 	optlen = MIN(optlen, ((1UL << 4) /* from doff width */ - 6) * 4UL);
 	opts = packet_get(p, idx, sizeof(*th), optlen, NULL);
 
-	conn = tcp_hash_lookup(c, af, daddr, htons(th->source), htons(th->dest));
+	conn = tcp_hash_lookup(c, af, daddr, ntohs(th->source), ntohs(th->dest));
 
 	/* New connection from tap */
 	if (!conn) {
-- 
2.43.0
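As an aside, here is a minimal sketch (not part of the patch, and not
passt code) of why the two calls coincide in practice: on a big-endian
host both are the identity, on a little-endian host both are the same
two-byte swap, and a byte swap is its own inverse.  The helper names and
the __BYTE_ORDER__ test below are assumptions made for the illustration.

#include <stdint.h>

/* Hypothetical stand-ins for htons()/ntohs(), for illustration only */
static inline uint16_t ex_bswap16(uint16_t v)
{
	return (uint16_t)((v >> 8) | (v << 8));
}

static inline uint16_t ex_hton16(uint16_t v)
{
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
	return ex_bswap16(v);	/* swap on little-endian hosts */
#else
	return v;		/* identity on big-endian hosts */
#endif
}

/* "Network to host" has exactly the same body, so htons() and ntohs()
 * are interchangeable in practice -- but naming the intended direction
 * documents what the code means, which is the point of the patch.
 */
static inline uint16_t ex_ntoh16(uint16_t v)
{
	return ex_hton16(v);
}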
Currently we deal with hash collisions by letting a hash bucket contain
multiple entries, forming a linked list using an index in the connection
structure.

That's a pretty standard and simple approach, but in our case we can use
an even simpler one: linear probing.  Here if a hash bucket is occupied
we just move onto the next one until we find a free one.  This slightly
simplifies lookup and more importantly saves some precious bytes in the
connection structure by removing the need for a link.  It does require
some additional complexity for hash removal.

This approach can perform poorly when hash table load is high.  However,
we already size our hash table of pointers larger than the connection
table, which puts an upper bound on the load.  It's relatively cheap to
decrease that bound if we find we need to.

I adapted the linear probing operations from Knuth's The Art of Computer
Programming, Volume 3, 2nd Edition.  Specifically Algorithm L and
Algorithm R in Section 6.4.  Note that there is an error in Algorithm R
as printed, see errata at [0].

[0] https://www-cs-faculty.stanford.edu/~knuth/all3-prepre.ps.gz

Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>
---
 tcp.c      | 107 ++++++++++++++++++++++++++---------------------------
 tcp_conn.h |   2 -
 util.h     |  28 ++++++++++++++
 3 files changed, 81 insertions(+), 56 deletions(-)

diff --git a/tcp.c b/tcp.c
index bf58938..cd8e64e 100644
--- a/tcp.c
+++ b/tcp.c
@@ -573,22 +573,12 @@ static unsigned int tcp6_l2_flags_buf_used;
 
 #define CONN(idx)		(&(FLOW(idx)->tcp))
 
-/** conn_at_idx() - Find a connection by index, if present
- * @idx:	Index of connection to lookup
- *
- * Return: pointer to connection, or NULL if @idx is out of bounds
- */
-static inline struct tcp_tap_conn *conn_at_idx(unsigned idx)
-{
-	if (idx >= FLOW_MAX)
-		return NULL;
-	ASSERT(CONN(idx)->f.type == FLOW_TCP);
-	return CONN(idx);
-}
-
 /* Table for lookup from remote address, local port, remote port */
 static struct tcp_tap_conn *tc_hash[TCP_HASH_TABLE_SIZE];
 
+static_assert(ARRAY_SIZE(tc_hash) >= FLOW_MAX,
+	      "Safe linear probing requires hash table larger than connection table");
+
 /* Pools for pre-opened sockets (in init) */
 int init_sock_pool4		[TCP_SOCK_POOL_SIZE];
 int init_sock_pool6		[TCP_SOCK_POOL_SIZE];
@@ -1196,6 +1186,26 @@ static unsigned int tcp_conn_hash(const struct ctx *c,
 	return tcp_hash(c, &conn->faddr, conn->eport, conn->fport);
 }
 
+/**
+ * tcp_hash_probe() - Find hash bucket for a connection
+ * @c:		Execution context
+ * @conn:	Connection to find bucket for
+ *
+ * Return: If @conn is in the table, its current bucket, otherwise a suitable
+ *         free bucket for it.
+ */
+static inline unsigned tcp_hash_probe(const struct ctx *c,
+				      const struct tcp_tap_conn *conn)
+{
+	unsigned b = tcp_conn_hash(c, conn);
+
+	/* Linear probing */
+	while (tc_hash[b] && tc_hash[b] != conn)
+		b = mod_sub(b, 1, TCP_HASH_TABLE_SIZE);
+
+	return b;
+}
+
 /**
  * tcp_hash_insert() - Insert connection into hash table, chain link
  * @c:		Execution context
@@ -1203,14 +1213,10 @@ static unsigned int tcp_conn_hash(const struct ctx *c,
  */
 static void tcp_hash_insert(const struct ctx *c, struct tcp_tap_conn *conn)
 {
-	int b;
+	unsigned b = tcp_hash_probe(c, conn);
 
-	b = tcp_hash(c, &conn->faddr, conn->eport, conn->fport);
-	conn->next_index = tc_hash[b] ? FLOW_IDX(tc_hash[b]) : -1U;
 	tc_hash[b] = conn;
-
-	flow_dbg(conn, "hash table insert: sock %i, bucket: %i, next: %p",
-		 conn->sock, b, (void *)conn_at_idx(conn->next_index));
+	flow_dbg(conn, "hash table insert: sock %i, bucket: %u", conn->sock, b);
 }
 
 /**
@@ -1221,23 +1227,27 @@ static void tcp_hash_insert(const struct ctx *c, struct tcp_tap_conn *conn)
 static void tcp_hash_remove(const struct ctx *c,
 			    const struct tcp_tap_conn *conn)
 {
-	struct tcp_tap_conn *entry, *prev = NULL;
-	int b = tcp_conn_hash(c, conn);
+	unsigned b = tcp_hash_probe(c, conn), s;
 
-	for (entry = tc_hash[b]; entry;
-	     prev = entry, entry = conn_at_idx(entry->next_index)) {
-		if (entry == conn) {
-			if (prev)
-				prev->next_index = conn->next_index;
-			else
-				tc_hash[b] = conn_at_idx(conn->next_index);
-			break;
+	if (!tc_hash[b])
+		return; /* Redundant remove */
+
+	flow_dbg(conn, "hash table remove: sock %i, bucket: %u", conn->sock, b);
+
+	/* Scan the remainder of the cluster */
+	for (s = mod_sub(b, 1, TCP_HASH_TABLE_SIZE); tc_hash[s];
+	     s = mod_sub(s, 1, TCP_HASH_TABLE_SIZE)) {
+		unsigned h = tcp_conn_hash(c, tc_hash[s]);
+
+		if (!mod_between(h, s, b, TCP_HASH_TABLE_SIZE)) {
+			/* tc_hash[s] can live in tc_hash[b]'s slot */
+			debug("hash table remove: shuffle %u -> %u", s, b);
+			tc_hash[b] = tc_hash[s];
+			b = s;
 		}
 	}
 
-	flow_dbg(conn, "hash table remove: sock %i, bucket: %i, new: %p",
-		 conn->sock, b,
-		 (void *)(prev ? conn_at_idx(prev->next_index) : tc_hash[b]));
+	tc_hash[b] = NULL;
 }
 
 /**
@@ -1250,24 +1260,15 @@ void tcp_tap_conn_update(const struct ctx *c, struct tcp_tap_conn *old,
 			 struct tcp_tap_conn *new)
 {
-	struct tcp_tap_conn *entry, *prev = NULL;
-	int b = tcp_conn_hash(c, old);
+	unsigned b = tcp_hash_probe(c, old);
 
-	for (entry = tc_hash[b]; entry;
-	     prev = entry, entry = conn_at_idx(entry->next_index)) {
-		if (entry == old) {
-			if (prev)
-				prev->next_index = FLOW_IDX(new);
-			else
-				tc_hash[b] = new;
-			break;
-		}
-	}
+	if (!tc_hash[b])
+		return; /* Not in hash table, nothing to update */
+
+	tc_hash[b] = new;
 
 	debug("TCP: hash table update: old index %u, new index %u, sock %i, "
-	      "bucket: %i, old: %p, new: %p",
-	      FLOW_IDX(old), FLOW_IDX(new), new->sock, b,
-	      (void *)old, (void *)new);
+	      "bucket: %u", FLOW_IDX(old), FLOW_IDX(new), new->sock, b);
 
 	tcp_epoll_ctl(c, new);
 }
@@ -1287,17 +1288,15 @@ static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c,
 					    in_port_t eport, in_port_t fport)
 {
 	union inany_addr aany;
-	struct tcp_tap_conn *conn;
-	int b;
+	unsigned b;
 
 	inany_from_af(&aany, af, faddr);
+	b = tcp_hash(c, &aany, eport, fport);
 
-	for (conn = tc_hash[b]; conn; conn = conn_at_idx(conn->next_index)) {
-		if (tcp_hash_match(conn, &aany, eport, fport))
-			return conn;
-	}
+	while (tc_hash[b] && !tcp_hash_match(tc_hash[b], &aany, eport, fport))
+		b = mod_sub(b, 1, TCP_HASH_TABLE_SIZE);
 
-	return NULL;
+	return tc_hash[b];
 }
 
 /**
diff --git a/tcp_conn.h b/tcp_conn.h
index 3900305..e3400bb 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -13,7 +13,6 @@
  * struct tcp_tap_conn - Descriptor for a TCP connection (not spliced)
  * @f:			Generic flow information
  * @in_epoll:		Is the connection in the epoll set?
- * @next_index:		Connection index of next item in hash chain, -1 for none
  * @tap_mss:		MSS advertised by tap/guest, rounded to 2 ^ TCP_MSS_BITS
  * @sock:		Socket descriptor number
  * @events:		Connection events, implying connection states
@@ -40,7 +39,6 @@ struct tcp_tap_conn {
 	struct flow_common f;
 
 	bool in_epoll	:1;
-	unsigned next_index	:FLOW_INDEX_BITS + 2;
 
 #define TCP_RETRANS_BITS 3
 	unsigned int retrans	:TCP_RETRANS_BITS;
diff --git a/util.h b/util.h
index 53bb54b..9446ea7 100644
--- a/util.h
+++ b/util.h
@@ -227,6 +227,34 @@ int __daemon(int pidfile_fd, int devnull_fd);
 int fls(unsigned long x);
 int write_file(const char *path, const char *buf);
 
+/**
+ * mod_sub() - Modular arithmetic subtraction
+ * @a:		Minuend, unsigned value < @m
+ * @b:		Subtrahend, unsigned value < @m
+ * @m:		Modulus, must be less than (UINT_MAX / 2)
+ *
+ * Returns (@a - @b) mod @m, correctly handling unsigned underflows.
+ */
+static inline unsigned mod_sub(unsigned a, unsigned b, unsigned m)
+{
+	if (a < b)
+		a += m;
+	return a - b;
+}
+
+/**
+ * mod_between() - Determine if a value is in a cyclic range
+ * @x, @i, @j:	Unsigned values < @m
+ * @m:		Modulus
+ *
+ * Returns true iff @x is in the cyclic range of values from @i..@j (mod @m),
+ * inclusive of @i, exclusive of @j.
+ */
+static inline bool mod_between(unsigned x, unsigned i, unsigned j, unsigned m)
+{
+	return mod_sub(x, i, m) < mod_sub(j, i, m);
+}
+
 /*
  * Workarounds for https://github.com/llvm/llvm-project/issues/58992
  *
-- 
2.43.0
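To see the Algorithm L / Algorithm R adaptation in isolation, here is a
small self-contained sketch (not part of the patch) of the same
backward-probing scheme on a toy table of integers, reusing mod_sub() and
mod_between() as defined in util.h above.  The table size, toy_hash() and
the use of 0 as the empty marker are assumptions made for the
illustration; they stand in for TCP_HASH_TABLE_SIZE, tcp_conn_hash() and
NULL bucket entries.

#include <stdbool.h>
#include <stdio.h>

#define TOY_SIZE 8			/* stand-in for TCP_HASH_TABLE_SIZE */

static int toy[TOY_SIZE];		/* 0 means "empty bucket"; keys are nonzero */

static unsigned mod_sub(unsigned a, unsigned b, unsigned m)
{
	if (a < b)
		a += m;
	return a - b;
}

static bool mod_between(unsigned x, unsigned i, unsigned j, unsigned m)
{
	return mod_sub(x, i, m) < mod_sub(j, i, m);
}

static unsigned toy_hash(int key)	/* stand-in for tcp_conn_hash() */
{
	return (unsigned)key % TOY_SIZE;
}

/* Walk backwards from the natural bucket until we find @key or a hole */
static unsigned toy_probe(int key)
{
	unsigned b = toy_hash(key);

	while (toy[b] && toy[b] != key)
		b = mod_sub(b, 1, TOY_SIZE);
	return b;
}

static void toy_insert(int key)		/* Knuth 6.4, Algorithm L */
{
	toy[toy_probe(key)] = key;
}

static void toy_remove(int key)		/* Knuth 6.4, Algorithm R */
{
	unsigned b = toy_probe(key), s;

	if (!toy[b])
		return;

	/* Scan the rest of the cluster below the hole at @b.  An entry at @s
	 * may fill the hole unless its natural bucket lies in the cyclic
	 * range [s, b), in which case moving it would break its probe chain.
	 */
	for (s = mod_sub(b, 1, TOY_SIZE); toy[s]; s = mod_sub(s, 1, TOY_SIZE)) {
		unsigned h = toy_hash(toy[s]);

		if (!mod_between(h, s, b, TOY_SIZE)) {
			toy[b] = toy[s];
			b = s;
		}
	}
	toy[b] = 0;
}

int main(void)
{
	toy_insert(9);	/* natural bucket 1 */
	toy_insert(17);	/* also hashes to 1, probes back to bucket 0 */
	toy_insert(25);	/* also 1, wraps around to bucket 7 */

	toy_remove(9);	/* 17 is shuffled up into bucket 1, 25 into bucket 0 */
	printf("lookup 17 -> bucket %u\n", toy_probe(17));
	printf("lookup 25 -> bucket %u\n", toy_probe(25));
	return 0;
}

After the removal both remaining keys are still reachable by probing
backwards from bucket 1 without crossing an empty slot, which is exactly
the invariant the shuffle loop in tcp_hash_remove() preserves.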
We implement our hash table with pointers to the entry for each bucket (or
NULL).  However, the entries are always allocated within the flow table,
meaning that a flow index will suffice, halving the size of the hash table.

For TCP, just a flow index would be enough, but future uses will want to
expand the hash table to cover indexing either side of a flow, so use a
flow_sidx_t as the type for each hash bucket.

Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>
---
 flow.h | 11 +++++++++++
 tcp.c  | 33 ++++++++++++++++++++++-----------
 2 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/flow.h b/flow.h
index c2a5190..959b461 100644
--- a/flow.h
+++ b/flow.h
@@ -53,6 +53,17 @@ static_assert(sizeof(flow_sidx_t) <= sizeof(uint32_t),
 
 #define FLOW_SIDX_NONE ((flow_sidx_t){ .flow = FLOW_MAX })
 
+/**
+ * flow_sidx_eq() - Test if two sidx values are equal
+ * @a, @b:	sidx values
+ *
+ * Return: true iff @a and @b refer to the same side of the same flow
+ */
+static inline bool flow_sidx_eq(flow_sidx_t a, flow_sidx_t b)
+{
+	return (a.flow == b.flow) && (a.side == b.side);
+}
+
 union flow;
 
 void flow_table_compact(struct ctx *c, union flow *hole);
diff --git a/tcp.c b/tcp.c
index cd8e64e..8dc9f31 100644
--- a/tcp.c
+++ b/tcp.c
@@ -574,7 +574,7 @@ static unsigned int tcp6_l2_flags_buf_used;
 #define CONN(idx)		(&(FLOW(idx)->tcp))
 
 /* Table for lookup from remote address, local port, remote port */
-static struct tcp_tap_conn *tc_hash[TCP_HASH_TABLE_SIZE];
+static flow_sidx_t tc_hash[TCP_HASH_TABLE_SIZE];
 
 static_assert(ARRAY_SIZE(tc_hash) >= FLOW_MAX,
 	      "Safe linear probing requires hash table larger than connection table");
@@ -1197,10 +1197,12 @@ static unsigned int tcp_conn_hash(const struct ctx *c,
 static inline unsigned tcp_hash_probe(const struct ctx *c,
 				      const struct tcp_tap_conn *conn)
 {
+	flow_sidx_t sidx = FLOW_SIDX(conn, TAPSIDE);
 	unsigned b = tcp_conn_hash(c, conn);
 
 	/* Linear probing */
-	while (tc_hash[b] && tc_hash[b] != conn)
+	while (!flow_sidx_eq(tc_hash[b], FLOW_SIDX_NONE) &&
+	       !flow_sidx_eq(tc_hash[b], sidx))
 		b = mod_sub(b, 1, TCP_HASH_TABLE_SIZE);
 
 	return b;
@@ -1215,7 +1217,7 @@ static void tcp_hash_insert(const struct ctx *c, struct tcp_tap_conn *conn)
 {
 	unsigned b = tcp_hash_probe(c, conn);
 
-	tc_hash[b] = conn;
+	tc_hash[b] = FLOW_SIDX(conn, TAPSIDE);
 	flow_dbg(conn, "hash table insert: sock %i, bucket: %u", conn->sock, b);
 }
 
@@ -1228,16 +1230,18 @@ static void tcp_hash_remove(const struct ctx *c,
 			    const struct tcp_tap_conn *conn)
 {
 	unsigned b = tcp_hash_probe(c, conn), s;
+	union flow *flow = flow_at_sidx(tc_hash[b]);
 
-	if (!tc_hash[b])
+	if (!flow)
 		return; /* Redundant remove */
 
 	flow_dbg(conn, "hash table remove: sock %i, bucket: %u", conn->sock, b);
 
 	/* Scan the remainder of the cluster */
-	for (s = mod_sub(b, 1, TCP_HASH_TABLE_SIZE); tc_hash[s];
+	for (s = mod_sub(b, 1, TCP_HASH_TABLE_SIZE);
+	     (flow = flow_at_sidx(tc_hash[s]));
 	     s = mod_sub(s, 1, TCP_HASH_TABLE_SIZE)) {
-		unsigned h = tcp_conn_hash(c, tc_hash[s]);
+		unsigned h = tcp_conn_hash(c, &flow->tcp);
 
 		if (!mod_between(h, s, b, TCP_HASH_TABLE_SIZE)) {
 			/* tc_hash[s] can live in tc_hash[b]'s slot */
@@ -1247,7 +1251,7 @@ static void tcp_hash_remove(const struct ctx *c,
 		}
 	}
 
-	tc_hash[b] = NULL;
+	tc_hash[b] = FLOW_SIDX_NONE;
 }
 
 /**
@@ -1262,10 +1266,10 @@ void tcp_tap_conn_update(const struct ctx *c, struct tcp_tap_conn *old,
 {
 	unsigned b = tcp_hash_probe(c, old);
 
-	if (!tc_hash[b])
+	if (!flow_at_sidx(tc_hash[b]))
 		return; /* Not in hash table, nothing to update */
 
-	tc_hash[b] = new;
+	tc_hash[b] = FLOW_SIDX(new, TAPSIDE);
 
 	debug("TCP: hash table update: old index %u, new index %u, sock %i, "
 	      "bucket: %u", FLOW_IDX(old), FLOW_IDX(new), new->sock, b);
@@ -1288,15 +1292,17 @@ static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c,
 					    in_port_t eport, in_port_t fport)
 {
 	union inany_addr aany;
+	union flow *flow;
 	unsigned b;
 
 	inany_from_af(&aany, af, faddr);
 	b = tcp_hash(c, &aany, eport, fport);
 
-	while (tc_hash[b] && !tcp_hash_match(tc_hash[b], &aany, eport, fport))
+	while ((flow = flow_at_sidx(tc_hash[b])) &&
+	       !tcp_hash_match(&flow->tcp, &aany, eport, fport))
 		b = mod_sub(b, 1, TCP_HASH_TABLE_SIZE);
 
-	return tc_hash[b];
+	return &flow->tcp;
 }
 
 /**
@@ -3087,6 +3093,11 @@ static void tcp_sock_refill_init(const struct ctx *c)
  */
 int tcp_init(struct ctx *c)
 {
+	unsigned b;
+
+	for (b = 0; b < TCP_HASH_TABLE_SIZE; b++)
+		tc_hash[b] = FLOW_SIDX_NONE;
+
 	if (c->ifi4)
 		tcp_sock4_iov_init(c);
-- 
2.43.0
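For readers unfamiliar with flow_sidx_t, the shape of the change can be
sketched as below.  The field names and widths here are illustrative
assumptions, not passt's real definitions (see flow.h for those); the
point is that a {flow index, side} pair fits in 32 bits, halving each
bucket relative to a 64-bit pointer, and that an all-zero entry is a
valid reference, which is why tcp_init() above must explicitly fill the
table with FLOW_SIDX_NONE rather than relying on zeroed BSS.

#include <stdbool.h>
#include <stdint.h>

#define EX_FLOW_MAX	(128 * 1024)	/* hypothetical stand-in for FLOW_MAX */

/* Illustrative shape of an index-plus-side bucket entry */
typedef struct ex_sidx {
	uint32_t side : 1;		/* which side of the flow */
	uint32_t flow : 31;		/* index into the flow table */
} ex_sidx_t;

/* An out-of-range flow index is the "empty bucket" sentinel.  Unlike a
 * NULL pointer it is not all-zeroes: a freshly zeroed table would be
 * full of valid-looking references to flow 0, side 0.
 */
#define EX_SIDX_NONE	((ex_sidx_t){ .flow = EX_FLOW_MAX })

static inline bool ex_sidx_eq(ex_sidx_t a, ex_sidx_t b)
{
	return a.flow == b.flow && a.side == b.side;
}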
Currently tcp_hash() returns the hash bucket for a value, that is the
hash modulo the size of the hash table.  Usually it's a bit more flexible
to have hash functions return a "raw" hash value and perform the modulus
in the callers.  That allows the same hash function to be used for
multiple tables of different sizes, or to re-use the hash for other
purposes.

We don't do anything like that with tcp_hash() at present, but we have
some plans to do so.  Prepare for that by making tcp_hash() and
tcp_conn_hash() return raw hash values.

Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>
---
 tcp.c | 23 ++++++++++-------------
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/tcp.c b/tcp.c
index 8dc9f31..63b39e0 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1159,18 +1159,15 @@ static int tcp_hash_match(const struct tcp_tap_conn *conn,
  * @eport:	Guest side endpoint port
  * @fport:	Guest side forwarding port
  *
- * Return: hash value, already modulo size of the hash table
+ * Return: hash value, needs to be adjusted for table size
  */
-static unsigned int tcp_hash(const struct ctx *c, const union inany_addr *faddr,
-			     in_port_t eport, in_port_t fport)
+static uint64_t tcp_hash(const struct ctx *c, const union inany_addr *faddr,
+			 in_port_t eport, in_port_t fport)
 {
 	struct siphash_state state = SIPHASH_INIT(c->hash_secret);
-	uint64_t hash;
 
 	inany_siphash_feed(&state, faddr);
-	hash = siphash_final(&state, 20, (uint64_t)eport << 16 | fport);
-
-	return (unsigned int)(hash % TCP_HASH_TABLE_SIZE);
+	return siphash_final(&state, 20, (uint64_t)eport << 16 | fport);
 }
 
 /**
@@ -1178,10 +1175,10 @@ static unsigned int tcp_hash(const struct ctx *c, const union inany_addr *faddr,
  * @c:		Execution context
  * @conn:	Connection
  *
- * Return: hash value, already modulo size of the hash table
+ * Return: hash value, needs to be adjusted for table size
  */
-static unsigned int tcp_conn_hash(const struct ctx *c,
-				  const struct tcp_tap_conn *conn)
+static uint64_t tcp_conn_hash(const struct ctx *c,
+			      const struct tcp_tap_conn *conn)
 {
 	return tcp_hash(c, &conn->faddr, conn->eport, conn->fport);
 }
@@ -1198,7 +1195,7 @@ static inline unsigned tcp_hash_probe(const struct ctx *c,
 				      const struct tcp_tap_conn *conn)
 {
 	flow_sidx_t sidx = FLOW_SIDX(conn, TAPSIDE);
-	unsigned b = tcp_conn_hash(c, conn);
+	unsigned b = tcp_conn_hash(c, conn) % TCP_HASH_TABLE_SIZE;
 
 	/* Linear probing */
 	while (!flow_sidx_eq(tc_hash[b], FLOW_SIDX_NONE) &&
@@ -1241,7 +1238,7 @@ static void tcp_hash_remove(const struct ctx *c,
 	for (s = mod_sub(b, 1, TCP_HASH_TABLE_SIZE);
 	     (flow = flow_at_sidx(tc_hash[s]));
 	     s = mod_sub(s, 1, TCP_HASH_TABLE_SIZE)) {
-		unsigned h = tcp_conn_hash(c, &flow->tcp);
+		unsigned h = tcp_conn_hash(c, &flow->tcp) % TCP_HASH_TABLE_SIZE;
 
 		if (!mod_between(h, s, b, TCP_HASH_TABLE_SIZE)) {
 			/* tc_hash[s] can live in tc_hash[b]'s slot */
@@ -1297,7 +1294,7 @@ static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c,
 
 	inany_from_af(&aany, af, faddr);
 
-	b = tcp_hash(c, &aany, eport, fport);
+	b = tcp_hash(c, &aany, eport, fport) % TCP_HASH_TABLE_SIZE;
 	while ((flow = flow_at_sidx(tc_hash[b])) &&
 	       !tcp_hash_match(&flow->tcp, &aany, eport, fport))
 		b = mod_sub(b, 1, TCP_HASH_TABLE_SIZE);
-- 
2.43.0
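A small sketch of the pattern this enables (illustrative only; raw_hash()
and the table sizes below are hypothetical stand-ins, not passt code): a
single raw 64-bit hash value, reduced independently for tables of
different sizes by taking the modulus at each call site.

#include <stdint.h>
#include <stdio.h>

#define SMALL_SIZE	2048U		/* e.g. a per-protocol table */
#define BIG_SIZE	65536U		/* e.g. a future unified flow table */

/* Hypothetical stand-in for tcp_hash(): any function returning a raw
 * 64-bit hash will do for the illustration.
 */
static uint64_t raw_hash(uint64_t key)
{
	return key * 0x9e3779b97f4a7c15ULL;
}

int main(void)
{
	uint64_t raw = raw_hash(42);

	/* The modulus moves to the caller, so one hash serves both tables */
	printf("small bucket: %u\n", (unsigned)(raw % SMALL_SIZE));
	printf("big bucket:   %u\n", (unsigned)(raw % BIG_SIZE));
	return 0;
}

The extra modulo at each call site costs little next to computing the
SipHash itself, and it keeps the hash function reusable for the unified
hash table mentioned in the cover letter.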
On Thu, 7 Dec 2023 16:53:49 +1100
David Gibson <david(a)gibson.dropbear.id.au> wrote:

> I now have an in-progress draft of a unified hash table to go with the
> unified flow table.  This turns out to be easier if we first make some
> preliminary changes to the structure of the TCP hash table.  So, here
> are those.
> 
> Changes since v1:
>  * Use while loops instead of some equivalent, but hard to read for
>    loops for the hash probing.
>  * Switch from probing forwards through hash buckets to probing
>    backwards.  This makes the code closer to the version in Knuth its
>    based on, and thus easier to see if we've made a mistake in
>    adaptation.
>  * Improve the helpers for modular arithmetic in use
>  * Correct an error where we had things exactly the wrong way around
>    when finding entries to move during removal.
>  * Add a patch fixing a conceptual / documentation problem in some
>    adjacent code
> 
> David Gibson (4):
>   tcp: Fix conceptually incorrect byte-order switch in tcp_tap_handler()
>   tcp: Switch hash table to linear probing instead of chaining
>   tcp: Implement hash table with indices rather than pointers
>   tcp: Don't account for hash table size in tcp_hash()

Applied.

-- 
Stefano