There are a number of things that are more-or-less general to flows which
are still explicitly handled in tcp.c and tcp_splice.c, including
allocation and freeing of flow entries, and dispatch of deferred and timer
functions.  Even without adding more fields to the common flow structure,
we can handle a number of these in a more flow-centric way.

Unlike v1, this version is based on the hash table rework series.

Changes since v2:
 * Realised the prealloc/commit functions were confusing and worked
   poorly for some future stuff.  Replaced with alloc/alloc_cancel
 * Fixed a bug where newly allocated flow entries might not be 0-filled,
   because of the free tracking information in there.  This could cause
   very subtle problems.

Changes since v1:
 * Store the timestamp of last flow timers run in a global, rather than
   a ctx field
 * Rebased on the TCP hash table rework
 * Add patches 9..13/13 with changes to allocation and freeing of flow
   entries

David Gibson (13):
  flow: Make flow_table.h #include the protocol specific headers it
    needs
  treewide: Standardise on 'now' for current timestamp variables
  tcp, tcp_splice: Remove redundant handling from tcp_timer()
  tcp, tcp_splice: Move per-type cleanup logic into per-type helpers
  flow, tcp: Add flow-centric dispatch for deferred flow handling
  flow, tcp: Add handling for per-flow timers
  epoll: Better handling of number of epoll types
  tcp, tcp_splice: Avoid double layered dispatch for connected TCP
    sockets
  flow: Move flow_log_() to near top of flow.c
  flow: Move flow_count from context structure to a global
  flow: Abstract allocation of new flows with helper function
  flow: Enforce that freeing of closed flows must happen in deferred
    handlers
  flow: Avoid moving flow entries to compact table

 flow.c       | 223 ++++++++++++++++++++++++++++++++++++++++++---------
 flow.h       |   5 +-
 flow_table.h |  20 +++++
 icmp.c       |  12 +--
 icmp.h       |   2 +-
 log.c        |  34 ++++----
 passt.c      |  20 +++--
 passt.h      |   9 +--
 tcp.c        | 143 +++++++++------------------
 tcp.h        |   2 +-
 tcp_conn.h   |   8 +-
 tcp_splice.c |  49 +++++------
 tcp_splice.h |   4 +-
 udp.c        |  16 ++--
 udp.h        |   2 +-
 15 files changed, 324 insertions(+), 225 deletions(-)

--
2.43.0
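The alloc/alloc_cancel rework and the 0-fill bug mentioned in the changelog can be illustrated with a small sketch.  The names, union layout, and free-run bookkeeping below are hypothetical stand-ins, not the actual passt code: the point is only that a freed entry's storage is reused for free-tracking data, so a fresh allocation must be explicitly zeroed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define FLOW_MAX 16

union flow {
	struct {
		bool in_use;
		unsigned type;
	} f;
	struct {	/* free entries reuse their storage for bookkeeping */
		unsigned next_free;
		unsigned free_run;
	} free;
};

static union flow flowtab[FLOW_MAX];
static unsigned flow_count;

static union flow *flow_alloc(void)
{
	union flow *flow;

	if (flow_count >= FLOW_MAX)
		return NULL;

	flow = &flowtab[flow_count++];
	/* The slot may still hold stale free-run bookkeeping: without this
	 * memset(), the new entry's common fields start out as garbage.
	 */
	memset(flow, 0, sizeof(*flow));
	return flow;
}

static void flow_alloc_cancel(union flow *flow)
{
	/* In this sketch, only the most recent allocation can be cancelled */
	assert(flow == &flowtab[flow_count - 1]);
	flow_count--;
	flow->free.next_free = flow_count;	/* back to free bookkeeping */
	flow->free.free_run = FLOW_MAX - flow_count;
}
```

With this split, a caller that discovers mid-setup that a new flow can't proceed simply cancels the allocation, instead of the earlier prealloc/commit pairing the cover letter describes as confusing.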
flow_table.h, the lower level flow header, relies on having the struct
definitions for every protocol specific flow type - so far that means
tcp_conn.h.  It doesn't include it itself, so tcp_conn.h must be included
before flow_table.h.

That's ok for now, but as we use the flow table for more things,
flow_table.h will need the structs for all of them, which means the
protocol specific .c files would need to include tcp_conn.h _and_ the
equivalents for every other flow type before flow_table.h every time,
which is weird.

So, although we *mostly* lean towards the include style where .c files
need to handle the include dependencies, in this case it makes more sense
to have flow_table.h include all the protocol specific headers it needs.

Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>
---
 flow.c       | 1 -
 flow_table.h | 2 ++
 tcp.c        | 1 -
 tcp_splice.c | 1 -
 4 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/flow.c b/flow.c
index d24726d..a1c0a34 100644
--- a/flow.c
+++ b/flow.c
@@ -14,7 +14,6 @@
 #include "siphash.h"
 #include "inany.h"
 #include "flow.h"
-#include "tcp_conn.h"
 #include "flow_table.h"
 
 const char *flow_type_str[] = {
diff --git a/flow_table.h b/flow_table.h
index 0dee66f..e805f10 100644
--- a/flow_table.h
+++ b/flow_table.h
@@ -7,6 +7,8 @@
 #ifndef FLOW_TABLE_H
 #define FLOW_TABLE_H
 
+#include "tcp_conn.h"
+
 /**
  * union flow - Descriptor for a logical packet flow (e.g. connection)
  * @f:	Fields common between all variants
diff --git a/tcp.c b/tcp.c
index 63b39e0..422770f 100644
--- a/tcp.c
+++ b/tcp.c
@@ -298,7 +298,6 @@
 #include "inany.h"
 #include "flow.h"
-#include "tcp_conn.h"
 #include "flow_table.h"
 
 /* Sides of a flow as we use them in "tap" connections */
diff --git a/tcp_splice.c b/tcp_splice.c
index 69ea79d..052f989 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -56,7 +56,6 @@
 #include "inany.h"
 #include "flow.h"
-#include "tcp_conn.h"
 #include "flow_table.h"
 
 #define MAX_PIPE_SIZE (8UL * 1024 * 1024)
--
2.43.0
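The include-order constraint above boils down to a C rule: union flow embeds the protocol-specific connection structs by value, so the compiler needs their complete definitions wherever the union is defined.  A minimal illustration, with stand-in types rather than the real passt structs:

```c
#include <assert.h>
#include <stddef.h>

/* Stands in for struct tcp_tap_conn from tcp_conn.h.  If this definition
 * is not visible below, the union member would be an incomplete type and
 * the union itself could not be defined - hence flow_table.h must include
 * tcp_conn.h (and, later, the header of every other flow type).
 */
struct tcp_tap_conn_sketch {
	int sock;
	int timer;
	unsigned events;
};

/* Stands in for union flow from flow_table.h */
union flow_sketch {
	struct {
		unsigned type;	/* fields common between all variants */
	} f;
	struct tcp_tap_conn_sketch tcp;	/* complete type required here */
};
```

A forward declaration (`struct tcp_tap_conn_sketch;`) would be enough for a pointer member, but not for an embedded one, which is why the usual ".c files handle their own include dependencies" style doesn't work well here.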
In a number of places we pass around a struct timespec representing the
(more or less) current time.  Sometimes we call it 'now', and sometimes we
call it 'ts'.  Standardise on the more informative 'now'.

Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>
---
 icmp.c | 12 ++++++------
 icmp.h |  2 +-
 log.c  | 34 +++++++++++++++++-----------------
 tcp.c  |  6 +++---
 tcp.h  |  2 +-
 udp.c  | 16 ++++++++--------
 udp.h  |  2 +-
 7 files changed, 37 insertions(+), 37 deletions(-)

diff --git a/icmp.c b/icmp.c
index a1de8ae..a9fbcb2 100644
--- a/icmp.c
+++ b/icmp.c
@@ -290,14 +290,14 @@ fail_sock:
  * @c:		Execution context
  * @v6:		Set for IPv6 echo identifier bindings
  * @id:		Echo identifier, host order
- * @ts:		Timestamp from caller
+ * @now:	Current timestamp
  */
 static void icmp_timer_one(const struct ctx *c, int v6, uint16_t id,
-			   const struct timespec *ts)
+			   const struct timespec *now)
 {
 	struct icmp_id_sock *id_map = &icmp_id_map[v6 ? V6 : V4][id];
 
-	if (ts->tv_sec - id_map->ts <= ICMP_ECHO_TIMEOUT)
+	if (now->tv_sec - id_map->ts <= ICMP_ECHO_TIMEOUT)
 		return;
 
 	bitmap_clear(icmp_act[v6 ? V6 : V4], id);
@@ -311,9 +311,9 @@ static void icmp_timer_one(const struct ctx *c, int v6, uint16_t id,
 /**
  * icmp_timer() - Scan activity bitmap for identifiers with timed events
  * @c:		Execution context
- * @ts:		Timestamp from caller
+ * @now:	Current timestamp
  */
-void icmp_timer(const struct ctx *c, const struct timespec *ts)
+void icmp_timer(const struct ctx *c, const struct timespec *now)
 {
 	long *word, tmp;
 	unsigned int i;
@@ -325,7 +325,7 @@ v6:
 		tmp = *word;
 		while ((n = ffsl(tmp))) {
 			tmp &= ~(1UL << (n - 1));
-			icmp_timer_one(c, v6, i * 8 + n - 1, ts);
+			icmp_timer_one(c, v6, i * 8 + n - 1, now);
 		}
 	}
 
diff --git a/icmp.h b/icmp.h
index 44cc495..1a08594 100644
--- a/icmp.h
+++ b/icmp.h
@@ -15,7 +15,7 @@ void icmpv6_sock_handler(const struct ctx *c, union epoll_ref ref);
 int icmp_tap_handler(const struct ctx *c, uint8_t pif, int af,
 		     const void *saddr, const void *daddr,
 		     const struct pool *p, const struct timespec *now);
-void icmp_timer(const struct ctx *c, const struct timespec *ts);
+void icmp_timer(const struct ctx *c, const struct timespec *now);
 void icmp_init(void);
 
 /**
diff --git a/log.c b/log.c
index b206f72..d7f6d35 100644
--- a/log.c
+++ b/log.c
@@ -216,11 +216,11 @@ void logfile_init(const char *name, const char *path, size_t size)
 /**
  * logfile_rotate_fallocate() - Write header, set log_written after fallocate()
  * @fd:		Log file descriptor
- * @ts:		Current timestamp
+ * @now:	Current timestamp
  *
  * #syscalls lseek ppc64le:_llseek ppc64:_llseek armv6l:_llseek armv7l:_llseek
  */
-static void logfile_rotate_fallocate(int fd, const struct timespec *ts)
+static void logfile_rotate_fallocate(int fd, const struct timespec *now)
 {
 	char buf[BUFSIZ], *nl;
 	int n;
@@ -232,8 +232,8 @@ static void logfile_rotate_fallocate(int fd, const struct timespec *ts)
 	n = snprintf(buf, BUFSIZ,
 		     "%s - log truncated at %lli.%04lli", log_header,
-		     (long long int)(ts->tv_sec - log_start),
-		     (long long int)(ts->tv_nsec / (100L * 1000)));
+		     (long long int)(now->tv_sec - log_start),
+		     (long long int)(now->tv_nsec / (100L * 1000)));
 
 	/* Avoid partial lines by padding the header with spaces */
 	nl = memchr(buf + n + 1, '\n', BUFSIZ - n - 1);
@@ -252,20 +252,20 @@ static void logfile_rotate_fallocate(int fd, const struct timespec *ts)
 /**
  * logfile_rotate_move() - Fallback: move recent entries toward start, then cut
  * @fd:		Log file descriptor
- * @ts:		Current timestamp
+ * @now:	Current timestamp
  *
  * #syscalls lseek ppc64le:_llseek ppc64:_llseek armv6l:_llseek armv7l:_llseek
  * #syscalls ftruncate
 */
-static void logfile_rotate_move(int fd, const struct timespec *ts)
+static void logfile_rotate_move(int fd, const struct timespec *now)
 {
 	int header_len, write_offset, end, discard, n;
 	char buf[BUFSIZ], *nl;
 
 	header_len = snprintf(buf, BUFSIZ,
 			      "%s - log truncated at %lli.%04lli\n", log_header,
-			      (long long int)(ts->tv_sec - log_start),
-			      (long long int)(ts->tv_nsec / (100L * 1000)));
+			      (long long int)(now->tv_sec - log_start),
+			      (long long int)(now->tv_nsec / (100L * 1000)));
 	if (lseek(fd, 0, SEEK_SET) == -1)
 		return;
 	if (write(fd, buf, header_len) == -1)
@@ -314,7 +314,7 @@ out:
 /**
  * logfile_rotate() - "Rotate" log file once it's full
  * @fd:		Log file descriptor
- * @ts:		Current timestamp
+ * @now:	Current timestamp
  *
  * Return: 0 on success, negative error code on failure
  *
@@ -322,7 +322,7 @@ out:
  *
  * fallocate() passed as EXTRA_SYSCALL only if FALLOC_FL_COLLAPSE_RANGE is there
  */
-static int logfile_rotate(int fd, const struct timespec *ts)
+static int logfile_rotate(int fd, const struct timespec *now)
 {
 	if (fcntl(fd, F_SETFL, O_RDWR /* Drop O_APPEND: explicit lseek() */))
 		return -errno;
@@ -330,10 +330,10 @@ static int logfile_rotate(int fd, const struct timespec *ts)
 #ifdef FALLOC_FL_COLLAPSE_RANGE
 	/* Only for Linux >= 3.15, extent-based ext4 or XFS, glibc >= 2.18 */
 	if (!fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 0, log_cut_size))
-		logfile_rotate_fallocate(fd, ts);
+		logfile_rotate_fallocate(fd, now);
 	else
 #endif
-		logfile_rotate_move(fd, ts);
+		logfile_rotate_move(fd, now);
 
 	if (fcntl(fd, F_SETFL, O_RDWR | O_APPEND))
 		return -errno;
@@ -349,16 +349,16 @@ static int logfile_rotate(int fd, const struct timespec *ts)
 */
 void logfile_write(int pri, const char *format, va_list ap)
 {
-	struct timespec ts;
+	struct timespec now;
 	char buf[BUFSIZ];
 	int n;
 
-	if (clock_gettime(CLOCK_REALTIME, &ts))
+	if (clock_gettime(CLOCK_REALTIME, &now))
 		return;
 
 	n = snprintf(buf, BUFSIZ, "%lli.%04lli: %s",
-		     (long long int)(ts.tv_sec - log_start),
-		     (long long int)(ts.tv_nsec / (100L * 1000)),
+		     (long long int)(now.tv_sec - log_start),
+		     (long long int)(now.tv_nsec / (100L * 1000)),
 		     logfile_prefix[pri]);
 
 	n += vsnprintf(buf + n, BUFSIZ - n, format, ap);
@@ -366,7 +366,7 @@ void logfile_write(int pri, const char *format, va_list ap)
 	if (format[strlen(format)] != '\n')
 		n += snprintf(buf + n, BUFSIZ - n, "\n");
 
-	if ((log_written + n >= log_size) && logfile_rotate(log_file, &ts))
+	if ((log_written + n >= log_size) && logfile_rotate(log_file, &now))
 		return;
 
 	if ((n = write(log_file, buf, n)) >= 0)
diff --git a/tcp.c b/tcp.c
index 422770f..1628896 100644
--- a/tcp.c
+++ b/tcp.c
@@ -3180,13 +3180,13 @@ static int tcp_port_rebind_outbound(void *arg)
 /**
  * tcp_timer() - Periodic tasks: port detection, closed connections, pool refill
  * @c:		Execution context
- * @ts:		Unused
+ * @now:	Current timestamp
 */
-void tcp_timer(struct ctx *c, const struct timespec *ts)
+void tcp_timer(struct ctx *c, const struct timespec *now)
 {
 	union flow *flow;
 
-	(void)ts;
+	(void)now;
 
 	if (c->mode == MODE_PASTA) {
 		if (c->tcp.fwd_out.mode == FWD_AUTO) {
diff --git a/tcp.h b/tcp.h
index 27b1166..594d71a 100644
--- a/tcp.h
+++ b/tcp.h
@@ -20,7 +20,7 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, int af,
 int tcp_sock_init(const struct ctx *c, sa_family_t af, const void *addr,
 		  const char *ifname, in_port_t port);
 int tcp_init(struct ctx *c);
-void tcp_timer(struct ctx *c, const struct timespec *ts);
+void tcp_timer(struct ctx *c, const struct timespec *now);
 void tcp_defer_handler(struct ctx *c);
 
 void tcp_sock_set_bufsize(const struct ctx *c, int s);
diff --git a/udp.c b/udp.c
index 1f8c306..0b4582c 100644
--- a/udp.c
+++ b/udp.c
@@ -1140,10 +1140,10 @@ int udp_init(struct ctx *c)
  * @v6:		Set for IPv6 connections
  * @type:	Socket type
  * @port:	Port number, host order
- * @ts:		Timestamp from caller
+ * @now:	Current timestamp
 */
 static void udp_timer_one(struct ctx *c, int v6, enum udp_act_type type,
-			  in_port_t port, const struct timespec *ts)
+			  in_port_t port, const struct timespec *now)
 {
 	struct udp_splice_port *sp;
 	struct udp_tap_port *tp;
@@ -1153,7 +1153,7 @@ static void udp_timer_one(struct ctx *c, int v6, enum udp_act_type type,
 	case UDP_ACT_TAP:
 		tp = &udp_tap_map[v6 ? V6 : V4][port];
 
-		if (ts->tv_sec - tp->ts > UDP_CONN_TIMEOUT) {
+		if (now->tv_sec - tp->ts > UDP_CONN_TIMEOUT) {
 			sockp = &tp->sock;
 			tp->flags = 0;
 		}
@@ -1162,14 +1162,14 @@ static void udp_timer_one(struct ctx *c, int v6, enum udp_act_type type,
 	case UDP_ACT_SPLICE_INIT:
 		sp = &udp_splice_init[v6 ? V6 : V4][port];
 
-		if (ts->tv_sec - sp->ts > UDP_CONN_TIMEOUT)
+		if (now->tv_sec - sp->ts > UDP_CONN_TIMEOUT)
 			sockp = &sp->sock;
 		break;
 
 	case UDP_ACT_SPLICE_NS:
 		sp = &udp_splice_ns[v6 ? V6 : V4][port];
 
-		if (ts->tv_sec - sp->ts > UDP_CONN_TIMEOUT)
+		if (now->tv_sec - sp->ts > UDP_CONN_TIMEOUT)
 			sockp = &sp->sock;
 		break;
 
@@ -1249,9 +1249,9 @@ static int udp_port_rebind_outbound(void *arg)
 /**
  * udp_timer() - Scan activity bitmaps for ports with associated timed events
  * @c:		Execution context
- * @ts:		Timestamp from caller
+ * @now:	Current timestamp
 */
-void udp_timer(struct ctx *c, const struct timespec *ts)
+void udp_timer(struct ctx *c, const struct timespec *now)
 {
 	int n, t, v6 = 0;
 	unsigned int i;
@@ -1281,7 +1281,7 @@ v6:
 			tmp = *word;
 			while ((n = ffsl(tmp))) {
 				tmp &= ~(1UL << (n - 1));
-				udp_timer_one(c, v6, t, i * 8 + n - 1, ts);
+				udp_timer_one(c, v6, t, i * 8 + n - 1, now);
 			}
 		}
 	}
diff --git a/udp.h b/udp.h
index 85ebaaa..087e482 100644
--- a/udp.h
+++ b/udp.h
@@ -17,7 +17,7 @@ int udp_tap_handler(struct ctx *c, uint8_t pif, int af,
 int udp_sock_init(const struct ctx *c, int ns, sa_family_t af, const void *addr,
 		  const char *ifname, in_port_t port);
 int udp_init(struct ctx *c);
-void udp_timer(struct ctx *c, const struct timespec *ts);
+void udp_timer(struct ctx *c, const struct timespec *now);
 void udp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s);
 
 /**
--
2.43.0
tcp_timer() scans the connection table, expiring "tap" connections and
calling tcp_splice_timer() for "splice" connections.  tcp_splice_timer()
expires spliced connections and then does some other processing.

However, tcp_timer() is always called shortly after tcp_defer_handler()
(from post_handler()), which also scans the flow table expiring both tap
and spliced connections.  So remove the redundant handling, and only do
the extra tcp_splice_timer() work from tcp_timer().

Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>
---
 tcp.c        | 15 ++-------------
 tcp_conn.h   |  2 +-
 tcp_splice.c |  7 ++-----
 3 files changed, 5 insertions(+), 19 deletions(-)

diff --git a/tcp.c b/tcp.c
index 1628896..85d1146 100644
--- a/tcp.c
+++ b/tcp.c
@@ -3200,20 +3200,9 @@ void tcp_timer(struct ctx *c, const struct timespec *now)
 		}
 	}
 
-	for (flow = flowtab + c->flow_count - 1; flow >= flowtab; flow--) {
-		switch (flow->f.type) {
-		case FLOW_TCP:
-			if (flow->tcp.events == CLOSED)
-				tcp_conn_destroy(c, flow);
-			break;
-		case FLOW_TCP_SPLICE:
+	for (flow = flowtab + c->flow_count - 1; flow >= flowtab; flow--)
+		if (flow->f.type == FLOW_TCP_SPLICE)
 			tcp_splice_timer(c, flow);
-			break;
-		default:
-			die("Unexpected %s in tcp_timer()",
-			    FLOW_TYPE(&flow->f));
-		}
-	}
 
 	tcp_sock_refill_init(c);
 	if (c->mode == MODE_PASTA)
diff --git a/tcp_conn.h b/tcp_conn.h
index e3400bb..e98559c 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -159,7 +159,7 @@ void tcp_tap_conn_update(const struct ctx *c, struct tcp_tap_conn *old,
 			 struct tcp_tap_conn *new);
 void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new);
 void tcp_splice_destroy(struct ctx *c, union flow *flow);
-void tcp_splice_timer(struct ctx *c, union flow *flow);
+void tcp_splice_timer(const struct ctx *c, union flow *flow);
 int tcp_conn_pool_sock(int pool[]);
 int tcp_conn_new_sock(const struct ctx *c, sa_family_t af);
 void tcp_sock_refill_pool(const struct ctx *c, int pool[], int af);
diff --git a/tcp_splice.c b/tcp_splice.c
index 052f989..cf9b4e8 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -755,15 +755,12 @@ void tcp_splice_init(struct ctx *c)
  * @c:		Execution context
  * @flow:	Flow table entry
 */
-void tcp_splice_timer(struct ctx *c, union flow *flow)
+void tcp_splice_timer(const struct ctx *c, union flow *flow)
 {
 	struct tcp_splice_conn *conn = &flow->tcp_splice;
 	int side;
 
-	if (conn->flags & CLOSING) {
-		tcp_splice_destroy(c, flow);
-		return;
-	}
+	ASSERT(!(conn->flags & CLOSING));
 
 	for (side = 0; side < SIDES; side++) {
 		uint8_t set = side == 0 ? RCVLOWAT_SET_0 : RCVLOWAT_SET_1;
--
2.43.0
tcp_conn_destroy() and tcp_splice_destroy() are always called
conditionally, on the connection being closed or closing.  Move that logic
into the "destroy" functions themselves, renaming them tcp_flow_defer()
and tcp_splice_flow_defer().

Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>
---
 tcp.c        | 13 +++++++------
 tcp_conn.h   |  2 +-
 tcp_splice.c |  9 ++++++---
 3 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/tcp.c b/tcp.c
index 85d1146..ad1a70d 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1302,14 +1302,17 @@ static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c,
 }
 
 /**
- * tcp_conn_destroy() - Close sockets, trigger hash table removal and compaction
+ * tcp_flow_defer() - Deferred per-flow handling (clean up closed connections)
  * @c:		Execution context
  * @flow:	Flow table entry for this connection
 */
-static void tcp_conn_destroy(struct ctx *c, union flow *flow)
+static void tcp_flow_defer(struct ctx *c, union flow *flow)
 {
 	const struct tcp_tap_conn *conn = &flow->tcp;
 
+	if (flow->tcp.events != CLOSED)
+		return;
+
 	close(conn->sock);
 	if (conn->timer != -1)
 		close(conn->timer);
@@ -1371,12 +1374,10 @@ void tcp_defer_handler(struct ctx *c)
 	for (flow = flowtab + c->flow_count - 1; flow >= flowtab; flow--) {
 		switch (flow->f.type) {
 		case FLOW_TCP:
-			if (flow->tcp.events == CLOSED)
-				tcp_conn_destroy(c, flow);
+			tcp_flow_defer(c, flow);
 			break;
 		case FLOW_TCP_SPLICE:
-			if (flow->tcp_splice.flags & CLOSING)
-				tcp_splice_destroy(c, flow);
+			tcp_splice_flow_defer(c, flow);
 			break;
 		default:
 			die("Unexpected %s in tcp_defer_handler()",
diff --git a/tcp_conn.h b/tcp_conn.h
index e98559c..4846565 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -158,7 +158,7 @@ extern int init_sock_pool6	[TCP_SOCK_POOL_SIZE];
 void tcp_tap_conn_update(const struct ctx *c, struct tcp_tap_conn *old,
 			 struct tcp_tap_conn *new);
 void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new);
-void tcp_splice_destroy(struct ctx *c, union flow *flow);
+void tcp_splice_flow_defer(struct ctx *c, union flow *flow);
 void tcp_splice_timer(const struct ctx *c, union flow *flow);
 int tcp_conn_pool_sock(int pool[]);
 int tcp_conn_new_sock(const struct ctx *c, sa_family_t af);
diff --git a/tcp_splice.c b/tcp_splice.c
index cf9b4e8..09aa20f 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -243,15 +243,18 @@ void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new)
 }
 
 /**
- * tcp_splice_destroy() - Close spliced connection and pipes, clear
+ * tcp_splice_flow_defer() - Deferred per-flow handling (clean up closed)
  * @c:		Execution context
- * @flow:	Flow table entry
+ * @flow:	Flow table entry for this connection
 */
-void tcp_splice_destroy(struct ctx *c, union flow *flow)
+void tcp_splice_flow_defer(struct ctx *c, union flow *flow)
 {
 	struct tcp_splice_conn *conn = &flow->tcp_splice;
 	unsigned side;
 
+	if (!(flow->tcp_splice.flags & CLOSING))
+		return;
+
 	for (side = 0; side < SIDES; side++) {
 		if (conn->events & SPLICE_ESTABLISHED) {
 			/* Flushing might need to block: don't recycle them. */
--
2.43.0
tcp_defer_handler(), amongst other things, scans the flow table and does
some processing for each TCP connection.  When we add other protocols to
the flow table, they're likely to want some similar scanning.  It makes
more sense for cache friendliness to perform a single scan of the flow
table and dispatch to the protocol specific handlers, rather than having
each protocol separately scan the table.

To that end, add a new flow_defer_handler() handling all flow-linked
deferred operations.

Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>
---
 flow.c     | 23 +++++++++++++++++++++++
 flow.h     |  1 +
 passt.c    |  1 +
 tcp.c      | 19 ++-----------------
 tcp_conn.h |  1 +
 5 files changed, 28 insertions(+), 17 deletions(-)

diff --git a/flow.c b/flow.c
index a1c0a34..0a0402d 100644
--- a/flow.c
+++ b/flow.c
@@ -83,3 +83,26 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
 
 	logmsg(pri, "Flow %u (%s): %s", flow_idx(f), FLOW_TYPE(f), msg);
 }
+
+/**
+ * flow_defer_handler() - Handler for per-flow deferred tasks
+ * @c:	Execution context
+ */
+void flow_defer_handler(struct ctx *c)
+{
+	union flow *flow;
+
+	for (flow = flowtab + c->flow_count - 1; flow >= flowtab; flow--) {
+		switch (flow->f.type) {
+		case FLOW_TCP:
+			tcp_flow_defer(c, flow);
+			break;
+		case FLOW_TCP_SPLICE:
+			tcp_splice_flow_defer(c, flow);
+			break;
+		default:
+			/* Assume other flow types don't need any handling */
+			;
+		}
+	}
+}
diff --git a/flow.h b/flow.h
index 959b461..6b17fa8 100644
--- a/flow.h
+++ b/flow.h
@@ -67,6 +67,7 @@ static inline bool flow_sidx_eq(flow_sidx_t a, flow_sidx_t b)
 union flow;
 
 void flow_table_compact(struct ctx *c, union flow *hole);
+void flow_defer_handler(struct ctx *c);
 
 void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
 	__attribute__((format(printf, 3, 4)));
diff --git a/passt.c b/passt.c
index 0246b04..5f72a28 100644
--- a/passt.c
+++ b/passt.c
@@ -103,6 +103,7 @@ static void post_handler(struct ctx *c, const struct timespec *now)
 	/* NOLINTNEXTLINE(bugprone-branch-clone): intervals can be the same */
 	CALL_PROTO_HANDLER(c, now, icmp, ICMP);
 
+	flow_defer_handler(c);
 #undef CALL_PROTO_HANDLER
 }
diff --git a/tcp.c b/tcp.c
index ad1a70d..9230d80 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1306,7 +1306,7 @@ static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c,
  * @c:		Execution context
  * @flow:	Flow table entry for this connection
 */
-static void tcp_flow_defer(struct ctx *c, union flow *flow)
+void tcp_flow_defer(struct ctx *c, union flow *flow)
 {
 	const struct tcp_tap_conn *conn = &flow->tcp;
 
@@ -1364,26 +1364,11 @@ static void tcp_l2_data_buf_flush(const struct ctx *c)
  * tcp_defer_handler() - Handler for TCP deferred tasks
  * @c:		Execution context
 */
+/* cppcheck-suppress constParameterPointer */
 void tcp_defer_handler(struct ctx *c)
 {
-	union flow *flow;
-
 	tcp_l2_flags_buf_flush(c);
 	tcp_l2_data_buf_flush(c);
-
-	for (flow = flowtab + c->flow_count - 1; flow >= flowtab; flow--) {
-		switch (flow->f.type) {
-		case FLOW_TCP:
-			tcp_flow_defer(c, flow);
-			break;
-		case FLOW_TCP_SPLICE:
-			tcp_splice_flow_defer(c, flow);
-			break;
-		default:
-			die("Unexpected %s in tcp_defer_handler()",
-			    FLOW_TYPE(&flow->f));
-		}
-	}
 }
diff --git a/tcp_conn.h b/tcp_conn.h
index 4846565..72b9ecb 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -158,6 +158,7 @@ extern int init_sock_pool6	[TCP_SOCK_POOL_SIZE];
 void tcp_tap_conn_update(const struct ctx *c, struct tcp_tap_conn *old,
 			 struct tcp_tap_conn *new);
 void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new);
+void tcp_flow_defer(struct ctx *c, union flow *flow);
 void tcp_splice_flow_defer(struct ctx *c, union flow *flow);
 void tcp_splice_timer(const struct ctx *c, union flow *flow);
 int tcp_conn_pool_sock(int pool[]);
--
2.43.0
On Thu, 21 Dec 2023 17:15:41 +1100
David Gibson <david(a)gibson.dropbear.id.au> wrote:

> tcp_defer_handler(), amongst other things, scans the flow table and does
> some processing for each TCP connection.  When we add other protocols to
> the flow table, they're likely to want some similar scanning. [...]
>
> [full patch quoted]
>
> +/* cppcheck-suppress constParameterPointer */
>  void tcp_defer_handler(struct ctx *c)

This needs to be:

	/* cppcheck-suppress [constParameterPointer, unmatchedSuppression] */

otherwise we get warnings with cppcheck 2.10, and we'll get warnings if
cppcheck's behaviour ever changes again.

--
Stefano
On Thu, Dec 28, 2023 at 07:24:46PM +0100, Stefano Brivio wrote:
> On Thu, 21 Dec 2023 17:15:41 +1100
> David Gibson <david(a)gibson.dropbear.id.au> wrote:
>
> > [full patch quoted]
> >
> > +/* cppcheck-suppress constParameterPointer */
>
> This needs to be:
>
> 	/* cppcheck-suppress [constParameterPointer, unmatchedSuppression] */
>
> otherwise we get warnings with cppcheck 2.10,

Drat, do we?  I was hoping this was a new warning type with the newer
cppcheck, and it would ignore the suppression if it was for a warning type
it didn't know about.

> and we'll get warnings if cppcheck's behaviour ever changes again.

That's actually a good thing.  This one isn't a workaround for a cppcheck
false positive or weird semantic that we hope will go away.  The warning
is real and correct as far as it goes.  The problem is that the signature
needs to match that of other deferred handlers because of how we generate
the calls from a macro.  Some of those others need write access to the
context.

--
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
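The signature constraint described above can be sketched in a few lines.  This is a hypothetical illustration, not the actual passt macro or handlers: the point is that when one macro generates every call site, all handlers must share the same prototype, so even a handler that never writes to the context cannot take a const pointer.

```c
#include <assert.h>

/* Hypothetical, simplified context: the real passt struct ctx differs */
struct ctx {
	int tcp_calls;
};

static void tcp_defer_handler(struct ctx *c)
{
	c->tcp_calls++;		/* genuinely needs write access */
}

/* cppcheck would suggest 'const struct ctx *c' here, but the macro below
 * forces the non-const signature - hence the suppression in the patch.
 */
static void flow_defer_handler(struct ctx *c)
{
	(void)c;		/* never writes, but must match the shape */
}

/* One macro expands every call site, forcing the common signature */
#define CALL_DEFER_HANDLER(c, pfx) pfx##_defer_handler(c)

static void post_handler(struct ctx *c)
{
	CALL_DEFER_HANDLER(c, tcp);
	CALL_DEFER_HANDLER(c, flow);
}
```

Making only some handlers take a const context would break the token-pasted call: the macro can't vary the argument's constness per handler.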
On Sun, 31 Dec 2023 16:56:30 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:On Thu, Dec 28, 2023 at 07:24:46PM +0100, Stefano Brivio wrote:Yeah... go figure. On the other hand it's not really a new type in the sense that this _should_ have been covered by a "constParameter" warning, before 2.11.On Thu, 21 Dec 2023 17:15:41 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Drat, do we? I was hoping this was a new warning type with the newer cppcheck, and it would ignore the suppression if it was for a warning type it didn't know about.tcp_defer_handler(), amongst other things, scans the flow table and does some processing for each TCP connection. When we add other protocols to the flow table, they're likely to want some similar scanning. It makes more sense for cache friendliness to perform a single scan of the flow table and dispatch to the protocol specific handlers, rather than having each protocol separately scan the table. To that end, add a new flow_defer_handler() handling all flow-linked deferred operations. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 23 +++++++++++++++++++++++ flow.h | 1 + passt.c | 1 + tcp.c | 19 ++----------------- tcp_conn.h | 1 + 5 files changed, 28 insertions(+), 17 deletions(-) diff --git a/flow.c b/flow.c index a1c0a34..0a0402d 100644 --- a/flow.c +++ b/flow.c @@ -83,3 +83,26 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) 
logmsg(pri, "Flow %u (%s): %s", flow_idx(f), FLOW_TYPE(f), msg); } + +/** + * flow_defer_handler() - Handler for per-flow deferred tasks + * @c: Execution context + */ +void flow_defer_handler(struct ctx *c) +{ + union flow *flow; + + for (flow = flowtab + c->flow_count - 1; flow >= flowtab; flow--) { + switch (flow->f.type) { + case FLOW_TCP: + tcp_flow_defer(c, flow); + break; + case FLOW_TCP_SPLICE: + tcp_splice_flow_defer(c, flow); + break; + default: + /* Assume other flow types don't need any handling */ + ; + } + } +} diff --git a/flow.h b/flow.h index 959b461..6b17fa8 100644 --- a/flow.h +++ b/flow.h @@ -67,6 +67,7 @@ static inline bool flow_sidx_eq(flow_sidx_t a, flow_sidx_t b) union flow; void flow_table_compact(struct ctx *c, union flow *hole); +void flow_defer_handler(struct ctx *c); void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) __attribute__((format(printf, 3, 4))); diff --git a/passt.c b/passt.c index 0246b04..5f72a28 100644 --- a/passt.c +++ b/passt.c @@ -103,6 +103,7 @@ static void post_handler(struct ctx *c, const struct timespec *now) /* NOLINTNEXTLINE(bugprone-branch-clone): intervals can be the same */ CALL_PROTO_HANDLER(c, now, icmp, ICMP); + flow_defer_handler(c); #undef CALL_PROTO_HANDLER } diff --git a/tcp.c b/tcp.c index ad1a70d..9230d80 100644 --- a/tcp.c +++ b/tcp.c @@ -1306,7 +1306,7 @@ static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c, * @c: Execution context * @flow: Flow table entry for this connection */ -static void tcp_flow_defer(struct ctx *c, union flow *flow) +void tcp_flow_defer(struct ctx *c, union flow *flow) { const struct tcp_tap_conn *conn = &flow->tcp; @@ -1364,26 +1364,11 @@ static void tcp_l2_data_buf_flush(const struct ctx *c) * tcp_defer_handler() - Handler for TCP deferred tasks * @c: Execution context */ +/* cppcheck-suppress constParameterPointer */This needs to be: /* cppcheck-suppress [constParameterPointer, unmatchedSuppression] */ otherwise we get warnings with 
cppcheck 2.10,
Oops, I didn't realise.
Well, in any case, then we don't expect cppcheck's behaviour to ever change in this regard, so I don't see any advantage omitting unmatchedSuppression here. -- Stefano
and we'll get warnings if cppcheck's behaviour ever changes again.
That's actually a good thing. This one isn't a workaround for a cppcheck false positive or weird semantic that we hope will go away. The warning is real and correct as far as it goes. The problem is that the signature needs to match that of other deferred handlers because of how we generate the calls from a macro. Some of those others need write access to the context.
On Tue, Jan 02, 2024 at 07:13:28PM +0100, Stefano Brivio wrote:On Sun, 31 Dec 2023 16:56:30 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Dang. Oh well, updated.On Thu, Dec 28, 2023 at 07:24:46PM +0100, Stefano Brivio wrote:Yeah... go figure. On the other hand it's not really a new type in the sense that this _should_ have been covered by a "constParameter" warning, before 2.11.On Thu, 21 Dec 2023 17:15:41 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Drat, do we? I was hoping this was a new warning type with the newer cppcheck, and it would ignore the suppression if it was for a warning type it didn't know about.tcp_defer_handler(), amongst other things, scans the flow table and does some processing for each TCP connection. When we add other protocols to the flow table, they're likely to want some similar scanning. It makes more sense for cache friendliness to perform a single scan of the flow table and dispatch to the protocol specific handlers, rather than having each protocol separately scan the table. To that end, add a new flow_defer_handler() handling all flow-linked deferred operations. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 23 +++++++++++++++++++++++ flow.h | 1 + passt.c | 1 + tcp.c | 19 ++----------------- tcp_conn.h | 1 + 5 files changed, 28 insertions(+), 17 deletions(-) diff --git a/flow.c b/flow.c index a1c0a34..0a0402d 100644 --- a/flow.c +++ b/flow.c @@ -83,3 +83,26 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) 
logmsg(pri, "Flow %u (%s): %s", flow_idx(f), FLOW_TYPE(f), msg); } + +/** + * flow_defer_handler() - Handler for per-flow deferred tasks + * @c: Execution context + */ +void flow_defer_handler(struct ctx *c) +{ + union flow *flow; + + for (flow = flowtab + c->flow_count - 1; flow >= flowtab; flow--) { + switch (flow->f.type) { + case FLOW_TCP: + tcp_flow_defer(c, flow); + break; + case FLOW_TCP_SPLICE: + tcp_splice_flow_defer(c, flow); + break; + default: + /* Assume other flow types don't need any handling */ + ; + } + } +} diff --git a/flow.h b/flow.h index 959b461..6b17fa8 100644 --- a/flow.h +++ b/flow.h @@ -67,6 +67,7 @@ static inline bool flow_sidx_eq(flow_sidx_t a, flow_sidx_t b) union flow; void flow_table_compact(struct ctx *c, union flow *hole); +void flow_defer_handler(struct ctx *c); void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) __attribute__((format(printf, 3, 4))); diff --git a/passt.c b/passt.c index 0246b04..5f72a28 100644 --- a/passt.c +++ b/passt.c @@ -103,6 +103,7 @@ static void post_handler(struct ctx *c, const struct timespec *now) /* NOLINTNEXTLINE(bugprone-branch-clone): intervals can be the same */ CALL_PROTO_HANDLER(c, now, icmp, ICMP); + flow_defer_handler(c); #undef CALL_PROTO_HANDLER } diff --git a/tcp.c b/tcp.c index ad1a70d..9230d80 100644 --- a/tcp.c +++ b/tcp.c @@ -1306,7 +1306,7 @@ static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c, * @c: Execution context * @flow: Flow table entry for this connection */ -static void tcp_flow_defer(struct ctx *c, union flow *flow) +void tcp_flow_defer(struct ctx *c, union flow *flow) { const struct tcp_tap_conn *conn = &flow->tcp; @@ -1364,26 +1364,11 @@ static void tcp_l2_data_buf_flush(const struct ctx *c) * tcp_defer_handler() - Handler for TCP deferred tasks * @c: Execution context */ +/* cppcheck-suppress constParameterPointer */This needs to be: /* cppcheck-suppress [constParameterPointer, unmatchedSuppression] */ otherwise we get warnings with 
cppcheck 2.10,
-- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Oops, I didn't realise. Well, in any case, then we don't expect cppcheck's behaviour to ever change in this regard, so I don't see any advantage omitting unmatchedSuppression here.
and we'll get warnings if cppcheck's behaviour ever changes again.
That's actually a good thing. This one isn't a workaround for a cppcheck false positive or weird semantic that we hope will go away. The warning is real and correct as far as it goes. The problem is that the signature needs to match that of other deferred handlers because of how we generate the calls from a macro. Some of those others need write access to the context.
tcp_timer() scans the flow table so that it can run tcp_splice_timer() on each spliced connection. More generally, other flow types might want to run similar timers in future. We could add a flow_timer() analogous to tcp_timer(), udp_timer() etc. However, this would need to scan the flow table, which we would have just done in flow_defer_handler(). We'd prefer to just scan the flow table once, dispatching both per-flow deferred events and per-flow timed events if necessary. So, extend flow_defer_handler() to do this. For now we use the same timer interval for all flow types (1s). We can make that more flexible in future if we need to. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 16 ++++++++++++++-- flow.h | 4 +++- passt.c | 7 ++++--- tcp.c | 6 ------ 4 files changed, 21 insertions(+), 12 deletions(-) diff --git a/flow.c b/flow.c index 0a0402d..ef129db 100644 --- a/flow.c +++ b/flow.c @@ -27,6 +27,9 @@ static_assert(ARRAY_SIZE(flow_type_str) == FLOW_NUM_TYPES, /* Global Flow Table */ union flow flowtab[FLOW_MAX]; +/* Last time the flow timers ran */ +static struct timespec flow_timer_run; + /** * flow_table_compact() - Perform compaction on flow table * @c: Execution context @@ -85,13 +88,20 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) 
} /** - * flow_defer_handler() - Handler for per-flow deferred tasks + * flow_defer_handler() - Handler for per-flow deferred and timed tasks * @c: Execution context + * @now: Current timestamp */ -void flow_defer_handler(struct ctx *c) +void flow_defer_handler(struct ctx *c, const struct timespec *now) { + bool timer = false; union flow *flow; + if (timespec_diff_ms(now, &flow_timer_run) >= FLOW_TIMER_INTERVAL) { + timer = true; + flow_timer_run = *now; + } + for (flow = flowtab + c->flow_count - 1; flow >= flowtab; flow--) { switch (flow->f.type) { case FLOW_TCP: @@ -99,6 +109,8 @@ void flow_defer_handler(struct ctx *c) break; case FLOW_TCP_SPLICE: tcp_splice_flow_defer(c, flow); + if (timer) + tcp_splice_timer(c, flow); break; default: /* Assume other flow types don't need any handling */ diff --git a/flow.h b/flow.h index 6b17fa8..423e685 100644 --- a/flow.h +++ b/flow.h @@ -7,6 +7,8 @@ #ifndef FLOW_H #define FLOW_H +#define FLOW_TIMER_INTERVAL 1000 /* ms */ + /** * enum flow_type - Different types of packet flows we track */ @@ -67,7 +69,7 @@ static inline bool flow_sidx_eq(flow_sidx_t a, flow_sidx_t b) union flow; void flow_table_compact(struct ctx *c, union flow *hole); -void flow_defer_handler(struct ctx *c); +void flow_defer_handler(struct ctx *c, const struct timespec *now); void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) 
__attribute__((format(printf, 3, 4))); diff --git a/passt.c b/passt.c index 5f72a28..870064f 100644 --- a/passt.c +++ b/passt.c @@ -53,8 +53,9 @@ #define EPOLL_EVENTS 8 -#define __TIMER_INTERVAL MIN(TCP_TIMER_INTERVAL, UDP_TIMER_INTERVAL) -#define TIMER_INTERVAL MIN(__TIMER_INTERVAL, ICMP_TIMER_INTERVAL) +#define TIMER_INTERVAL__ MIN(TCP_TIMER_INTERVAL, UDP_TIMER_INTERVAL) +#define TIMER_INTERVAL_ MIN(TIMER_INTERVAL__, ICMP_TIMER_INTERVAL) +#define TIMER_INTERVAL MIN(TIMER_INTERVAL_, FLOW_TIMER_INTERVAL) char pkt_buf[PKT_BUF_BYTES] __attribute__ ((aligned(PAGE_SIZE))); @@ -103,7 +104,7 @@ static void post_handler(struct ctx *c, const struct timespec *now) /* NOLINTNEXTLINE(bugprone-branch-clone): intervals can be the same */ CALL_PROTO_HANDLER(c, now, icmp, ICMP); - flow_defer_handler(c); + flow_defer_handler(c, now); #undef CALL_PROTO_HANDLER } diff --git a/tcp.c b/tcp.c index 9230d80..b936fce 100644 --- a/tcp.c +++ b/tcp.c @@ -3170,8 +3170,6 @@ static int tcp_port_rebind_outbound(void *arg) */ void tcp_timer(struct ctx *c, const struct timespec *now) { - union flow *flow; - (void)now; if (c->mode == MODE_PASTA) { @@ -3186,10 +3184,6 @@ void tcp_timer(struct ctx *c, const struct timespec *now) } } - for (flow = flowtab + c->flow_count - 1; flow >= flowtab; flow--) - if (flow->f.type == FLOW_TCP_SPLICE) - tcp_splice_timer(c, flow); - tcp_sock_refill_init(c); if (c->mode == MODE_PASTA) tcp_splice_refill(c); -- 2.43.0
As we already did for flow types, use an "EPOLL_NUM_TYPES" instead of EPOLL_TYPE_MAX, which is a little bit safer and clearer. Add a static assert on the size of the matching names array. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- passt.c | 4 +++- passt.h | 4 ++-- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/passt.c b/passt.c index 870064f..a37a2f4 100644 --- a/passt.c +++ b/passt.c @@ -59,7 +59,7 @@ char pkt_buf[PKT_BUF_BYTES] __attribute__ ((aligned(PAGE_SIZE))); -char *epoll_type_str[EPOLL_TYPE_MAX + 1] = { +char *epoll_type_str[] = { [EPOLL_TYPE_TCP] = "connected TCP socket", [EPOLL_TYPE_TCP_LISTEN] = "listening TCP socket", [EPOLL_TYPE_TCP_TIMER] = "TCP timer", @@ -71,6 +71,8 @@ char *epoll_type_str[EPOLL_TYPE_MAX + 1] = { [EPOLL_TYPE_TAP_PASST] = "connected qemu socket", [EPOLL_TYPE_TAP_LISTEN] = "listening qemu socket", }; +static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES, + "epoll_type_str[] doesn't match enum epoll_type"); /** * post_handler() - Run periodic and deferred tasks for L4 protocol handlers diff --git a/passt.h b/passt.h index c74887a..f54023a 100644 --- a/passt.h +++ b/passt.h @@ -70,7 +70,7 @@ enum epoll_type { /* socket listening for qemu socket connections */ EPOLL_TYPE_TAP_LISTEN, - EPOLL_TYPE_MAX = EPOLL_TYPE_TAP_LISTEN, + EPOLL_NUM_TYPES, }; /** @@ -115,7 +115,7 @@ extern char pkt_buf [PKT_BUF_BYTES]; extern char *epoll_type_str[]; #define EPOLL_TYPE_STR(n) \ - (((uint8_t)(n) < EPOLL_NUM_TYPES && epoll_type_str[(n)]) ? \ epoll_type_str[(n)] : "?") #include <resolv.h> /* For MAXNS below */ -- 2.43.0
Currently connected TCP sockets have the same epoll type, whether they're for a "tap" connection or a spliced connection. This means that tcp_sock_handler() has to do a secondary check on the type of the connection to call the right function. We can avoid this by adding a new epoll type and dispatching directly to the right thing. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- passt.c | 8 ++++++-- passt.h | 2 ++ tcp.c | 36 ++++++++---------------------------- tcp_splice.c | 16 +++++++++------- tcp_splice.h | 4 ++-- 5 files changed, 27 insertions(+), 39 deletions(-) diff --git a/passt.c b/passt.c index a37a2f4..71bea8f 100644 --- a/passt.c +++ b/passt.c @@ -50,6 +50,7 @@ #include "pasta.h" #include "arch.h" #include "log.h" +#include "tcp_splice.h" #define EPOLL_EVENTS 8 @@ -61,6 +62,7 @@ char pkt_buf[PKT_BUF_BYTES] __attribute__ ((aligned(PAGE_SIZE))); char *epoll_type_str[] = { [EPOLL_TYPE_TCP] = "connected TCP socket", + [EPOLL_TYPE_TCP_SPLICE] = "connected spliced TCP socket", [EPOLL_TYPE_TCP_LISTEN] = "listening TCP socket", [EPOLL_TYPE_TCP_TIMER] = "TCP timer", [EPOLL_TYPE_UDP] = "UDP socket", @@ -373,8 +375,10 @@ loop: pasta_netns_quit_handler(&c, quit_fd); break; case EPOLL_TYPE_TCP: - if (!c.no_tcp) - tcp_sock_handler(&c, ref, eventmask); + tcp_sock_handler(&c, ref, eventmask); + break; + case EPOLL_TYPE_TCP_SPLICE: + tcp_splice_sock_handler(&c, ref, eventmask); break; case EPOLL_TYPE_TCP_LISTEN: tcp_listen_handler(&c, ref, &now); diff --git a/passt.h b/passt.h index f54023a..82b0fcf 100644 --- a/passt.h +++ b/passt.h @@ -51,6 +51,8 @@ enum epoll_type { EPOLL_TYPE_NONE = 0, /* Connected TCP sockets */ EPOLL_TYPE_TCP, + /* Connected TCP sockets (spliced) */ + EPOLL_TYPE_TCP_SPLICE, /* Listening TCP sockets */ EPOLL_TYPE_TCP_LISTEN, /* timerfds used for TCP timers */ diff --git a/tcp.c b/tcp.c index b936fce..f28628a 100644 --- a/tcp.c +++ b/tcp.c @@ -2803,14 +2803,18 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref) } /** - * 
tcp_tap_sock_handler() - Handle new data from non-spliced socket + * tcp_sock_handler() - Handle new data from non-spliced socket * @c: Execution context - * @conn: Connection state + * @ref: epoll reference * @events: epoll events bitmap */ -static void tcp_tap_sock_handler(struct ctx *c, struct tcp_tap_conn *conn, - uint32_t events) +void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events) { + struct tcp_tap_conn *conn = CONN(ref.flowside.flow); + + ASSERT(conn->f.type == FLOW_TCP); + ASSERT(ref.flowside.side == SOCKSIDE); + if (conn->events == CLOSED) return; @@ -2857,30 +2861,6 @@ static void tcp_tap_sock_handler(struct ctx *c, struct tcp_tap_conn *conn, } } -/** - * tcp_sock_handler() - Handle new data from socket, or timerfd event - * @c: Execution context - * @ref: epoll reference - * @events: epoll events bitmap - */ -void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events) -{ - union flow *flow = FLOW(ref.flowside.flow); - - switch (flow->f.type) { - case FLOW_TCP: - tcp_tap_sock_handler(c, &flow->tcp, events); - break; - case FLOW_TCP_SPLICE: - tcp_splice_sock_handler(c, &flow->tcp_splice, ref.flowside.side, - events); - break; - default: - die("Unexpected %s in tcp_sock_handler_compact()", - FLOW_TYPE(&flow->f)); - } -} - /** * tcp_sock_init_af() - Initialise listening socket for a given af and port * @c: Execution context diff --git a/tcp_splice.c b/tcp_splice.c index 09aa20f..9b1d331 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -127,9 +127,9 @@ static int tcp_splice_epoll_ctl(const struct ctx *c, { int m = conn->in_epoll ? 
EPOLL_CTL_MOD : EPOLL_CTL_ADD; union epoll_ref ref[SIDES] = { - { .type = EPOLL_TYPE_TCP, .fd = conn->s[0], + { .type = EPOLL_TYPE_TCP_SPLICE, .fd = conn->s[0], .flowside = FLOW_SIDX(conn, 0) }, - { .type = EPOLL_TYPE_TCP, .fd = conn->s[1], + { .type = EPOLL_TYPE_TCP_SPLICE, .fd = conn->s[1], .flowside = FLOW_SIDX(conn, 1) } }; struct epoll_event ev[SIDES] = { { .data.u64 = ref[0].u64 }, @@ -484,18 +484,20 @@ bool tcp_splice_conn_from_sock(const struct ctx *c, /** * tcp_splice_sock_handler() - Handler for socket mapped to spliced connection * @c: Execution context - * @conn: Connection state - * @side: Side of the connection on which an event has occurred + * @ref: epoll reference * @events: epoll events bitmap * * #syscalls:pasta splice */ -void tcp_splice_sock_handler(struct ctx *c, struct tcp_splice_conn *conn, - int side, uint32_t events) +void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref, + uint32_t events) { + struct tcp_splice_conn *conn = CONN(ref.flowside.flow); + unsigned side = ref.flowside.side, fromside; uint8_t lowat_set_flag, lowat_act_flag; int eof, never_read; - unsigned fromside; + + ASSERT(conn->f.type == FLOW_TCP_SPLICE); if (conn->events == SPLICE_CLOSED) return; diff --git a/tcp_splice.h b/tcp_splice.h index aa85c7c..18193e4 100644 --- a/tcp_splice.h +++ b/tcp_splice.h @@ -8,8 +8,8 @@ struct tcp_splice_conn; -void tcp_splice_sock_handler(struct ctx *c, struct tcp_splice_conn *conn, - int side, uint32_t events); +void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref, + uint32_t events); bool tcp_splice_conn_from_sock(const struct ctx *c, union tcp_listen_epoll_ref ref, struct tcp_splice_conn *conn, int s, -- 2.43.0
flow_log_() is a very basic widely used function that many other functions in flow.c will end up needing. At present it's below flow_table_compact() which happens not to need it, but that's likely to change. Move it to near the top of flow.c to avoid forward declarations. Code motion only, no changes. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 36 ++++++++++++++++++------------------ 1 file changed, 18 insertions(+), 18 deletions(-) diff --git a/flow.c b/flow.c index ef129db..79c6ae6 100644 --- a/flow.c +++ b/flow.c @@ -30,6 +30,24 @@ union flow flowtab[FLOW_MAX]; /* Last time the flow timers ran */ static struct timespec flow_timer_run; +/** flow_log_ - Log flow-related message + * @f: flow the message is related to + * @pri: Log priority + * @fmt: Format string + * @...: printf-arguments + */ +void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) +{ + char msg[BUFSIZ]; + va_list args; + + va_start(args, fmt); + (void)vsnprintf(msg, sizeof(msg), fmt, args); + va_end(args); + + logmsg(pri, "Flow %u (%s): %s", flow_idx(f), FLOW_TYPE(f), msg); +} + /** * flow_table_compact() - Perform compaction on flow table * @c: Execution context @@ -69,24 +87,6 @@ void flow_table_compact(struct ctx *c, union flow *hole) memset(from, 0, sizeof(*from)); } -/** flow_log_ - Log flow-related message - * @f: flow the message is related to - * @pri: Log priority - * @fmt: Format string - * @...: printf-arguments - */ -void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) -{ - char msg[BUFSIZ]; - va_list args; - - va_start(args, fmt); - (void)vsnprintf(msg, sizeof(msg), fmt, args); - va_end(args); - - logmsg(pri, "Flow %u (%s): %s", flow_idx(f), FLOW_TYPE(f), msg); -} - /** * flow_defer_handler() - Handler for per-flow deferred and timed tasks * @c: Execution context -- 2.43.0
In general, the passt code is a bit haphazard about what's a true global variable and what's in the quasi-global 'context structure'. The flow_count field is one such example: it's in the context structure, although it's really part of the same data structure as flowtab[], which is a genuine global. Move flow_count to be a regular global to match. For now it needs to be public, rather than static, but we expect to be able to change that in future. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 11 ++++++----- flow.h | 4 ++-- flow_table.h | 1 + passt.h | 3 --- tcp.c | 10 +++++----- tcp_conn.h | 4 ++-- tcp_splice.c | 2 +- 7 files changed, 17 insertions(+), 18 deletions(-) diff --git a/flow.c b/flow.c index 79c6ae6..294412a 100644 --- a/flow.c +++ b/flow.c @@ -25,6 +25,7 @@ static_assert(ARRAY_SIZE(flow_type_str) == FLOW_NUM_TYPES, "flow_type_str[] doesn't match enum flow_type"); /* Global Flow Table */ +unsigned flow_count; union flow flowtab[FLOW_MAX]; /* Last time the flow timers ran */ @@ -53,18 +54,18 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) 
* @c: Execution context * @hole: Pointer to recently closed flow */ -void flow_table_compact(struct ctx *c, union flow *hole) +void flow_table_compact(const struct ctx *c, union flow *hole) { union flow *from; - if (FLOW_IDX(hole) == --c->flow_count) { + if (FLOW_IDX(hole) == --flow_count) { debug("flow: table compaction: maximum index was %u (%p)", FLOW_IDX(hole), (void *)hole); memset(hole, 0, sizeof(*hole)); return; } - from = flowtab + c->flow_count; + from = flowtab + flow_count; memcpy(hole, from, sizeof(*hole)); switch (from->f.type) { @@ -92,7 +93,7 @@ void flow_table_compact(struct ctx *c, union flow *hole) * @c: Execution context * @now: Current timestamp */ -void flow_defer_handler(struct ctx *c, const struct timespec *now) +void flow_defer_handler(const struct ctx *c, const struct timespec *now) { bool timer = false; union flow *flow; @@ -102,7 +103,7 @@ void flow_defer_handler(struct ctx *c, const struct timespec *now) flow_timer_run = *now; } - for (flow = flowtab + c->flow_count - 1; flow >= flowtab; flow--) { + for (flow = flowtab + flow_count - 1; flow >= flowtab; flow--) { switch (flow->f.type) { case FLOW_TCP: tcp_flow_defer(c, flow); diff --git a/flow.h b/flow.h index 423e685..44058bf 100644 --- a/flow.h +++ b/flow.h @@ -68,8 +68,8 @@ static inline bool flow_sidx_eq(flow_sidx_t a, flow_sidx_t b) union flow; -void flow_table_compact(struct ctx *c, union flow *hole); -void flow_defer_handler(struct ctx *c, const struct timespec *now); +void flow_table_compact(const struct ctx *c, union flow *hole); +void flow_defer_handler(const struct ctx *c, const struct timespec *now); void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) 
__attribute__((format(printf, 3, 4))); diff --git a/flow_table.h b/flow_table.h index e805f10..4aa2398 100644 --- a/flow_table.h +++ b/flow_table.h @@ -22,6 +22,7 @@ union flow { }; /* Global Flow Table */ +extern unsigned flow_count; extern union flow flowtab[]; diff --git a/passt.h b/passt.h index 82b0fcf..a9e8f15 100644 --- a/passt.h +++ b/passt.h @@ -224,7 +224,6 @@ struct ip6_ctx { * @pasta_conf_ns: Configure namespace after creating it * @no_copy_routes: Don't copy all routes when configuring target namespace * @no_copy_addrs: Don't copy all addresses when configuring namespace - * @flow_count: Number of tracked packet flows (connections etc.) * @no_tcp: Disable TCP operation * @tcp: Context for TCP protocol handler * @no_tcp: Disable UDP operation @@ -284,8 +283,6 @@ struct ctx { int no_copy_routes; int no_copy_addrs; - unsigned flow_count; - int no_tcp; struct tcp_ctx tcp; int no_udp; diff --git a/tcp.c b/tcp.c index f28628a..6aefd0b 100644 --- a/tcp.c +++ b/tcp.c @@ -1306,7 +1306,7 @@ static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c, * @c: Execution context * @flow: Flow table entry for this connection */ -void tcp_flow_defer(struct ctx *c, union flow *flow) +void tcp_flow_defer(const struct ctx *c, union flow *flow) { const struct tcp_tap_conn *conn = &flow->tcp; @@ -1948,7 +1948,7 @@ static void tcp_conn_from_tap(struct ctx *c, (void)saddr; - if (c->flow_count >= FLOW_MAX) + if (flow_count >= FLOW_MAX) return; if ((s = tcp_conn_pool_sock(pool)) < 0) @@ -1974,7 +1974,7 @@ static void tcp_conn_from_tap(struct ctx *c, } } - conn = CONN(c->flow_count++); + conn = CONN(flow_count++); conn->f.type = FLOW_TCP; conn->sock = s; conn->timer = -1; @@ -2723,14 +2723,14 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref, union flow *flow; int s; - if (c->no_tcp || c->flow_count >= FLOW_MAX) + if (c->no_tcp || flow_count >= FLOW_MAX) return; s = accept4(ref.fd, (struct sockaddr *)&sa, &sl, SOCK_NONBLOCK); if (s < 0) return; - flow = flowtab + 
c->flow_count++; + flow = flowtab + flow_count++; if (c->mode == MODE_PASTA && tcp_splice_conn_from_sock(c, ref.tcp_listen, &flow->tcp_splice, diff --git a/tcp_conn.h b/tcp_conn.h index 72b9ecb..825155a 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -158,8 +158,8 @@ extern int init_sock_pool6 [TCP_SOCK_POOL_SIZE]; void tcp_tap_conn_update(const struct ctx *c, struct tcp_tap_conn *old, struct tcp_tap_conn *new); void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new); -void tcp_flow_defer(struct ctx *c, union flow *flow); -void tcp_splice_flow_defer(struct ctx *c, union flow *flow); +void tcp_flow_defer(const struct ctx *c, union flow *flow); +void tcp_splice_flow_defer(const struct ctx *c, union flow *flow); void tcp_splice_timer(const struct ctx *c, union flow *flow); int tcp_conn_pool_sock(int pool[]); int tcp_conn_new_sock(const struct ctx *c, sa_family_t af); diff --git a/tcp_splice.c b/tcp_splice.c index 9b1d331..18cd2e6 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -247,7 +247,7 @@ void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new) * @c: Execution context * @flow: Flow table entry for this connection */ -void tcp_splice_flow_defer(struct ctx *c, union flow *flow) +void tcp_splice_flow_defer(const struct ctx *c, union flow *flow) { struct tcp_splice_conn *conn = &flow->tcp_splice; unsigned side; -- 2.43.0
On Thu, 21 Dec 2023 17:15:46 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:In general, the passt code is a bit haphazard about what's a true global variable and what's in the quasi-global 'context structure'. The flow_count field is one such example: it's in the context structure, although it's really part of the same data structure as flowtab[], which is a genuine global.Well, the reason is that flow_tab[FLOW_MAX] might be problematically too big to live on the stack, unlike flow_count. But anyway, as far as thoughts of multithreading are concerned, both should probably be global. And sure, it's more consistent this way.Move flow_count to be a regular global to match. For now it needs to be public, rather than static, but we expect to be able to change that in future.If it's not static, it should be initialised, and that's not done here. This becomes 'flow_first_free' in 13/13, but it's not initialised either, and that should also start off as zero. -- Stefano
On Thu, Dec 28, 2023 at 07:25:18PM +0100, Stefano Brivio wrote:On Thu, 21 Dec 2023 17:15:46 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Uh... what? "static" here is meaning module-global rather than global-global, which has no bearing on initialisation. AFAIK globals are zero-initialised whether they're static or not.In general, the passt code is a bit haphazard about what's a true global variable and what's in the quasi-global 'context structure'. The flow_count field is one such example: it's in the context structure, although it's really part of the same data structure as flowtab[], which is a genuine global.Well, the reason is that flow_tab[FLOW_MAX] might be problematically too big to live on the stack, unlike flow_count. But anyway, as far as thoughts of multithreading are concerned, both should probably be global. And sure, it's more consistent this way.Move flow_count to be a regular global to match. For now it needs to be public, rather than static, but we expect to be able to change that in future.If it's not static, it should be initialised, and that's not done here.This becomes 'flow_first_free' in 13/13, but it's not initialised either, and that should also start off as zero.-- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
On Sun, 31 Dec 2023 16:58:39 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:On Thu, Dec 28, 2023 at 07:25:18PM +0100, Stefano Brivio wrote:...and to my utter surprise, I just discovered that if you talk C11, you're right. From the N1570 draft (ISO/IEC 9899:201x), Section 6.7.9 "Initialization", clause 10: If an object that has automatic storage duration is not initialized explicitly, its value is indeterminate. If an object that has static or thread storage duration is not initialized explicitly, then: [...] — if it has arithmetic type, it is initialized to (positive or unsigned) zero; And 'flow_count' has thread storage duration. In C99, however (draft N1256), Section 6.7.8 "Initialization", clause 10: If an object that has automatic storage duration is not initialized explicitly, its value is indeterminate. If an object that has static storage duration is not initialized explicitly, then: [...] note the missing "or thread storage duration". C89, the one I was actually basing my observation on, says, at 3.5.7 "Initialization": If an object that has static storage duration is not initialized explicitly, it is initialized implicitly as if every member that has arithmetic type were assigned 0 and every member that has pointer type were assigned a null pointer constant. If an object that has automatic storage duration is not initialized explicitly, its value is indeterminate. so... um. We won't go back to C99. But to me, and maybe others, not having a "= 0;" for a "global" means pretty much that we don't rely on any particular initial value. If you're strongly against it, fine, C11 says it's zero... but it makes me a bit nervous nevertheless. -- StefanoOn Thu, 21 Dec 2023 17:15:46 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Uh... what? "static" here is meaning module-global rather than global-global, which has no bearing on initialisation. 
AFAIK globals are zero-initialised whether they're static or not.In general, the passt code is a bit haphazard about what's a true global variable and what's in the quasi-global 'context structure'. The flow_count field is one such example: it's in the context structure, although it's really part of the same data structure as flowtab[], which is a genuine global.Well, the reason is that flow_tab[FLOW_MAX] might be problematically too big to live on the stack, unlike flow_count. But anyway, as far as thoughts of multithreading are concerned, both should probably be global. And sure, it's more consistent this way.Move flow_count to be a regular global to match. For now it needs to be public, rather than static, but we expect to be able to change that in future.If it's not static, it should be initialised, and that's not done here.
On Tue, Jan 02, 2024 at 07:13:35PM +0100, Stefano Brivio wrote:On Sun, 31 Dec 2023 16:58:39 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:No.. I don't think it does. AFAICT only thread-local variables have thread storage duration. As a global flow_count will have static storage duration, even without the static keyword.On Thu, Dec 28, 2023 at 07:25:18PM +0100, Stefano Brivio wrote:...and to my utter surprise, I just discovered that if you talk C11, you're right. From the N1570 draft (ISO/IEC 9899:201x), Section 6.7.9 "Initialization", clause 10: If an object that has automatic storage duration is not initialized explicitly, its value is indeterminate. If an object that has static or thread storage duration is not initialized explicitly, then: [...] — if it has arithmetic type, it is initialized to (positive or unsigned) zero; And 'flow_count' has thread storage duration.On Thu, 21 Dec 2023 17:15:46 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Uh... what? "static" here is meaning module-global rather than global-global, which has no bearing on initialisation. AFAIK globals are zero-initialised whether they're static or not.In general, the passt code is a bit haphazard about what's a true global variable and what's in the quasi-global 'context structure'. The flow_count field is one such example: it's in the context structure, although it's really part of the same data structure as flowtab[], which is a genuine global.Well, the reason is that flow_tab[FLOW_MAX] might be problematically too big to live on the stack, unlike flow_count. But anyway, as far as thoughts of multithreading are concerned, both should probably be global. And sure, it's more consistent this way.Move flow_count to be a regular global to match. 
For now it needs to be public, rather than static, but we expect to be able to change that in future.If it's not static, it should be initialised, and that's not done here.In C99, however (draft N1256), Section 6.7.8 "Initialization", clause 10: If an object that has automatic storage duration is not initialized explicitly, its value is indeterminate. If an object that has static storage duration is not initialized explicitly, then: [...] note the missing "or thread storage duration". C89, the one I was actually basing my observation on, says, at 3.5.7 "Initialization": If an object that has static storage duration is not initialized explicitly, it is initialized implicitly as if every member that has arithmetic type were assigned 0 and every member that has pointer type were assigned a null pointer constant. If an object that has automatic storage duration is not initialized explicitly, its value is indeterminate. so... um. We won't go back to C99. But to me, and maybe others, not having a "= 0;" for a "global" means pretty much that we don't rely on any particular initial value.Again, I'm pretty sure that's not true, even for C99 and C89. AIUI, 'static' locals and *all* globals have "static storage diration". I'm not sure where to get the actual text of the standards but see for example https://en.cppreference.com/w/c/language/static_storage_duration Here 'flow_count' has external linkage, thus satisfying the conditions for static storage duration. Fwiw, I'm pretty sure the kernel has relied on zero-initialization of non-static globals for many years.If you're strongly against it, fine, C11 says it's zero... but it makes me a bit nervous nevertheless.-- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
On Wed, 3 Jan 2024 14:54:27 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:I'm not sure where to get the actual text of the standardsLet me answer this first: one (the?) trick is to use so-called final drafts, which are made freely available (same as working drafts) by the Working Group. Those are not the same as the standards, but differences from the final draft are also published... and they are usually not substantial. This is _very_ informative: https://stackoverflow.com/questions/81656/where-do-i-find-the-current-c-or-… Wikipedia also has the links, by the way. Anyway, in practice: - C11: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf - C99: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf - C89: https://web.archive.org/web/20200909074736if_/https://www.pdf-archive.com/2…On Tue, Jan 02, 2024 at 07:13:35PM +0100, Stefano Brivio wrote:So, C11 defines static storage duration here: 6.2.4 Storage durations of objects [...] 3 An object whose identifier is declared without the storage-class specifier _Thread_local, and either with external or internal linkage or with the storage-class specifier static, has static storage duration. Its lifetime is the entire execution of the program and its stored value is initialized only once, prior to program startup. do we have any linkage here? I would have said no -- but, going back to C99 for this, "6.2.2 Linkages of identifiers": 5 [...] If the declaration of an identifier for an object has file scope and no storage-class specifier, its linkage is external. which supports your paragraph below. By the way, C11 now says: 6.11.2 Linkages of identifiers 1 Declaring an identifier with internal linkage at file scope without the static storage-class specifier is an obsolescent featureOn Sun, 31 Dec 2023 16:58:39 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:No.. I don't think it does. AFAICT only thread-local variables have thread storage duration. 
As a global flow_count will have static storage duration, even without the static keyword.On Thu, Dec 28, 2023 at 07:25:18PM +0100, Stefano Brivio wrote:...and to my utter surprise, I just discovered that if you talk C11, you're right. From the N1570 draft (ISO/IEC 9899:201x), Section 6.7.9 "Initialization", clause 10: If an object that has automatic storage duration is not initialized explicitly, its value is indeterminate. If an object that has static or thread storage duration is not initialized explicitly, then: [...] — if it has arithmetic type, it is initialized to (positive or unsigned) zero; And 'flow_count' has thread storage duration.On Thu, 21 Dec 2023 17:15:46 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote: > In general, the passt code is a bit haphazard about what's a true global > variable and what's in the quasi-global 'context structure'. The > flow_count field is one such example: it's in the context structure, > although it's really part of the same data structure as flowtab[], which > is a genuine global. Well, the reason is that flow_tab[FLOW_MAX] might be problematically too big to live on the stack, unlike flow_count. But anyway, as far as thoughts of multithreading are concerned, both should probably be global. And sure, it's more consistent this way. > Move flow_count to be a regular global to match. For now it needs to be > public, rather than static, but we expect to be able to change that in > future. If it's not static, it should be initialised, and that's not done here.Uh... what? "static" here is meaning module-global rather than global-global, which has no bearing on initialisation. AFAIK globals are zero-initialised whether they're static or not.Right. Well, for C99 and C11 at least. For C89 things are slightly different: 6.1.2.4 Storage durations of objects [...] An object whose identifier is declared with external or internal linkage. or with the storage-class specifier static has static storage duration. [...] 
An object whose identifier is declared with no linkage and without the storage-class specifier static has automatic storage duration.

You might say it has external linkage. But it was not *declared with* external linkage -- it just happens to have it (C89 and C99 don't differ here).

In C99, however (draft N1256), Section 6.7.8 "Initialization", clause 10:

  If an object that has automatic storage duration is not initialized explicitly, its value is indeterminate. If an object that has static storage duration is not initialized explicitly, then: [...]

note the missing "or thread storage duration". C89, the one I was actually basing my observation on, says, at 3.5.7 "Initialization":

  If an object that has static storage duration is not initialized explicitly, it is initialized implicitly as if every member that has arithmetic type were assigned 0 and every member that has pointer type were assigned a null pointer constant. If an object that has automatic storage duration is not initialized explicitly, its value is indeterminate.

so... um. We won't go back to C99. But to me, and maybe others, not having a "= 0;" for a "global" means pretty much that we don't rely on any particular initial value.

Again, I'm pretty sure that's not true, even for C99 and C89. AIUI, 'static' locals and *all* globals have "static storage duration". I'm not sure where to get the actual text of the standards but see for example https://en.cppreference.com/w/c/language/static_storage_duration

Here 'flow_count' has external linkage, thus satisfying the conditions for static storage duration.

Fwiw, I'm pretty sure the kernel has relied on zero-initialization of non-static globals for many years.

True, and the opposite is even considered as a style issue since 2007, commit f0a594c1c74f ("update checkpatch.pl to version 0.08"). I also found a discussion similar to this one: https://lore.kernel.org/all/20201102184147.GA42288@localhost/#r

Anyway...
a couple of years before that, it must have been a gcc version in the late 2.x, I actually hit an issue with it. Was it a compiler issue, or the correct interpretation of C89? Or maybe something on the lines of: https://www.thegoodpenguin.co.uk/blog/u-boot-relocation-bss-hang/ ? I'll never know/remember. And then, after reading the standard, I started obsessively adding those = 0. Well, good to know I can stop doing it now. :) -- Stefano
On Wed, Jan 03, 2024 at 08:08:34AM +0100, Stefano Brivio wrote:On Wed, 3 Jan 2024 14:54:27 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Ah, thanks.I'm not sure where to get the actual text of the standardsLet me answer this first: one (the?) trick is to use so-called final drafts, which are made freely available (same as working drafts) by the Working Group. Those are not the same as the standards, but differences from the final draft are also published... and they are usually not substantial. This is _very_ informative: https://stackoverflow.com/questions/81656/where-do-i-find-the-current-c-or-…Wikipedia also has the links, by the way. Anyway, in practice: - C11: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf - C99: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf - C89: https://web.archive.org/web/20200909074736if_/https://www.pdf-archive.com/2…Right.On Tue, Jan 02, 2024 at 07:13:35PM +0100, Stefano Brivio wrote:So, C11 defines static storage duration here: 6.2.4 Storage durations of objects [...] 3 An object whose identifier is declared without the storage-class specifier _Thread_local, and either with external or internal linkage or with the storage-class specifier static, has static storage duration. Its lifetime is the entire execution of the program and its stored value is initialized only once, prior to program startup. do we have any linkage here? I would have said no -- but, going back to C99 for this, "6.2.2 Linkages of identifiers": 5 [...] If the declaration of an identifier for an object has file scope and no storage-class specifier, its linkage is external. which supports your paragraph below.On Sun, 31 Dec 2023 16:58:39 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:No.. I don't think it does. AFAICT only thread-local variables have thread storage duration. 
As a global flow_count will have static storage duration, even without the static keyword.On Thu, Dec 28, 2023 at 07:25:18PM +0100, Stefano Brivio wrote: > On Thu, 21 Dec 2023 17:15:46 +1100 > David Gibson <david(a)gibson.dropbear.id.au> wrote: > > > In general, the passt code is a bit haphazard about what's a true global > > variable and what's in the quasi-global 'context structure'. The > > flow_count field is one such example: it's in the context structure, > > although it's really part of the same data structure as flowtab[], which > > is a genuine global. > > Well, the reason is that flow_tab[FLOW_MAX] might be problematically > too big to live on the stack, unlike flow_count. > > But anyway, as far as thoughts of multithreading are concerned, both > should probably be global. And sure, it's more consistent this way. > > > Move flow_count to be a regular global to match. For now it needs to be > > public, rather than static, but we expect to be able to change that in > > future. > > If it's not static, it should be initialised, and that's not done here. Uh... what? "static" here is meaning module-global rather than global-global, which has no bearing on initialisation. AFAIK globals are zero-initialised whether they're static or not....and to my utter surprise, I just discovered that if you talk C11, you're right. From the N1570 draft (ISO/IEC 9899:201x), Section 6.7.9 "Initialization", clause 10: If an object that has automatic storage duration is not initialized explicitly, its value is indeterminate. If an object that has static or thread storage duration is not initialized explicitly, then: [...] — if it has arithmetic type, it is initialized to (positive or unsigned) zero; And 'flow_count' has thread storage duration.By the way, C11 now says: 6.11.2 Linkages of identifiers 1 Declaring an identifier with internal linkage at file scope without the static storage-class specifier is an obsolescent featureOk. I'm not even sure how you would do that.Hrm. 
We do have:

  extern unsigned flow_first_free;

in flow_table.h. Does that count as declaring with external linkage?

Right. Well, for C99 and C11 at least. For C89 things are slightly different:

  6.1.2.4 Storage durations of objects
  [...]
  An object whose identifier is declared with external or internal linkage. or with the storage-class specifier static has static storage duration.
  [...]
  An object whose identifier is declared with no linkage and without the storage-class specifier static has automatic storage duration.

You might say it has external linkage. But it was not *declared with* external linkage -- it just happens to have it (C89 and C99 don't differ here).

In C99, however (draft N1256), Section 6.7.8 "Initialization", clause 10:

  If an object that has automatic storage duration is not initialized explicitly, its value is indeterminate. If an object that has static storage duration is not initialized explicitly, then: [...]

note the missing "or thread storage duration". C89, the one I was actually basing my observation on, says, at 3.5.7 "Initialization":

  If an object that has static storage duration is not initialized explicitly, it is initialized implicitly as if every member that has arithmetic type were assigned 0 and every member that has pointer type were assigned a null pointer constant. If an object that has automatic storage duration is not initialized explicitly, its value is indeterminate.

so... um. We won't go back to C99. But to me, and maybe others, not having a "= 0;" for a "global" means pretty much that we don't rely on any particular initial value.

Again, I'm pretty sure that's not true, even for C99 and C89. AIUI, 'static' locals and *all* globals have "static storage duration".
I'm not sure where to get the actual text of the standards but see for example https://en.cppreference.com/w/c/language/static_storage_duration Here 'flow_count' has external linkage, thus satisfying the conditions for static storage duration.If it was an embedded setup, that last one is certainly possible. Zeroing the BSS is typically the loader's job, and I've certainly seen loader implementations - particularly in embedded firmware - that got this wrong.Fwiw, I'm pretty sure the kernel has relied on zero-initialization of non-static globals for many years.True, and the opposite is even considered as a style issue since 2007, commit f0a594c1c74f ("update checkpatch.pl to version 0.08"). I also found a discussion similar to this one: https://lore.kernel.org/all/20201102184147.GA42288@localhost/#r Anyway... a couple of years before that, it must have been a gcc version in the late 2.x, I actually hit an issue with it. Was it a compiler issue, or the correct interpretation of C89? Or maybe something on the lines of: https://www.thegoodpenguin.co.uk/blog/u-boot-relocation-bss-hang/? I'll never know/remember. And then, after reading the standard, I started obsessively adding those = 0. Well, good to know I can stop doing it now. :)-- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
On Thu, 4 Jan 2024 20:51:19 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:On Wed, Jan 03, 2024 at 08:08:34AM +0100, Stefano Brivio wrote:By doing what I *thought* you were doing (see below): "int x" at file scope (outside functions), no static, no extern declaration, nothing.On Wed, 3 Jan 2024 14:54:27 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Ah, thanks.I'm not sure where to get the actual text of the standardsLet me answer this first: one (the?) trick is to use so-called final drafts, which are made freely available (same as working drafts) by the Working Group. Those are not the same as the standards, but differences from the final draft are also published... and they are usually not substantial. This is _very_ informative: https://stackoverflow.com/questions/81656/where-do-i-find-the-current-c-or-…Wikipedia also has the links, by the way. Anyway, in practice: - C11: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf - C99: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf - C89: https://web.archive.org/web/20200909074736if_/https://www.pdf-archive.com/2…Right.On Tue, Jan 02, 2024 at 07:13:35PM +0100, Stefano Brivio wrote:So, C11 defines static storage duration here: 6.2.4 Storage durations of objects [...] 3 An object whose identifier is declared without the storage-class specifier _Thread_local, and either with external or internal linkage or with the storage-class specifier static, has static storage duration. Its lifetime is the entire execution of the program and its stored value is initialized only once, prior to program startup. do we have any linkage here? I would have said no -- but, going back to C99 for this, "6.2.2 Linkages of identifiers": 5 [...] If the declaration of an identifier for an object has file scope and no storage-class specifier, its linkage is external. 
which supports your paragraph below.On Sun, 31 Dec 2023 16:58:39 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote: > On Thu, Dec 28, 2023 at 07:25:18PM +0100, Stefano Brivio wrote: > > On Thu, 21 Dec 2023 17:15:46 +1100 > > David Gibson <david(a)gibson.dropbear.id.au> wrote: > > > > > In general, the passt code is a bit haphazard about what's a true global > > > variable and what's in the quasi-global 'context structure'. The > > > flow_count field is one such example: it's in the context structure, > > > although it's really part of the same data structure as flowtab[], which > > > is a genuine global. > > > > Well, the reason is that flow_tab[FLOW_MAX] might be problematically > > too big to live on the stack, unlike flow_count. > > > > But anyway, as far as thoughts of multithreading are concerned, both > > should probably be global. And sure, it's more consistent this way. > > > > > Move flow_count to be a regular global to match. For now it needs to be > > > public, rather than static, but we expect to be able to change that in > > > future. > > > > If it's not static, it should be initialised, and that's not done here. > > Uh... what? "static" here is meaning module-global rather than > global-global, which has no bearing on initialisation. AFAIK globals > are zero-initialised whether they're static or not. ...and to my utter surprise, I just discovered that if you talk C11, you're right. From the N1570 draft (ISO/IEC 9899:201x), Section 6.7.9 "Initialization", clause 10: If an object that has automatic storage duration is not initialized explicitly, its value is indeterminate. If an object that has static or thread storage duration is not initialized explicitly, then: [...] — if it has arithmetic type, it is initialized to (positive or unsigned) zero; And 'flow_count' has thread storage duration.No.. I don't think it does. AFAICT only thread-local variables have thread storage duration. 
As a global flow_count will have static storage duration, even without the static keyword.

By the way, C11 now says:

  6.11.2 Linkages of identifiers
  1 Declaring an identifier with internal linkage at file scope without the static storage-class specifier is an obsolescent feature

Ok. I'm not even sure how you would do that.

Gosh, sorry... yes, it also counts as me completely missing it.

Hrm. We do have:

  extern unsigned flow_first_free;

in flow_table.h. Does that count as declaring with external linkage?

Right. Well, for C99 and C11 at least. For C89 things are slightly different:

  6.1.2.4 Storage durations of objects
  [...]
  An object whose identifier is declared with external or internal linkage. or with the storage-class specifier static has static storage duration.
  [...]
  An object whose identifier is declared with no linkage and without the storage-class specifier static has automatic storage duration.

You might say it has external linkage. But it was not *declared with* external linkage -- it just happens to have it (C89 and C99 don't differ here).

In C99, however (draft N1256), Section 6.7.8 "Initialization", clause 10:

  If an object that has automatic storage duration is not initialized explicitly, its value is indeterminate. If an object that has static storage duration is not initialized explicitly, then: [...]

note the missing "or thread storage duration". C89, the one I was actually basing my observation on, says, at 3.5.7 "Initialization":

  If an object that has static storage duration is not initialized explicitly, it is initialized implicitly as if every member that has arithmetic type were assigned 0 and every member that has pointer type were assigned a null pointer constant. If an object that has automatic storage duration is not initialized explicitly, its value is indeterminate.

so... um. We won't go back to C99.
But to me, and maybe others, not having a "= 0;" for a "global" means pretty much that we don't rely on any particular initial value.Again, I'm pretty sure that's not true, even for C99 and C89. AIUI, 'static' locals and *all* globals have "static storage diration". I'm not sure where to get the actual text of the standards but see for example https://en.cppreference.com/w/c/language/static_storage_duration Here 'flow_count' has external linkage, thus satisfying the conditions for static storage duration.Embedded setup, yes, but in a Linux kernel (early 2.4.x). No 'extern' there, though. -- StefanoIf it was an embedded setup, that last one is certainly possible. Zeroing the BSS is typically the loader's job, and I've certainly seen loader implementations - particularly in embedded firmware - that got this wrong.Fwiw, I'm pretty sure the kernel has relied on zero-initialization of non-static globals for many years.True, and the opposite is even considered as a style issue since 2007, commit f0a594c1c74f ("update checkpatch.pl to version 0.08"). I also found a discussion similar to this one: https://lore.kernel.org/all/20201102184147.GA42288@localhost/#r Anyway... a couple of years before that, it must have been a gcc version in the late 2.x, I actually hit an issue with it. Was it a compiler issue, or the correct interpretation of C89? Or maybe something on the lines of: https://www.thegoodpenguin.co.uk/blog/u-boot-relocation-bss-hang/
On Fri, Jan 05, 2024 at 08:55:13AM +0100, Stefano Brivio wrote:On Thu, 4 Jan 2024 20:51:19 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Oh, ok. Even without that, I think the behaviour has to be the same. Plain "int x" at file scope is a public variable, and a linked module could "extern" it, even if that extern isn't visible while we're compiling this module.On Wed, Jan 03, 2024 at 08:08:34AM +0100, Stefano Brivio wrote:By doing what I *thought* you were doing (see below): "int x" at file scope (outside functions), no static, no extern declaration, nothing.On Wed, 3 Jan 2024 14:54:27 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Ah, thanks.I'm not sure where to get the actual text of the standardsLet me answer this first: one (the?) trick is to use so-called final drafts, which are made freely available (same as working drafts) by the Working Group. Those are not the same as the standards, but differences from the final draft are also published... and they are usually not substantial. This is _very_ informative: https://stackoverflow.com/questions/81656/where-do-i-find-the-current-c-or-…Wikipedia also has the links, by the way. Anyway, in practice: - C11: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf - C99: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf - C89: https://web.archive.org/web/20200909074736if_/https://www.pdf-archive.com/2…Right.On Tue, Jan 02, 2024 at 07:13:35PM +0100, Stefano Brivio wrote: > On Sun, 31 Dec 2023 16:58:39 +1100 > David Gibson <david(a)gibson.dropbear.id.au> wrote: > > > On Thu, Dec 28, 2023 at 07:25:18PM +0100, Stefano Brivio wrote: > > > On Thu, 21 Dec 2023 17:15:46 +1100 > > > David Gibson <david(a)gibson.dropbear.id.au> wrote: > > > > > > > In general, the passt code is a bit haphazard about what's a true global > > > > variable and what's in the quasi-global 'context structure'. 
The > > > > flow_count field is one such example: it's in the context structure, > > > > although it's really part of the same data structure as flowtab[], which > > > > is a genuine global. > > > > > > Well, the reason is that flow_tab[FLOW_MAX] might be problematically > > > too big to live on the stack, unlike flow_count. > > > > > > But anyway, as far as thoughts of multithreading are concerned, both > > > should probably be global. And sure, it's more consistent this way. > > > > > > > Move flow_count to be a regular global to match. For now it needs to be > > > > public, rather than static, but we expect to be able to change that in > > > > future. > > > > > > If it's not static, it should be initialised, and that's not done here. > > > > Uh... what? "static" here is meaning module-global rather than > > global-global, which has no bearing on initialisation. AFAIK globals > > are zero-initialised whether they're static or not. > > ...and to my utter surprise, I just discovered that if you talk C11, > you're right. From the N1570 draft (ISO/IEC 9899:201x), Section 6.7.9 > "Initialization", clause 10: > > If an object that has automatic storage duration is not initialized > explicitly, its value is indeterminate. If an object that has static > or thread storage duration is not initialized explicitly, then: > > [...] > > — if it has arithmetic type, it is initialized to (positive or > unsigned) zero; > > And 'flow_count' has thread storage duration. No.. I don't think it does. AFAICT only thread-local variables have thread storage duration. As a global flow_count will have static storage duration, even without the static keyword.So, C11 defines static storage duration here: 6.2.4 Storage durations of objects [...] 3 An object whose identifier is declared without the storage-class specifier _Thread_local, and either with external or internal linkage or with the storage-class specifier static, has static storage duration. 
Its lifetime is the entire execution of the program and its stored value is initialized only once, prior to program startup.

do we have any linkage here? I would have said no -- but, going back to C99 for this, "6.2.2 Linkages of identifiers":

  5 [...] If the declaration of an identifier for an object has file scope and no storage-class specifier, its linkage is external.

which supports your paragraph below.

By the way, C11 now says:

  6.11.2 Linkages of identifiers
  1 Declaring an identifier with internal linkage at file scope without the static storage-class specifier is an obsolescent feature

Ok. I'm not even sure how you would do that.

Right. Depending on platform, the kernel sometimes has a loader for a later stage so does need to clear BSS. Or it might need to as a workaround for a broken firmware loader. I'm pretty sure I've also seen Linux bugs on some embedded platforms that involved failing to clear the BSS.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

Gosh, sorry... yes, it also counts as me completely missing it.

Hrm. We do have:

  extern unsigned flow_first_free;

in flow_table.h. Does that count as declaring with external linkage?

> In C99, however (draft N1256), Section 6.7.8 "Initialization", clause 10:
>
>   If an object that has automatic storage duration is not initialized explicitly, its value is indeterminate. If an object that has static storage duration is not initialized explicitly, then: [...]
>
> note the missing "or thread storage duration".
>
> C89, the one I was actually basing my observation on, says, at 3.5.7 "Initialization":
>
>   If an object that has static storage duration is not initialized explicitly, it is initialized implicitly as if every member that has arithmetic type were assigned 0 and every member that has pointer type were assigned a null pointer constant.
> If an object that has automatic storage duration is not initialized explicitly, its value is indeterminate.
>
> so... um. We won't go back to C99. But to me, and maybe others, not having a "= 0;" for a "global" means pretty much that we don't rely on any particular initial value.

Again, I'm pretty sure that's not true, even for C99 and C89. AIUI, 'static' locals and *all* globals have "static storage duration". I'm not sure where to get the actual text of the standards but see for example https://en.cppreference.com/w/c/language/static_storage_duration

Here 'flow_count' has external linkage, thus satisfying the conditions for static storage duration.

Right. Well, for C99 and C11 at least. For C89 things are slightly different:

  6.1.2.4 Storage durations of objects
  [...]
  An object whose identifier is declared with external or internal linkage. or with the storage-class specifier static has static storage duration.
  [...]
  An object whose identifier is declared with no linkage and without the storage-class specifier static has automatic storage duration.

You might say it has external linkage. But it was not *declared with* external linkage -- it just happens to have it (C89 and C99 don't differ here).

Embedded setup, yes, but in a Linux kernel (early 2.4.x). No 'extern' there, though.

If it was an embedded setup, that last one is certainly possible. Zeroing the BSS is typically the loader's job, and I've certainly seen loader implementations - particularly in embedded firmware - that got this wrong.

Fwiw, I'm pretty sure the kernel has relied on zero-initialization of non-static globals for many years.

True, and the opposite is even considered as a style issue since 2007, commit f0a594c1c74f ("update checkpatch.pl to version 0.08"). I also found a discussion similar to this one: https://lore.kernel.org/all/20201102184147.GA42288@localhost/#r

Anyway... a couple of years before that, it must have been a gcc version in the late 2.x, I actually hit an issue with it.
Was it a compiler issue, or the correct interpretation of C89? Or maybe something on the lines of: https://www.thegoodpenguin.co.uk/blog/u-boot-relocation-bss-hang/
Currently tcp.c open codes the process of allocating a new flow from the flow table: twice, in fact, once for guest to host and once for host to guest connections. This duplication isn't ideal and will get worse as we add more protocols to the flow table. It also makes it harder to experiment with different ways of handling flow table allocation.

Instead, introduce a function to allocate a new flow: flow_alloc(). In some cases we currently check if we're able to allocate, but delay the actual allocation. We now handle that slightly differently with a flow_alloc_cancel() function to back out a recent allocation. We have that separate from a flow_free() function, because future changes we have in mind will need to handle this case a little differently.

Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>
---
 flow.c       | 26 ++++++++++++++++++++++++++
 flow_table.h |  3 +++
 tcp.c        | 29 ++++++++++++++++++-----------
 3 files changed, 47 insertions(+), 11 deletions(-)

diff --git a/flow.c b/flow.c
index 294412a..8fbe512 100644
--- a/flow.c
+++ b/flow.c
@@ -49,6 +49,32 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
 	logmsg(pri, "Flow %u (%s): %s", flow_idx(f), FLOW_TYPE(f), msg);
 }
 
+/**
+ * flow_alloc() - Allocate a new flow
+ *
+ * Return: pointer to an unused flow entry, or NULL if the table is full
+ */
+union flow *flow_alloc(void)
+{
+	if (flow_count >= FLOW_MAX)
+		return NULL;
+
+	return &flowtab[flow_count++];
+}
+
+/**
+ * flow_alloc_cancel() - Free a newly allocated flow
+ * @flow:	Flow to deallocate
+ *
+ * @flow must be the last flow allocated by flow_alloc()
+ */
+void flow_alloc_cancel(union flow *flow)
+{
+	ASSERT(FLOW_IDX(flow) == flow_count - 1);
+	memset(flow, 0, sizeof(*flow));
+	flow_count--;
+}
+
 /**
  * flow_table_compact() - Perform compaction on flow table
  * @c:		Execution context
diff --git a/flow_table.h b/flow_table.h
index 4aa2398..2773a2b 100644
--- a/flow_table.h
+++ b/flow_table.h
@@ -88,4 +88,7 @@ static inline flow_sidx_t flow_sidx(const struct flow_common *f,
  */
 #define FLOW_SIDX(f_, side)	(flow_sidx(&(f_)->f, (side)))
 
+union flow *flow_alloc(void);
+void flow_alloc_cancel(union flow *flow);
+
 #endif /* FLOW_TABLE_H */
diff --git a/tcp.c b/tcp.c
index 6aefd0b..6477203 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1943,17 +1943,18 @@ static void tcp_conn_from_tap(struct ctx *c,
 	};
 	const struct sockaddr *sa;
 	struct tcp_tap_conn *conn;
+	union flow *flow;
 	socklen_t sl;
 	int s, mss;
 
 	(void)saddr;
 
-	if (flow_count >= FLOW_MAX)
+	if (!(flow = flow_alloc()))
 		return;
 
 	if ((s = tcp_conn_pool_sock(pool)) < 0)
 		if ((s = tcp_conn_new_sock(c, af)) < 0)
-			return;
+			goto cancel;
 
 	if (!c->no_map_gw) {
 		if (af == AF_INET && IN4_ARE_ADDR_EQUAL(daddr, &c->ip4.gw))
@@ -1968,13 +1969,11 @@
 			.sin6_addr = c->ip6.addr_ll,
 			.sin6_scope_id = c->ifi6,
 		};
-		if (bind(s, (struct sockaddr *)&addr6_ll, sizeof(addr6_ll))) {
-			close(s);
-			return;
-		}
+		if (bind(s, (struct sockaddr *)&addr6_ll, sizeof(addr6_ll)))
+			goto cancel;
 	}
 
-	conn = CONN(flow_count++);
+	conn = &flow->tcp;
 	conn->f.type = FLOW_TCP;
 	conn->sock = s;
 	conn->timer = -1;
@@ -2046,6 +2045,12 @@ static void tcp_conn_from_tap(struct ctx *c,
 	}
 
 	tcp_epoll_ctl(c, conn);
+	return;
+
+cancel:
+	if (s >= 0)
+		close(s);
+	flow_alloc_cancel(flow);
 }
 
 /**
@@ -2723,14 +2728,12 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
 	union flow *flow;
 	int s;
 
-	if (c->no_tcp || flow_count >= FLOW_MAX)
+	if (c->no_tcp || !(flow = flow_alloc()))
 		return;
 
 	s = accept4(ref.fd, (struct sockaddr *)&sa, &sl, SOCK_NONBLOCK);
 	if (s < 0)
-		return;
-
-	flow = flowtab + flow_count++;
+		goto cancel;
 
 	if (c->mode == MODE_PASTA &&
 	    tcp_splice_conn_from_sock(c, ref.tcp_listen, &flow->tcp_splice,
@@ -2739,6 +2742,10 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
 		tcp_tap_conn_from_sock(c, ref.tcp_listen, &flow->tcp,
 				       s, (struct sockaddr *)&sa, now);
+	return;
+
+cancel:
+	flow_alloc_cancel(flow);
 }
 
 /**
-- 
2.43.0
Currently, flows are only ever finally freed (and the table compacted)
from the deferred handlers.  Some future ways we want to optimise
managing the flow table will rely on this, so enforce it: rather than
having the TCP code directly call flow_table_compact(), add a boolean
return value to the per-flow deferred handlers.  If true, this
indicates that the flow code itself should free the flow.

This forces all freeing of flows to occur during the flow code's scan
of the table in flow_defer_handler() which opens possibilities for
future optimisations.

Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>
---
 flow.c       | 13 +++++++++----
 flow.h       |  1 -
 tcp.c        |  9 +++++----
 tcp_conn.h   |  4 ++--
 tcp_splice.c |  9 +++++----
 5 files changed, 21 insertions(+), 15 deletions(-)

diff --git a/flow.c b/flow.c
index 8fbe512..7b99b58 100644
--- a/flow.c
+++ b/flow.c
@@ -80,7 +80,7 @@ void flow_alloc_cancel(union flow *flow)
  * @c:		Execution context
  * @hole:	Pointer to recently closed flow
  */
-void flow_table_compact(const struct ctx *c, union flow *hole)
+static void flow_table_compact(const struct ctx *c, union flow *hole)
 {
 	union flow *from;
 
@@ -130,18 +130,23 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
 	}
 
 	for (flow = flowtab + flow_count - 1; flow >= flowtab; flow--) {
+		bool closed = false;
+
 		switch (flow->f.type) {
 		case FLOW_TCP:
-			tcp_flow_defer(c, flow);
+			closed = tcp_flow_defer(flow);
 			break;
 		case FLOW_TCP_SPLICE:
-			tcp_splice_flow_defer(c, flow);
-			if (timer)
+			closed = tcp_splice_flow_defer(flow);
+			if (!closed && timer)
 				tcp_splice_timer(c, flow);
 			break;
 		default:
 			/* Assume other flow types don't need any handling */
 			;
 		}
+
+		if (closed)
+			flow_table_compact(c, flow);
 	}
 }
diff --git a/flow.h b/flow.h
index 44058bf..8064f0e 100644
--- a/flow.h
+++ b/flow.h
@@ -68,7 +68,6 @@ static inline bool flow_sidx_eq(flow_sidx_t a, flow_sidx_t b)
 
 union flow;
 
-void flow_table_compact(const struct ctx *c, union flow *hole);
 void flow_defer_handler(const struct ctx *c, const struct timespec *now);
 
 void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
diff --git a/tcp.c b/tcp.c
index 6477203..a38deb0 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1303,21 +1303,22 @@ static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c,
 
 /**
  * tcp_flow_defer() - Deferred per-flow handling (clean up closed connections)
- * @c:		Execution context
  * @flow:	Flow table entry for this connection
+ *
+ * Return: true if the flow is ready to free, false otherwise
  */
-void tcp_flow_defer(const struct ctx *c, union flow *flow)
+bool tcp_flow_defer(union flow *flow)
 {
 	const struct tcp_tap_conn *conn = &flow->tcp;
 
 	if (flow->tcp.events != CLOSED)
-		return;
+		return false;
 
 	close(conn->sock);
 
 	if (conn->timer != -1)
 		close(conn->timer);
 
-	flow_table_compact(c, flow);
+	return true;
 }
 
 static void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn);
diff --git a/tcp_conn.h b/tcp_conn.h
index 825155a..636224e 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -158,8 +158,8 @@ extern int init_sock_pool6	[TCP_SOCK_POOL_SIZE];
 void tcp_tap_conn_update(const struct ctx *c, struct tcp_tap_conn *old,
 			 struct tcp_tap_conn *new);
 void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new);
-void tcp_flow_defer(const struct ctx *c, union flow *flow);
-void tcp_splice_flow_defer(const struct ctx *c, union flow *flow);
+bool tcp_flow_defer(union flow *flow);
+bool tcp_splice_flow_defer(union flow *flow);
 void tcp_splice_timer(const struct ctx *c, union flow *flow);
 int tcp_conn_pool_sock(int pool[]);
 int tcp_conn_new_sock(const struct ctx *c, sa_family_t af);
diff --git a/tcp_splice.c b/tcp_splice.c
index 18cd2e6..0711b19 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -244,16 +244,17 @@ void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new)
 
 /**
  * tcp_splice_flow_defer() - Deferred per-flow handling (clean up closed)
- * @c:		Execution context
  * @flow:	Flow table entry for this connection
+ *
+ * Return: true if the flow is ready to free, false otherwise
  */
-void tcp_splice_flow_defer(const struct ctx *c, union flow *flow)
+bool tcp_splice_flow_defer(union flow *flow)
 {
 	struct tcp_splice_conn *conn = &flow->tcp_splice;
 	unsigned side;
 
 	if (!(flow->tcp_splice.flags & CLOSING))
-		return;
+		return false;
 
 	for (side = 0; side < SIDES; side++) {
 		if (conn->events & SPLICE_ESTABLISHED) {
@@ -277,7 +278,7 @@ void tcp_splice_flow_defer(const struct ctx *c, union flow *flow)
 	conn->flags = 0;
 	flow_dbg(conn, "CLOSED");
 
-	flow_table_compact(c, flow);
+	return true;
 }
 
 /**
-- 
2.43.0
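The dispatch pattern this patch establishes — per-type handlers report "ready to free" and only the central scan actually frees — can be modelled with a small self-contained sketch (hypothetical `mini_*` names; the compaction stand-in mimics the pre-patch-13 behaviour of filling the hole with the last entry):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical miniature of the flow table and its deferred scan */
enum mini_type { MINI_NONE, MINI_TCP };

struct mini_flow {
	enum mini_type type;
	bool closed;
};

#define MINI_MAX 4
static struct mini_flow tab[MINI_MAX];
static unsigned count;

/* Stand-in for tcp_flow_defer(): true if the entry can be freed */
static bool mini_tcp_defer(const struct mini_flow *f)
{
	return f->closed;
}

/* Stand-in for flow_table_compact(): fill the hole with the last entry */
static void mini_compact(struct mini_flow *hole)
{
	*hole = tab[--count];
	tab[count].type = MINI_NONE;
	tab[count].closed = false;
}

/* Central scan, like flow_defer_handler(): freeing happens only here.
 * Scanning downwards means a moved entry comes from an index we have
 * already visited, so nothing is processed twice. */
static void mini_defer_handler(void)
{
	int i;

	for (i = (int)count - 1; i >= 0; i--) {
		bool freeable = false;

		switch (tab[i].type) {
		case MINI_TCP:
			freeable = mini_tcp_defer(&tab[i]);
			break;
		default:
			;	/* other types need no handling */
		}

		if (freeable)
			mini_compact(&tab[i]);
	}
}
```

Centralising the free here is what "opens possibilities for future optimisations": the next patch replaces the compaction step without touching the per-type handlers.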
Currently we always keep the flow table maximally compact: that is all
the active entries are contiguous at the start of the table.  Doing
this sometimes requires moving an entry when one is freed.  That's kind
of fiddly, and potentially expensive: it requires updating the hash
table for the new location, and depending on flow type, it may require
EPOLL_CTL_MOD, system calls to update epoll tags with the new location
too.

Implement a new way of managing the flow table that doesn't ever move
entries.  It attempts to maintain some compactness by always using the
first free slot for a new connection, and mitigates the effect of non
compactness by cheaply skipping over contiguous blocks of free entries.
See the "theory of operation" comment in flow.c for details.

Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>
---
 flow.c       | 177 +++++++++++++++++++++++++++++++++++++--------------
 flow.h       |   1 +
 flow_table.h |  16 ++++-
 passt.c      |   2 +
 tcp.c        |  23 -------
 tcp_conn.h   |   3 -
 tcp_splice.c |  11 ----
 7 files changed, 146 insertions(+), 87 deletions(-)

diff --git a/flow.c b/flow.c
index 7b99b58..421e6b5 100644
--- a/flow.c
+++ b/flow.c
@@ -25,7 +25,7 @@ static_assert(ARRAY_SIZE(flow_type_str) == FLOW_NUM_TYPES,
 	      "flow_type_str[] doesn't match enum flow_type");
 
 /* Global Flow Table */
-unsigned flow_count;
+unsigned flow_first_free;
 union flow flowtab[FLOW_MAX];
 
 /* Last time the flow timers ran */
@@ -49,6 +49,52 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
 	logmsg(pri, "Flow %u (%s): %s", flow_idx(f), FLOW_TYPE(f), msg);
 }
 
+/**
+ * DOC: Theory of Operation - allocation and freeing of flow entries
+ *
+ * Each flow takes a single slot in flowtab[].  Moving entries in that table
+ * (which we used to do) is fiddly and possibly expensive: it requires updating
+ * the hash table indexing flows, and may require updating epoll data which
+ * references the flow by index.  However, we also want to keep the active
+ * entries in the table compact where possible, because otherwise scanning
+ * through the entire table becomes more expensive.  This describes the
+ * compromise implemented below.
+ *
+ * Free blocks
+ *   A "free block" is a contiguous sequence of unused (FLOW_TYPE_NONE) entries
+ *   in flowtab.  The first entry in each block contains metadata, specifically
+ *   the number of entries in the block, and the index of the next (non
+ *   contiguous) free block (struct flow_free_block).
+ *
+ * Free block list
+ *   flow_first_free gives the index of the first entry of the first (lowest
+ *   index) free block.  Each free block has the index of the next free block,
+ *   or MAX_FLOW if it is the last free block.  Together these form a linked
+ *   list of free blocks, in strictly increasing order of index.
+ *
+ * Allocation
+ *   When allocating a new flow, we always use the first entry of the first
+ *   free block, that is, at index flow_first_free.  If the block has more than
+ *   one entry, flow_first_free is updated to the next entry, which is updated
+ *   to represent the new smaller free block.  Otherwise the free block is
+ *   eliminated and flow_first_free is updated to the next free block.
+ *
+ * Scanning the table
+ *   Theoretically, scanning the table requires FLOW_MAX iterations.  However,
+ *   when we encounter the start of a free block, we can immediately skip to
+ *   its end, meaning that in practice we only need (number of active
+ *   connections) + (number of free blocks) iterations.
+ *
+ * Freeing
+ *   We can only free entries when scanning the whole flow table in
+ *   flow_defer_handler().  This is what lets us maintain the fee block list in
+ *   index sorted order.  As we scan we keep track of whether the previous
+ *   entry was in a free block or not.  If so when an entry is freed (its
+ *   deferred handler returns 'true'), we add it to that free block.  Otherwise
+ *   we create a new free block for it and chain it to the last free block we
+ *   scanned.
+ */
+
 /**
  * flow_alloc() - Allocate a new flow
  *
@@ -56,10 +102,31 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
  */
 union flow *flow_alloc(void)
 {
-	if (flow_count >= FLOW_MAX)
+	union flow *flow = &flowtab[flow_first_free];
+
+	if (flow_first_free >= FLOW_MAX)
 		return NULL;
 
-	return &flowtab[flow_count++];
+	ASSERT(flow->f.type == FLOW_TYPE_NONE);
+	ASSERT(flow->free.n >= 1);
+
+	if (flow->free.n > 1) {
+		/* Use one entry from the block */
+		union flow *next = &flowtab[++flow_first_free];
+
+		ASSERT(FLOW_IDX(next) < FLOW_MAX);
+		ASSERT(next->f.type == FLOW_TYPE_NONE);
+		ASSERT(next->free.n == 0);
+
+		next->free.n = flow->free.n - 1;
+		next->free.next = flow->free.next;
+	} else {
+		/* Use the entire block */
+		flow_first_free = flow->free.next;
+	}
+
+	memset(flow, 0, sizeof(*flow));
+	return flow;
 }
 
 /**
@@ -70,48 +137,15 @@ union flow *flow_alloc(void)
  */
 void flow_alloc_cancel(union flow *flow)
 {
-	ASSERT(FLOW_IDX(flow) == flow_count - 1);
-	memset(flow, 0, sizeof(*flow));
-	flow_count--;
-}
-
-/**
- * flow_table_compact() - Perform compaction on flow table
- * @c:		Execution context
- * @hole:	Pointer to recently closed flow
- */
-static void flow_table_compact(const struct ctx *c, union flow *hole)
-{
-	union flow *from;
-
-	if (FLOW_IDX(hole) == --flow_count) {
-		debug("flow: table compaction: maximum index was %u (%p)",
-		      FLOW_IDX(hole), (void *)hole);
-		memset(hole, 0, sizeof(*hole));
-		return;
-	}
-
-	from = flowtab + flow_count;
-	memcpy(hole, from, sizeof(*hole));
-
-	switch (from->f.type) {
-	case FLOW_TCP:
-		tcp_tap_conn_update(c, &from->tcp, &hole->tcp);
-		break;
-	case FLOW_TCP_SPLICE:
-		tcp_splice_conn_update(c, &hole->tcp_splice);
-		break;
-	default:
-		die("Unexpected %s in tcp_table_compact()",
-		    FLOW_TYPE(&from->f));
-	}
-
-	debug("flow: table compaction (%s): old index %u, new index %u, "
-	      "from: %p, to: %p",
-	      FLOW_TYPE(&from->f), FLOW_IDX(from), FLOW_IDX(hole),
-	      (void *)from, (void *)hole);
-
-	memset(from, 0, sizeof(*from));
+	ASSERT(flow_first_free > FLOW_IDX(flow));
+
+	flow->f.type = FLOW_TYPE_NONE;
+	/* Put it back in a length 1 free block, don't attempt to fully reverse
+	 * flow_alloc()s steps.  This will get folded together the next time
+	 * flow_defer_handler runs anyway() */
+	flow->free.n = 1;
+	flow->free.next = flow_first_free;
+	flow_first_free = FLOW_IDX(flow);
 }
 
 /**
@@ -121,18 +155,35 @@ static void flow_table_compact(const struct ctx *c, union flow *hole)
  */
 void flow_defer_handler(const struct ctx *c, const struct timespec *now)
 {
+	struct flow_free_block *free_head = NULL;
+	unsigned *last_next = &flow_first_free;
 	bool timer = false;
-	union flow *flow;
+	unsigned idx;
 
 	if (timespec_diff_ms(now, &flow_timer_run) >= FLOW_TIMER_INTERVAL) {
 		timer = true;
 		flow_timer_run = *now;
 	}
 
-	for (flow = flowtab + flow_count - 1; flow >= flowtab; flow--) {
+	for (idx = 0; idx < FLOW_MAX; idx++) {
+		union flow *flow = &flowtab[idx];
 		bool closed = false;
 
+		if (flow->f.type == FLOW_TYPE_NONE) {
+			/* Start of a free block */
+			free_head = &flow->free;
+			*last_next = idx;
+			last_next = &free_head->next;
+			/* Skip the rest of the block */
+			idx += free_head->n - 1;
+			continue;
+		}
+
+
 		switch (flow->f.type) {
+		case FLOW_TYPE_NONE:
+			closed = true;
+			break;
 		case FLOW_TCP:
 			closed = tcp_flow_defer(flow);
 			break;
@@ -146,7 +197,35 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
 			;
 		}
 
-		if (closed)
-			flow_table_compact(c, flow);
+		if (closed) {
+			flow->f.type = FLOW_TYPE_NONE;
+
+			if (free_head) {
+				/* Add slot to current free block */
+				ASSERT(idx == FLOW_IDX(free_head) + free_head->n);
+				free_head->n++;
+				flow->free.n = flow->free.next = 0;
+			} else {
+				/* Create new free block */
+				free_head = &flow->free;
+				free_head->n = 1;
+				*last_next = idx;
+				last_next = &free_head->next;
+			}
+		} else {
+			free_head = NULL;
+		}
 	}
+
+	*last_next = FLOW_MAX;
+}
+
+/**
+ * flow_init() - Initialise flow related data structures
+ */
+void flow_init(void)
+{
+	/* Initial state is a single free block containing the whole table */
+	flowtab[0].free.n = FLOW_MAX;
+	flowtab[0].free.next = FLOW_MAX;
 }
diff --git a/flow.h b/flow.h
index 8064f0e..48a0ab4 100644
--- a/flow.h
+++ b/flow.h
@@ -68,6 +68,7 @@ static inline bool flow_sidx_eq(flow_sidx_t a, flow_sidx_t b)
 
 union flow;
 
+void flow_init(void);
 void flow_defer_handler(const struct ctx *c, const struct timespec *now);
 
 void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
diff --git a/flow_table.h b/flow_table.h
index 2773a2b..34fc679 100644
--- a/flow_table.h
+++ b/flow_table.h
@@ -9,6 +9,19 @@
 
 #include "tcp_conn.h"
 
+/**
+ * struct flow_free_block - Information about a block of free entries
+ * @f:		Generic flow information
+ * @n:		Number of entries in this free block (including this one)
+ * @next:	Index of next free block
+ */
+struct flow_free_block {
+	/* Must be first element */
+	struct flow_common f;
+	unsigned n;
+	unsigned next;
+};
+
 /**
  * union flow - Descriptor for a logical packet flow (e.g. connection)
  * @f:		Fields common between all variants
@@ -17,12 +30,13 @@
  */
 union flow {
 	struct flow_common f;
+	struct flow_free_block free;
 	struct tcp_tap_conn tcp;
 	struct tcp_splice_conn tcp_splice;
 };
 
 /* Global Flow Table */
-extern unsigned flow_count;
+extern unsigned flow_first_free;
 extern union flow flowtab[];
 
diff --git a/passt.c b/passt.c
index 71bea8f..d315438 100644
--- a/passt.c
+++ b/passt.c
@@ -285,6 +285,8 @@ int main(int argc, char **argv)
 
 	clock_gettime(CLOCK_MONOTONIC, &now);
 
+	flow_init();
+
 	if ((!c.no_udp && udp_init(&c)) || (!c.no_tcp && tcp_init(&c)))
 		exit(EXIT_FAILURE);
 
diff --git a/tcp.c b/tcp.c
index a38deb0..6aacb56 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1250,29 +1250,6 @@ static void tcp_hash_remove(const struct ctx *c,
 	tc_hash[b] = FLOW_SIDX_NONE;
 }
 
-/**
- * tcp_tap_conn_update() - Update tcp_tap_conn when being moved in the table
- * @c:		Execution context
- * @old:	Old location of tcp_tap_conn
- * @new:	New location of tcp_tap_conn
- */
-void tcp_tap_conn_update(const struct ctx *c, struct tcp_tap_conn *old,
-			 struct tcp_tap_conn *new)
-
-{
-	unsigned b = tcp_hash_probe(c, old);
-
-	if (!flow_at_sidx(tc_hash[b]))
-		return; /* Not in hash table, nothing to update */
-
-	tc_hash[b] = FLOW_SIDX(new, TAPSIDE);
-
-	debug("TCP: hash table update: old index %u, new index %u, sock %i, "
-	      "bucket: %u", FLOW_IDX(old), FLOW_IDX(new), new->sock, b);
-
-	tcp_epoll_ctl(c, new);
-}
-
 /**
  * tcp_hash_lookup() - Look up connection given remote address and ports
  * @c:		Execution context
diff --git a/tcp_conn.h b/tcp_conn.h
index 636224e..a5f5cfe 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -155,9 +155,6 @@ struct tcp_splice_conn {
 extern int init_sock_pool4	[TCP_SOCK_POOL_SIZE];
 extern int init_sock_pool6	[TCP_SOCK_POOL_SIZE];
 
-void tcp_tap_conn_update(const struct ctx *c, struct tcp_tap_conn *old,
-			 struct tcp_tap_conn *new);
-void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new);
 bool tcp_flow_defer(union flow *flow);
 bool tcp_splice_flow_defer(union flow *flow);
 void tcp_splice_timer(const struct ctx *c, union flow *flow);
diff --git a/tcp_splice.c b/tcp_splice.c
index 0711b19..7aae623 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -231,17 +231,6 @@ static void conn_event_do(const struct ctx *c, struct tcp_splice_conn *conn,
 	} while (0)
 
-/**
- * tcp_splice_conn_update() - Update tcp_splice_conn when being moved in the table
- * @c:		Execution context
- * @new:	New location of tcp_splice_conn
- */
-void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new)
-{
-	if (tcp_splice_epoll_ctl(c, new))
-		conn_flag(c, new, CLOSING);
-}
-
 /**
  * tcp_splice_flow_defer() - Deferred per-flow handling (clean up closed)
  * @flow:	Flow table entry for this connection
-- 
2.43.0
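The free-block allocation scheme described in the "theory of operation" comment above can be exercised with a small self-contained model. This is a hypothetical sketch (the `fb_*` names, a plain `used` flag instead of `FLOW_TYPE_NONE`, and the table size are invented for illustration): the first slot of each run of free entries carries the run length `n` and the index of the next run, and allocation always peels the first slot off the first run.

```c
#include <assert.h>
#include <string.h>

#define FB_MAX 8	/* stand-in for FLOW_MAX */

/* Hypothetical miniature of union flow / struct flow_free_block */
struct fb_slot {
	int used;	/* 0 == free, i.e. FLOW_TYPE_NONE */
	unsigned n;	/* free block: number of slots in the block */
	unsigned next;	/* free block: index of next free block */
};

static struct fb_slot fb_tab[FB_MAX];
static unsigned fb_first_free;

/* Like flow_init(): the whole table is one free block */
static void fb_init(void)
{
	memset(fb_tab, 0, sizeof(fb_tab));
	fb_tab[0].n = FB_MAX;
	fb_tab[0].next = FB_MAX;	/* FB_MAX terminates the list */
	fb_first_free = 0;
}

/* Like flow_alloc(): take the first slot of the first free block */
static struct fb_slot *fb_alloc(void)
{
	struct fb_slot *slot;

	if (fb_first_free >= FB_MAX)
		return NULL;

	slot = &fb_tab[fb_first_free];
	assert(!slot->used && slot->n >= 1);

	if (slot->n > 1) {
		/* Shrink the block: its second slot becomes the new head */
		struct fb_slot *next = &fb_tab[++fb_first_free];

		next->n = slot->n - 1;
		next->next = slot->next;
	} else {
		/* Block used up entirely: advance to the next block */
		fb_first_free = slot->next;
	}

	memset(slot, 0, sizeof(*slot));
	slot->used = 1;
	return slot;
}
```

A table scan in this model only needs one iteration per active slot plus one per free block, since `n` lets the scan jump over the rest of each block — the property the patch relies on in flow_defer_handler().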
On Thu, 21 Dec 2023 17:15:49 +1100
David Gibson <david(a)gibson.dropbear.id.au> wrote:

> Currently we always keep the flow table maximally compact: that is all
> the active entries are contiguous at the start of the table.  Doing
> this sometimes requires moving an entry when one is freed.  That's kind
> of fiddly, and potentially expensive: it requires updating the hash
> table for the new location, and depending on flow type, it may require
> EPOLL_CTL_MOD, system calls to update epoll tags with the new location
> too.
>
> Implement a new way of managing the flow table that doesn't ever move
> entries.  It attempts to maintain some compactness by always using the
> first free slot for a new connection, and mitigates the effect of non
> compactness by cheaply skipping over contiguous blocks of free entries.
> See the "theory of operation" comment in flow.c for details.
>
> Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>
> ---
>  flow.c       | 177 +++++++++++++++++++++++++++++++++++++--------------
>  flow.h       |   1 +
>  flow_table.h |  16 ++++-
>  passt.c      |   2 +
>  tcp.c        |  23 -------
>  tcp_conn.h   |   3 -
>  tcp_splice.c |  11 ----
>  7 files changed, 146 insertions(+), 87 deletions(-)
>
> diff --git a/flow.c b/flow.c
> index 7b99b58..421e6b5 100644
> --- a/flow.c
> +++ b/flow.c
> @@ -25,7 +25,7 @@ static_assert(ARRAY_SIZE(flow_type_str) == FLOW_NUM_TYPES,
>  	      "flow_type_str[] doesn't match enum flow_type");
>  
>  /* Global Flow Table */
> -unsigned flow_count;
> +unsigned flow_first_free;

As mentioned in the comment to 10/13: this should be initialised to
zero.

>  union flow flowtab[FLOW_MAX];
>  
>  /* Last time the flow timers ran */
> @@ -49,6 +49,52 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
>  	logmsg(pri, "Flow %u (%s): %s", flow_idx(f), FLOW_TYPE(f), msg);
>  }
>  
> +/**

As a general comment: interesting -- I wonder if this already has a
name, but I couldn't find anything similar in the literature. Perhaps
it's equivalent to a flattened search tree where free/busy slots can be
found by following the links.

> + * DOC: Theory of Operation - allocation and freeing of flow entries

"allocating and freeing flow entries" ...flows better for me.

I think this comment should be before the declaration of flowtab[] --
above here, with a function in between, is not exactly where I expected
to find it as I started reading the next paragraph.

> + *
> + * Each flow takes a single slot in flowtab[].  Moving entries in that table

You jump right away to "moving entries", we know why, but I guess any
other reader would be lost at this point. In general, it took me some
effort to grasp this paragraph. What about:

 * Flows are entries in the flowtab[]. We need to routinely scan the whole table
 * to perform deferred bookkeeping tasks on active entries, and sparse empty
 * slots waste time and worsen data locality.
 *
 * But keeping the table compacted by moving entries on deletion is fiddly: it
 * requires updating hash tables, and the epoll references to the flows. Instead,
 * we implement the compromise described below.

?

> + * (which we used to do) is fiddly and possibly expensive: it requires updating
> + * the hash table indexing flows, and may require updating epoll data which
> + * references the flow by index.  However, we also want to keep the active
> + * entries in the table compact where possible, because otherwise scanning
> + * through the entire table becomes more expensive.  This describes the
> + * compromise implemented below.
> + *
> + * Free blocks
> + *   A "free block" is a contiguous sequence of unused (FLOW_TYPE_NONE) entries

This gave me the idea that blocks are some kind of "bucket" with a
fixed size. I think we should either mention that the blocks have
variable size, or call them sequences, or clusters (of free slots)
perhaps?

> + *   in flowtab.  The first entry in each block contains metadata, specifically
> + *   the number of entries in the block, and the index of the next (non
> + *   contiguous) free block (struct flow_free_block).

...it doesn't waste any space, but I don't think we actually need to
store the number of "entries" (slots?)... more details below.

> + *
> + * Free block list
> + *   flow_first_free gives the index of the first entry of the first (lowest
> + *   index) free block.  Each free block has the index of the next free block,
> + *   or MAX_FLOW if it is the last free block.  Together these form a linked
> + *   list of free blocks, in strictly increasing order of index.
> + *
> + * Allocation
> + *   When allocating a new flow, we always use the first entry of the first
> + *   free block, that is, at index flow_first_free.  If the block has more than
> + *   one entry, flow_first_free is updated to the next entry, which is updated
> + *   to represent the new smaller free block.  Otherwise the free block is
> + *   eliminated and flow_first_free is updated to the next free block.
> + *
> + * Scanning the table
> + *   Theoretically, scanning the table requires FLOW_MAX iterations.  However,
> + *   when we encounter the start of a free block, we can immediately skip to
> + *   its end, meaning that in practice we only need (number of active

after its end

> + *   connections) + (number of free blocks) iterations.
> + *
> + * Freeing
> + *   We can only free entries when scanning the whole flow table in
> + *   flow_defer_handler().  This is what lets us maintain the fee block list in

s/fee/free/

> + *   index sorted order.  As we scan we keep track of whether the previous
> + *   entry was in a free block or not.  If so when an entry is freed (its
> + *   deferred handler returns 'true'), we add it to that free block.  Otherwise
> + *   we create a new free block for it and chain it to the last free block we
> + *   scanned.
> + */
> +
>  /**
>   * flow_alloc() - Allocate a new flow
>   *
> @@ -56,10 +102,31 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
>   */
>  union flow *flow_alloc(void)
>  {
> -	if (flow_count >= FLOW_MAX)
> +	union flow *flow = &flowtab[flow_first_free];
> +
> +	if (flow_first_free >= FLOW_MAX)
>  		return NULL;
>  
> -	return &flowtab[flow_count++];
> +	ASSERT(flow->f.type == FLOW_TYPE_NONE);
> +	ASSERT(flow->free.n >= 1);
> +
> +	if (flow->free.n > 1) {
> +		/* Use one entry from the block */
> +		union flow *next = &flowtab[++flow_first_free];

...but we just checked (flow_first_free < FLOW_MAX) above. Now you
increment flow_first_free by one. I *guess* that if flow_first_free is
FLOW_MAX - 1, then flow->free.n should be 0, so we shouldn't end up in
this case -- but it's a bit confusing.

> +
> +		ASSERT(FLOW_IDX(next) < FLOW_MAX);
> +		ASSERT(next->f.type == FLOW_TYPE_NONE);
> +		ASSERT(next->free.n == 0);
> +
> +		next->free.n = flow->free.n - 1;
> +		next->free.next = flow->free.next;
> +	} else {
> +		/* Use the entire block */
> +		flow_first_free = flow->free.next;
> +	}

I wonder if we really have to keep track of the number of (non-)entries
in the free "block", and if we have to be explicit about the two cases.

I'm trying to find out if we can simplify the whole thing with slightly
different variants, for example:

struct flow_free_block {
	[...]
	unsigned next_free;
	unsigned next_busy;
};

or:

struct flow_free_block {
	[...]
	unsigned next_free;
};

or even some sort of double-linked list:

struct flow_free_block {
	[...]
	unsigned prev_free;
	unsigned next_free;
};

but I couldn't quite find a more elegant solution yet.

> +
> +	memset(flow, 0, sizeof(*flow));
> +	return flow;
>  }
>  
>  /**
> @@ -70,48 +137,15 @@ union flow *flow_alloc(void)
>   */
>  void flow_alloc_cancel(union flow *flow)
>  {
> -	ASSERT(FLOW_IDX(flow) == flow_count - 1);
> -	memset(flow, 0, sizeof(*flow));
> -	flow_count--;
> -}
> -
> -/**
> - * flow_table_compact() - Perform compaction on flow table
> - * @c:		Execution context
> - * @hole:	Pointer to recently closed flow
> - */
> -static void flow_table_compact(const struct ctx *c, union flow *hole)
> -{
> -	union flow *from;
> -
> -	if (FLOW_IDX(hole) == --flow_count) {
> -		debug("flow: table compaction: maximum index was %u (%p)",
> -		      FLOW_IDX(hole), (void *)hole);
> -		memset(hole, 0, sizeof(*hole));
> -		return;
> -	}
> -
> -	from = flowtab + flow_count;
> -	memcpy(hole, from, sizeof(*hole));
> -
> -	switch (from->f.type) {
> -	case FLOW_TCP:
> -		tcp_tap_conn_update(c, &from->tcp, &hole->tcp);
> -		break;
> -	case FLOW_TCP_SPLICE:
> -		tcp_splice_conn_update(c, &hole->tcp_splice);
> -		break;
> -	default:
> -		die("Unexpected %s in tcp_table_compact()",
> -		    FLOW_TYPE(&from->f));
> -	}
> -
> -	debug("flow: table compaction (%s): old index %u, new index %u, "
> -	      "from: %p, to: %p",
> -	      FLOW_TYPE(&from->f), FLOW_IDX(from), FLOW_IDX(hole),
> -	      (void *)from, (void *)hole);
> -
> -	memset(from, 0, sizeof(*from));
> +	ASSERT(flow_first_free > FLOW_IDX(flow));
> +
> +	flow->f.type = FLOW_TYPE_NONE;
> +	/* Put it back in a length 1 free block, don't attempt to fully reverse
> +	 * flow_alloc()s steps.  This will get folded together the next time
> +	 * flow_defer_handler runs anyway() */
> +	flow->free.n = 1;
> +	flow->free.next = flow_first_free;
> +	flow_first_free = FLOW_IDX(flow);
>  }
>  
>  /**
> @@ -121,18 +155,35 @@ static void flow_table_compact(const struct ctx *c, union flow *hole)
>   */
>  void flow_defer_handler(const struct ctx *c, const struct timespec *now)
>  {
> +	struct flow_free_block *free_head = NULL;
> +	unsigned *last_next = &flow_first_free;
>  	bool timer = false;
> -	union flow *flow;
> +	unsigned idx;
>  
>  	if (timespec_diff_ms(now, &flow_timer_run) >= FLOW_TIMER_INTERVAL) {
>  		timer = true;
>  		flow_timer_run = *now;
>  	}
>  
> -	for (flow = flowtab + flow_count - 1; flow >= flowtab; flow--) {
> +	for (idx = 0; idx < FLOW_MAX; idx++) {
> +		union flow *flow = &flowtab[idx];
>  		bool closed = false;
>  
> +		if (flow->f.type == FLOW_TYPE_NONE) {
> +			/* Start of a free block */
> +			free_head = &flow->free;
> +			*last_next = idx;
> +			last_next = &free_head->next;
> +			/* Skip the rest of the block */
> +			idx += free_head->n - 1;
> +			continue;
> +		}
> +
> +

Stray tabs.

>  		switch (flow->f.type) {
> +		case FLOW_TYPE_NONE:
> +			closed = true;
> +			break;
>  		case FLOW_TCP:
>  			closed = tcp_flow_defer(flow);
>  			break;
> @@ -146,7 +197,35 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
>  			;
>  		}
>  
> -		if (closed)
> -			flow_table_compact(c, flow);
> +		if (closed) {

Should this be a flow_del() function perhaps, possibly taking flow and
free_head as argument?

> +			flow->f.type = FLOW_TYPE_NONE;
> +
> +			if (free_head) {
> +				/* Add slot to current free block */
> +				ASSERT(idx == FLOW_IDX(free_head) + free_head->n);
> +				free_head->n++;
> +				flow->free.n = flow->free.next = 0;
> +			} else {
> +				/* Create new free block */
> +				free_head = &flow->free;
> +				free_head->n = 1;
> +				*last_next = idx;
> +				last_next = &free_head->next;
> +			}
> +		} else {
> +			free_head = NULL;
> +		}
>  	}
> +
> +	*last_next = FLOW_MAX;
> +}
> +
> +/**
> + * flow_init() - Initialise flow related data structures
> + */
> +void flow_init(void)
> +{
> +	/* Initial state is a single free block containing the whole table */
> +	flowtab[0].free.n = FLOW_MAX;
> +	flowtab[0].free.next = FLOW_MAX;
>  }
> diff --git a/flow.h b/flow.h
> index 8064f0e..48a0ab4 100644
> --- a/flow.h
> +++ b/flow.h
> @@ -68,6 +68,7 @@ static inline bool flow_sidx_eq(flow_sidx_t a, flow_sidx_t b)
>  
>  union flow;
>  
> +void flow_init(void);
>  void flow_defer_handler(const struct ctx *c, const struct timespec *now);
>  
>  void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
> diff --git a/flow_table.h b/flow_table.h
> index 2773a2b..34fc679 100644
> --- a/flow_table.h
> +++ b/flow_table.h
> @@ -9,6 +9,19 @@
>  
>  #include "tcp_conn.h"
>  
> +/**
> + * struct flow_free_block - Information about a block of free entries
> + * @f:		Generic flow information
> + * @n:		Number of entries in this free block (including this one)
> + * @next:	Index of next free block
> + */
> +struct flow_free_block {
> +	/* Must be first element */
> +	struct flow_common f;
> +	unsigned n;
> +	unsigned next;
> +};
> +
>  /**
>   * union flow - Descriptor for a logical packet flow (e.g. connection)
>   * @f:		Fields common between all variants
> @@ -17,12 +30,13 @@
>   */
>  union flow {
>  	struct flow_common f;
> +	struct flow_free_block free;
>  	struct tcp_tap_conn tcp;
>  	struct tcp_splice_conn tcp_splice;
>  };
>  
>  /* Global Flow Table */
> -extern unsigned flow_count;
> +extern unsigned flow_first_free;
>  extern union flow flowtab[];
>  
> diff --git a/passt.c b/passt.c
> index 71bea8f..d315438 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -285,6 +285,8 @@ int main(int argc, char **argv)
>  
>  	clock_gettime(CLOCK_MONOTONIC, &now);
>  
> +	flow_init();
> +
>  	if ((!c.no_udp && udp_init(&c)) || (!c.no_tcp && tcp_init(&c)))
>  		exit(EXIT_FAILURE);
>  
> diff --git a/tcp.c b/tcp.c
> index a38deb0..6aacb56 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1250,29 +1250,6 @@ static void tcp_hash_remove(const struct ctx *c,
>  	tc_hash[b] = FLOW_SIDX_NONE;
>  }
>  
> -/**
> - * tcp_tap_conn_update() - Update tcp_tap_conn when being moved in the table
> - * @c:		Execution context
> - * @old:	Old location of tcp_tap_conn
> - * @new:	New location of tcp_tap_conn
> - */
> -void tcp_tap_conn_update(const struct ctx *c, struct tcp_tap_conn *old,
> -			 struct tcp_tap_conn *new)
> -
> -{
> -	unsigned b = tcp_hash_probe(c, old);
> -
> -	if (!flow_at_sidx(tc_hash[b]))
> -		return; /* Not in hash table, nothing to update */
> -
> -	tc_hash[b] = FLOW_SIDX(new, TAPSIDE);
> -
> -	debug("TCP: hash table update: old index %u, new index %u, sock %i, "
> -	      "bucket: %u", FLOW_IDX(old), FLOW_IDX(new), new->sock, b);
> -
> -	tcp_epoll_ctl(c, new);
> -}
> -
>  /**
>   * tcp_hash_lookup() - Look up connection given remote address and ports
>   * @c:		Execution context
> diff --git a/tcp_conn.h b/tcp_conn.h
> index 636224e..a5f5cfe 100644
> --- a/tcp_conn.h
> +++ b/tcp_conn.h
> @@ -155,9 +155,6 @@ struct tcp_splice_conn {
>  extern int init_sock_pool4	[TCP_SOCK_POOL_SIZE];
>  extern int init_sock_pool6	[TCP_SOCK_POOL_SIZE];
>  
> -void tcp_tap_conn_update(const struct ctx *c, struct tcp_tap_conn *old,
> -			 struct tcp_tap_conn *new);
> -void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new);
>  bool tcp_flow_defer(union flow *flow);
>  bool tcp_splice_flow_defer(union flow *flow);
>  void tcp_splice_timer(const struct ctx *c, union flow *flow);
> diff --git a/tcp_splice.c b/tcp_splice.c
> index 0711b19..7aae623 100644
> --- a/tcp_splice.c
> +++ b/tcp_splice.c
> @@ -231,17 +231,6 @@ static void conn_event_do(const struct ctx *c, struct tcp_splice_conn *conn,
>  	} while (0)
>  
> -/**
> - * tcp_splice_conn_update() - Update tcp_splice_conn when being moved in the table
> - * @c:		Execution context
> - * @new:	New location of tcp_splice_conn
> - */
> -void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new)
> -{
> -	if (tcp_splice_epoll_ctl(c, new))
> -		conn_flag(c, new, CLOSING);
> -}
> -
>  /**
>   * tcp_splice_flow_defer() - Deferred per-flow handling (clean up closed)
>   * @flow:	Flow table entry for this connection

-- 
Stefano
On Thu, 28 Dec 2023 19:25:25 +0100
Stefano Brivio <sbrivio(a)redhat.com> wrote:

So... I think the version with (explicit) blocks has this fundamental
advantage, on deletion:

> On Thu, 21 Dec 2023 17:15:49 +1100
> David Gibson <david(a)gibson.dropbear.id.au> wrote:
> [...]

[...] I wonder if we really have to keep track of the number of
(non-)entries in the free "block", and if we have to be explicit about
the two cases. I'm trying to find out if we can simplify the whole
thing with slightly different variants, for example:

> +	flow->f.type = FLOW_TYPE_NONE;
> +	/* Put it back in a length 1 free block, don't attempt to fully reverse
> +	 * flow_alloc()s steps.  This will get folded together the next time
> +	 * flow_defer_handler runs anyway() */
> +	flow->free.n = 1;
> +	flow->free.next = flow_first_free;
> +	flow_first_free = FLOW_IDX(flow);

which is doable even without explicit blocks, but much harder to
follow.

Other than the comments I have, I would go ahead with your approach. My
attempt below in case you're interested. The comment at the end of
flow_del() shows the issue I'm facing:

struct flow_free_block {
	/* Must be first element */
	struct flow_common f;

	unsigned next_free;
	unsigned next_used;
};

static union flow *flow_alloc(void)
{
	union flow *flow;

	if (flow_first_free >= FLOW_MAX)
		return NULL;

	flow = flowtab + flow_first_free;
	flow_first_free = flow->free.next_free;

	memset(flow, 0, sizeof(*flow));	/* TODO: select region */
	return flow;
}

static void flow_del(union flow *del)
{
	union flow *left, *right, *next_used = NULL, *first_free;

	del->f.type = FLOW_TYPE_NONE;

	left = (FLOW_IDX(del) > 0) ? FLOW(FLOW_IDX(del) - 1) : NULL;
	right = (FLOW_IDX(del) < FLOW_MAX - 1) ? FLOW(FLOW_IDX(del) + 1) : NULL;
	first_free = flow_first_free < FLOW_MAX ? FLOW(flow_first_free) : NULL;

	if (right) {
		if (right->f.type == FLOW_TYPE_NONE)
			del->free.next_used = right->free.next_used;
		else
			del->free.next_used = right;
	} else {
		del->free.next_used = FLOW_MAX;
	}

	if (left && left->f.type == FLOW_TYPE_NONE) {
		left->free.next_free = FLOW_IDX(del);
		left->free.next_used = del->free.next_used;
		return;
	}

	if (flow_first_free == FLOW_MAX) {
		flow_first_free = FLOW_IDX(del);
	} else if (flow_first_free > FLOW_IDX(del)) {
		flow_first_free = FLOW_IDX(del);
		del->free.next_free = flow_first_free;
	} else if (flow_first_free < FLOW_IDX(del)) {
		;	/* Walk free slots from flow_first_free until FLOW_IDX(del)
			 * to find insertion point... but that makes deletion at
			 * most O(n), perhaps O(log(n)), certainly not O(1).
			 *
			 * Or just link out-of-order for the moment, and fix this
			 * up in flow_next(), but then "blocks" help representing
			 * this.
			 */
	}
}

static union flow *flow_next(union flow *whence)
{
	if (FLOW_IDX(whence) >= FLOW_MAX - 1)
		return NULL;

	if (whence->f.type != FLOW_TYPE_NONE)
		whence++;

	if (whence->f.type == FLOW_TYPE_NONE)
		return whence->free.next_used;

	return whence;
}

-- 
Stefano
On Sat, Dec 30, 2023 at 11:33:04AM +0100, Stefano Brivio wrote:
> On Thu, 28 Dec 2023 19:25:25 +0100
> Stefano Brivio <sbrivio(a)redhat.com> wrote:
> > On Thu, 21 Dec 2023 17:15:49 +1100
> > David Gibson <david(a)gibson.dropbear.id.au> wrote:
> >
> > [...]
>
> > [...]
> >
> > I wonder if we really have to keep track of the number of (non-)entries
> > in the free "block", and if we have to be explicit about the two cases.
> >
> > I'm trying to find out if we can simplify the whole thing with slightly
> > different variants, for example:
>
> So... I think the version with (explicit) blocks has this fundamental
> advantage, on deletion:
>
> > > + flow->f.type = FLOW_TYPE_NONE;
> > > + /* Put it back in a length 1 free block, don't attempt to fully reverse
> > > + * flow_alloc()s steps. This will get folded together the next time
> > > + * flow_defer_handler runs anyway() */
> > > + flow->free.n = 1;
> > > + flow->free.next = flow_first_free;
> > > + flow_first_free = FLOW_IDX(flow);
>
> which is doable even without explicit blocks, but much harder to
> follow.

Remember this is not a general deletion, only a "cancel" of the most
recent allocation.  To reduce fragmentation we are keeping the linked
list of free clusters in strictly ascending order.  The logic above is
only correct if the entry we're freeing is before any other free entry
in the table.  That will be the case for the most recent allocation,
because we always allocate the first free entry in the table.

> Other than the comments I have, I would go ahead with your approach. My
> attempt below in case you're interested. The comment at the end of
> flow_del() shows the issue I'm facing:
>
> [...]
>
> 	} else if (flow_first_free < FLOW_IDX(del)) {
> 		;
> 		/* Walk free slots from flow_first_free until FLOW_IDX(del) to
> 		 * find insertion point... but that makes deletion at most O(n),
> 		 * perhaps O(log(n)), certainly not O(1).

Exactly.  I can't see any way (short of far more complex data
structures) around having a linear scan somewhere, although you can
kind of choose whether it happens on alloc or free.  The idea of my
implementation is to have it at free time - but to merge it with an
existing linear scan so it won't have an additional cost.

> 		 *
> 		 * Or just link out-of-order for the moment, and fix this up in
> 		 * flow_next(), but then "blocks" help representing this.
> 		 */
> 	}
> }
>
> [...]

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
On Mon, 1 Jan 2024 23:01:17 +1100
David Gibson <david(a)gibson.dropbear.id.au> wrote:

> On Sat, Dec 30, 2023 at 11:33:04AM +0100, Stefano Brivio wrote:
>
> [...]
>
> Remember this is not a general deletion, only a "cancel" of the most
> recent allocation.

Oh, I thought that was only the case for this series and you would use
that as actual deletion in another pending series (which I haven't
finished reviewing yet). But now I'm not sure anymore why I was
thinking this...

Anyway... do we really need it, then? Can't we just mark the "failed"
flows as whatever means "closed" for a specific protocol, and clean
them up later, instead of calling cancel() right away?

> To reduce fragmentation we are keeping the linked list of free
> clusters in strictly ascending order.  The logic above is only correct
> if the entry we're freeing is before any other free entry in the
> table.  That will be the case for the most recent allocation, because
> we always allocate the first free entry in the table.

I see now. This isn't entirely clear from the "Theory of Operation", on
the other hand it's probably a bad idea to overload that description
with all these details.

> [...]
>
> Exactly.  I can't see any way (short of far more complex data
> structures) around having a linear scan somewhere, although you can
> kind of choose whether it happens on alloc or free.  The idea of my
> implementation is to have it at free time - but to merge it with an
> existing linear scan so it won't have an additional cost.

Right, yes, I see now. This idea is also not terribly clear from the
description, by the way, but I don't have a good proposal right now.

-- 
Stefano
On Tue, Jan 02, 2024 at 07:13:41PM +0100, Stefano Brivio wrote:
> On Mon, 1 Jan 2024 23:01:17 +1100
> David Gibson <david(a)gibson.dropbear.id.au> wrote:
>
> [...]
>
> > Remember this is not a general deletion, only a "cancel" of the most
> > recent allocation.
>
> Oh, I thought that was only the case for this series and you would use
> that as actual deletion in another pending series (which I haven't
> finished reviewing yet).

No.  Not allowing deletion of any entry at any time is what I'm
trading off to get both O(1) allocation and (effectively) O(1)
deletion.

> But now I'm not sure anymore why I was thinking this...
>
> Anyway... do we really need it, then? Can't we just mark the "failed"
> flows as whatever means "closed" for a specific protocol, and clean
> them up later, instead of calling cancel() right away?

We could, but I'm not sure we want to.  For starters, that requires
protocol-specific behaviour whenever we need to back out an allocation
like this.  Not a big deal, since that's in protocol specific code
already, but I think it's uglier than calling cancel.

It also requires that the protocol specific deferred cleanup functions
(e.g. tcp_flow_defer()) handle partially initialised entries.  With
'cancel' we can back out just the initialisation steps we've already
done (because we know where we've failed during init), then remove the
entry.  The deferred cleanup function only needs to deal with
"complete" entries.  Again, certainly possible, but IMO uglier than
having 'cancel'.

Finally, leaving the "cancelled" entry there until the next deferred
pass means if we allocate other flows before that defer pass they'll
avoid this slot.  That works against trying to keep the entries
reasonably compact.

> I see now. This isn't entirely clear from the "Theory of Operation", on
> the other hand it's probably a bad idea to overload that description
> with all these details.

I'll see what I can do in terms of making that clearer in the
description.

> Right, yes, I see now. This idea is also not terribly clear from the
> description, by the way, but I don't have a good proposal right now.

I'll see what I can come up with.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
On Thu, 4 Jan 2024 21:02:19 +1100
David Gibson <david(a)gibson.dropbear.id.au> wrote:

> On Tue, Jan 02, 2024 at 07:13:41PM +0100, Stefano Brivio wrote:
>
> [...]
>
> > Anyway... do we really need it, then? Can't we just mark the "failed"
> > flows as whatever means "closed" for a specific protocol, and clean
> > them up later, instead of calling cancel() right away?
>
> We could, but I'm not sure we want to.  For starters, that requires
> protocol-specific behaviour whenever we need to back out an allocation
> like this.  Not a big deal, since that's in protocol specific code
> already, but I think it's uglier than calling cancel.
>
> It also requires that the protocol specific deferred cleanup functions
> (e.g. tcp_flow_defer()) handle partially initialised entries.  With
> 'cancel' we can back out just the initialisation steps we've already
> done (because we know where we've failed during init), then remove the
> entry.  The deferred cleanup function only needs to deal with
> "complete" entries.  Again, certainly possible, but IMO uglier than
> having 'cancel'.

Okay, yes, I see now.

Another doubt that comes to me now is: if you don't plan to use this
alloc_cancel() thing anywhere else, the only reason why you are adding
it is to replace the (flow_count >= FLOW_MAX) check with a flow_alloc()
version that can fail.

But at this point, speaking of ugliness, couldn't we just have a

  bool flow_may_alloc() { return flow_first_free < FLOW_MAX };

the caller can use to decide to abort earlier? To me it looks so much
simpler and more robust.

-- 
Stefano
On Fri, Jan 05, 2024 at 09:33:35AM +0100, Stefano Brivio wrote:
> On Thu, 4 Jan 2024 21:02:19 +1100
> David Gibson <david(a)gibson.dropbear.id.au> wrote:
>
> [...]
>
> Okay, yes, I see now.
>
> Another doubt that comes to me now is: if you don't plan to use this
> alloc_cancel() thing anywhere else, the only reason why you are adding
> it is to replace the (flow_count >= FLOW_MAX) check with a flow_alloc()
> version that can fail.
>
> But at this point, speaking of ugliness, couldn't we just have a
>
>   bool flow_may_alloc() { return flow_first_free < FLOW_MAX };
>
> the caller can use to decide to abort earlier? To me it looks so much
> simpler and more robust.

Well, we could, but there are a couple of reasons I don't love it.

The first is abstraction: this returns explicit handling of the layout
of the table to the protocol specific callers.  It's not a huge deal
right now, but once we have 4 or 5 protocols doing this, having to
change all of them if we make any tiny change to the semantics of
flow_first_free isn't great.

The other issue is that to do this (without a bunch of fairly large
and ugly temporaries) means we'd populate at least some of the fields
in flow_common before we have officially "allocated" the entry.  At
that point it becomes a bit fuzzy as to when that allocation really
occurs.  Is it when we do the FLOW_MAX test?  Is it when we write to
f.type?  Is it when we update flow_first_free?  If we fail somewhere
in the middle of that, what steps do we need to reverse?

For those reasons I prefer the scheme presented.  Fwiw, in an earlier
draft I did this differently with a "flow_prealloc()", which was
essentially the check against FLOW_MAX, then a later
flow_alloc_commit().  I thought it turned out pretty confusing
compared to the alloc/cancel approach.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
On Fri, 5 Jan 2024 20:39:50 +1100
David Gibson <david(a)gibson.dropbear.id.au> wrote:

> On Fri, Jan 05, 2024 at 09:33:35AM +0100, Stefano Brivio wrote:
>
> [...]
>
> Well, we could, but there are a couple of reasons I don't love it.
>
> The first is abstraction: this returns explicit handling of the layout
> of the table to the protocol specific callers.  It's not a huge deal
> right now, but once we have 4 or 5 protocols doing this, having to
> change all of them if we make any tiny change to the semantics of
> flow_first_free isn't great.

Hmm, I don't get the difference in terms of abstraction between
checking the return value of flow_alloc() and checking the return
value of flow_may_alloc(). In both cases the protocol handlers know
that there's a table, a function to reserve entries, and that
reserving entries might fail... and not much else.

> The other issue is that to do this (without a bunch of fairly large
> and ugly temporaries) means we'd populate at least some of the fields
> in flow_common before we have officially "allocated" the entry.  At
> that point it becomes a bit fuzzy as to when that allocation really
> occurs.  Is it when we do the FLOW_MAX test?

I would say yes -- after that we can't fail. I mean, we work with
rather constrained structures for a number of reasons, which comes
with a number of hard problems... let's at least reap the benefits of
it?

> Is it when we write to f.type?  Is it when we update flow_first_free?
> If we fail somewhere in the middle of that, what steps do we need to
> reverse?

We can't fail in the middle of it, at the moment. Of course, if the
"allocation" function changes, we might need to change the scheme. But
is it really likely? And anyway it's just a few lines in your current
version...

> For those reasons I prefer the scheme presented.  Fwiw, in an earlier
> draft I did this differently with a "flow_prealloc()", which was
> essentially the check against FLOW_MAX, then a later
> flow_alloc_commit().  I thought it turned out pretty confusing
> compared to the alloc/cancel approach.

The current flow_alloc_cancel() implementation is definitely simple
and semantically clear. What worries me a bit is that you would have
two different cases for free "blocks" of size one, depending on the
order of the events. So if we want to debug something like that and we
see a block with size one it might be a failed bind(), so a fake one,
or also not: it might be an actual block with size one.

Thinking of multithreading: defining flow_may_alloc() becomes more
complicated because the caller can't just assume the "allocation" will
succeed (as long as we don't cross an "epoll cycle" or something like
that). But we'll probably need some form of locking or userspace RCU
giving us barriers around the pair may_alloc() / alloc().

If we stick to the failing alloc(), this part is simpler, but the
interpretation of flow_first_free and block sizes becomes non-trivial.

Well, on the other hand, it's all simple enough that we can change it
as needed (for example for multithreading). If we can hope that the
new scheme is reasonably low on bugs and we'll probably never have to
guess why a block has size one, I'm fine with the failing alloc() as
well.

-- 
Stefano
On Fri, Jan 05, 2024 at 11:27:54AM +0100, Stefano Brivio wrote:On Fri, 5 Jan 2024 20:39:50 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Oh, sorry, I thought you were proposing open-coding the check against FLOW_MAX, like it is at the moment. See below for comments on a flow_may_alloc() or similar call.On Fri, Jan 05, 2024 at 09:33:35AM +0100, Stefano Brivio wrote:Hmm, I don't get the difference in terms of abstraction between checking the return value of flow_alloc() and checking the return value of flow_may_alloc().On Thu, 4 Jan 2024 21:02:19 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Well, we could, but there are a couple of reasons I don't love it. The first is abstraction: this returns explicit handling of the layout of the table to the protocol specific callers. It's not a huge deal right now, but once we have 4 or 5 protocols doing this, having to change all of them if we make any tiny change to the semantics of flow_first_free isn't great.On Tue, Jan 02, 2024 at 07:13:41PM +0100, Stefano Brivio wrote: > On Mon, 1 Jan 2024 23:01:17 +1100 > David Gibson <david(a)gibson.dropbear.id.au> wrote: > > > On Sat, Dec 30, 2023 at 11:33:04AM +0100, Stefano Brivio wrote: > > > On Thu, 28 Dec 2023 19:25:25 +0100 > > > Stefano Brivio <sbrivio(a)redhat.com> wrote: > > > > > > > > On Thu, 21 Dec 2023 17:15:49 +1100 > > > > > David Gibson <david(a)gibson.dropbear.id.au> wrote: > > > > > > > > > > [...] > > > > > > > > [...] > > > > > > > > I wonder if we really have to keep track of the number of (non-)entries > > > > in the free "block", and if we have to be explicit about the two cases. > > > > > > > > I'm trying to find out if we can simplify the whole thing with slightly > > > > different variants, for example: > > > > > > So... 
I think the version with (explicit) blocks has this fundamental > > > advantage, on deletion: > > > > > > > > + flow->f.type = FLOW_TYPE_NONE; > > > > > + /* Put it back in a length 1 free block, don't attempt to fully reverse > > > > > + * flow_alloc()s steps. This will get folded together the next time > > > > > + * flow_defer_handler runs anyway() */ > > > > > + flow->free.n = 1; > > > > > + flow->free.next = flow_first_free; > > > > > + flow_first_free = FLOW_IDX(flow); > > > > > > which is doable even without explicit blocks, but much harder to > > > follow. > > > > Remember this is not a general deletion, only a "cancel" of the most > > recent allocation. > > Oh, I thought that was only the case for this series and you would use > that as actual deletion in another pending series (which I haven't > finished reviewing yet). No. Not allowing deletion of any entry at any time is what I'm trading off to get both O(1) allocation and (effectively) O(1) deletion. > But now I'm not sure anymore why I was thinking this... > > Anyway... do we really need it, then? Can't we just mark the "failed" > flows as whatever means "closed" for a specific protocol, and clean > them up later, instead of calling cancel() right away? We could, but I'm not sure we want to. For starters, that requires protocol-specific behaviour whenever we need to back out an allocation like this. Not a big deal, since that's in protocol specific code already, but I think it's uglier than calling cancel. It also requires that the protocol specific deferred cleanup functions (e.g. tcp_flow_defer()) handle partially initialised entries. With 'cancel' we can back out just the initialisation steps we've already done (because we know where we've failed during init), then remove the entry. The deferred cleanup function only needs to deal with "complete" entries. Again, certainly possible, but IMO uglier than having 'cancel'.Okay, yes, I see now. 
Another doubt that comes to me now is: if you don't plan to use this alloc_cancel() thing anywhere else, the only reason why you are adding it is to replace the (flow_count >= FLOW_MAX) check with a flow_alloc() version that can fail. But at this point, speaking of ugliness, couldn't we just have a bool flow_may_alloc() { return flow_first_free < FLOW_MAX }; the caller can use to decide to abort earlier? To me it looks so much simpler and more robust.

In both cases the protocol handlers know that there's a table, a function to reserve entries, and that reserving entries might fail... and not much else.

Uh.. we can't fail to allocate. We can fail for other reasons.

The other issue is that to do this (without a bunch of fairly large and ugly temporaries) means we'd populate at least some of the fields in flow_common before we have officially "allocated" the entry. At that point it becomes a bit fuzzy as to when that allocation really occurs. Is it when we do the FLOW_MAX test?

I would say yes -- after that we can't fail.

I mean, we work with rather constrained structures for a number of reasons, which comes with a number of hard problems... let's at least reap the benefits of it?

I'm not sure what you're getting at here.

Yes we can, that's kind of the whole point of this.

Is it when we write to f.type? Is it when we update flow_first_free? If we fail somewhere in the middle of that, what steps do we need to reverse?

We can't fail in the middle of it, at the moment.

Of course, if the "allocation" function changes, we might need to change the scheme. But is it really likely? And anyway it's just a few lines in your current version...

The cluster of size one from cancel is still a valid free cluster that satisfies all the usual invariants, it's not "fake". It does mean that we could get two contiguous free clusters, which we wouldn't otherwise.
The idea is that they'll be merged back together on the next deferred scan, but as noted on a different subthread, that's not currently working and I'll have to fix it.

For those reasons I prefer the scheme presented. Fwiw, in an earlier draft I did this differently with a "flow_prealloc()", which was essentially the check against FLOW_MAX, then a later flow_alloc_commit(). I thought it turned out pretty confusing compared to the alloc/cancel approach.

The current flow_alloc_cancel() implementation is definitely simple and semantically clear. What worries me a bit is that you would have two different cases for free "blocks" of size one, depending on the order of the events. So if we want to debug something like that and we see a block with size one it might be a failed bind(), so a fake one, or also not: it might be an actual block with size one.

Thinking of multithreading: defining flow_may_alloc() becomes more complicated because the caller can't just assume the "allocation" will succeed (as long as we don't cross an "epoll cycle" or something like that). But we'll probably need some form of locking or userspace RCU giving us barriers around the pair may_alloc() / alloc().

For multi-threaded, I really think we'd want alloc/free semantics, not may_alloc/alloc semantics. We have to change state in some way at "may alloc" time, or something else could try to allocate the same slot. "cancel" semantics also don't make sense here, because we can no longer be confident that the alloc we did above is still the most recent alloc. So a bunch of things would need to change for multi-threading.

If we stick to the failing alloc(), this part is simpler, but the interpretation of flow_first_free and block sizes becomes non-trivial.

Again, not sure what you're getting at.

Well, on the other hand, it's all simple enough that we can change it as needed (for example for multithreading).
If we can hope that the new scheme is reasonably low on bugs and we'll probably never have to guess why a block has size one, I'm fine with the failing alloc() as well.

-- 
David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
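The alloc/cancel semantics being debated in this subthread can be modelled in a few lines of C. The sketch below is a simplified stand-in, not the passt code: the union layout, FLOW_MAX value and helpers only mirror the names in the patch. flow_alloc() takes the first slot of the first free cluster in O(1) and returns NULL when the table is full; flow_alloc_cancel() backs out only the most recent allocation by pushing a size-1 free cluster onto the list.

```c
#include <assert.h>
#include <string.h>

#define FLOW_MAX	8
#define FLOW_TYPE_NONE	0

union flow {
	struct { int type; } f;
	struct { int type; unsigned n, next; } free;
};

static union flow flowtab[FLOW_MAX];
static unsigned flow_first_free;	/* index of first free cluster */

/* Lay out the whole table as one big free cluster */
static void flow_table_init(void)
{
	flow_first_free = 0;
	flowtab[0].free.n = FLOW_MAX;
	flowtab[0].free.next = FLOW_MAX;	/* no further cluster */
}

/* O(1): take the first slot of the first free cluster, NULL if full */
static union flow *flow_alloc(void)
{
	union flow *flow = &flowtab[flow_first_free];

	if (flow_first_free >= FLOW_MAX)
		return NULL;

	assert(flow->f.type == FLOW_TYPE_NONE);

	if (flow->free.n > 1) {
		/* Shrink the cluster: its metadata moves one slot right */
		union flow *next = flow + 1;

		next->free.n = flow->free.n - 1;
		next->free.next = flow->free.next;
		flow_first_free++;
	} else {
		/* Cluster used up: move on to the next one */
		flow_first_free = flow->free.next;
	}

	memset(flow, 0, sizeof(*flow));
	return flow;
}

/* Back out the *most recent* flow_alloc() only: push a size-1 free
 * cluster onto the list.  Index order stays strictly increasing
 * because this entry sits before the current flow_first_free. */
static void flow_alloc_cancel(union flow *flow)
{
	flow->f.type = FLOW_TYPE_NONE;
	flow->free.n = 1;
	flow->free.next = flow_first_free;
	flow_first_free = (unsigned)(flow - flowtab);
}
```

With this shape, the flow_may_alloc() alternative would amount to checking flow_first_free < FLOW_MAX before touching the entry, which is why the two look equivalent until failures between "check" and "commit" enter the picture.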
On Sat, 6 Jan 2024 22:32:10 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:On Fri, Jan 05, 2024 at 11:27:54AM +0100, Stefano Brivio wrote:Yes, I meant, if we pass the FLOW_MAX check, then we can't fail to allocate -- therefore flow_may_alloc() would contain just that check.On Fri, 5 Jan 2024 20:39:50 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Oh, sorry, I thought you were proposing open-coding the check against FLOW_MAX, like it is at the moment. See below for comments on a flow_may_alloc() or similar call.On Fri, Jan 05, 2024 at 09:33:35AM +0100, Stefano Brivio wrote:Hmm, I don't get the difference in terms of abstraction between checking the return value of flow_alloc() and checking the return value of flow_may_alloc().On Thu, 4 Jan 2024 21:02:19 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote: > On Tue, Jan 02, 2024 at 07:13:41PM +0100, Stefano Brivio wrote: > > On Mon, 1 Jan 2024 23:01:17 +1100 > > David Gibson <david(a)gibson.dropbear.id.au> wrote: > > > > > On Sat, Dec 30, 2023 at 11:33:04AM +0100, Stefano Brivio wrote: > > > > On Thu, 28 Dec 2023 19:25:25 +0100 > > > > Stefano Brivio <sbrivio(a)redhat.com> wrote: > > > > > > > > > > On Thu, 21 Dec 2023 17:15:49 +1100 > > > > > > David Gibson <david(a)gibson.dropbear.id.au> wrote: > > > > > > > > > > > > [...] > > > > > > > > > > [...] > > > > > > > > > > I wonder if we really have to keep track of the number of (non-)entries > > > > > in the free "block", and if we have to be explicit about the two cases. > > > > > > > > > > I'm trying to find out if we can simplify the whole thing with slightly > > > > > different variants, for example: > > > > > > > > So... I think the version with (explicit) blocks has this fundamental > > > > advantage, on deletion: > > > > > > > > > > + flow->f.type = FLOW_TYPE_NONE; > > > > > > + /* Put it back in a length 1 free block, don't attempt to fully reverse > > > > > > + * flow_alloc()s steps. 
This will get folded together the next time > > > > > > + * flow_defer_handler runs anyway() */ > > > > > > + flow->free.n = 1; > > > > > > + flow->free.next = flow_first_free; > > > > > > + flow_first_free = FLOW_IDX(flow); > > > > > > > > which is doable even without explicit blocks, but much harder to > > > > follow. > > > > > > Remember this is not a general deletion, only a "cancel" of the most > > > recent allocation. > > > > Oh, I thought that was only the case for this series and you would use > > that as actual deletion in another pending series (which I haven't > > finished reviewing yet). > > No. Not allowing deletion of any entry at any time is what I'm > trading off to get both O(1) allocation and (effectively) O(1) > deletion. > > > But now I'm not sure anymore why I was thinking this... > > > > Anyway... do we really need it, then? Can't we just mark the "failed" > > flows as whatever means "closed" for a specific protocol, and clean > > them up later, instead of calling cancel() right away? > > We could, but I'm not sure we want to. For starters, that requires > protocol-specific behaviour whenever we need to back out an allocation > like this. Not a big deal, since that's in protocol specific code > already, but I think it's uglier than calling cancel. > > It also requires that the protocol specific deferred cleanup functions > (e.g. tcp_flow_defer()) handle partially initialised entries. With > 'cancel' we can back out just the initialisation steps we've already > done (because we know where we've failed during init), then remove the > entry. The deferred cleanup function only needs to deal with > "complete" entries. Again, certainly possible, but IMO uglier than > having 'cancel'. Okay, yes, I see now. Another doubt that comes to me now is: if you don't plan to use this alloc_cancel() thing anywhere else, the only reason why you are adding it is to replace the (flow_count >= FLOW_MAX) check with a flow_alloc() version that can fail. 
But at this point, speaking of ugliness, couldn't we just have a bool flow_may_alloc() { return flow_first_free < FLOW_MAX }; the caller can use to decide to abort earlier? To me it looks so much simpler and more robust.Well, we could, but there are a couple of reasons I don't love it. The first is abstraction: this returns explicit handling of the layout of the table to the protocol specific callers. It's not a huge deal right now, but once we have 4 or 5 protocols doing this, having to change all of them if we make any tiny change to the semantics of flow_first_free isn't great.In both cases the protocol handlers know that there's a table, a function to reserve entries, and that reserving entries might fail... and not much else.Uh.. we can't fail to allocate. We can fail for other reasons.The other issue is that to do this (without a bunch of fairly large and ugly temporaries) means we'd populate at least some of the fields in flow_common before we have officially "allocated" the entry. At that point it becomes a bit fuzzy as to when that allocation really occurs. Is it when we do the FLOW_MAX tesT?I would say yes -- after that we can't fail....I'm saying that we don't actually allocate memory, which means that we can have a simple test (on FLOW_MAX), and then we know that, at least within the handling of a single epoll event, we will be able to allocate memory, without possible races. We couldn't do that with an actual heap allocation because that's out of our control. This is one of the few aspects where using statically allocated memory only could make our lives easier (while in general we need to spend more effort for other things).I mean, we work with rather constrained structures for a number of reasons, which comes with a number of hard problems... 
let's at least reap the benefits of it?

I'm not sure what you're getting at here.

But as of this patch flow_alloc() can only fail on the FLOW_MAX check...

Yes we can, that's kind of the whole point of this.

Is it when we write to f.type? Is it when we update flow_first_free? If we fail somewhere in the middle of that, what steps do we need to reverse?

We can't fail in the middle of it, at the moment.

Okay, yes, not having contiguous free clusters is probably not a valuable invariant anyway.

Of course, if the "allocation" function changes, we might need to change the scheme. But is it really likely? And anyway it's just a few lines in your current version...

The cluster of size one from cancel is still a valid free cluster that satisfies all the usual invariants, it's not "fake". It does mean that we could get two contiguous free clusters, which we wouldn't otherwise.

For those reasons I prefer the scheme presented. Fwiw, in an earlier draft I did this differently with a "flow_prealloc()", which was essentially the check against FLOW_MAX, then a later flow_alloc_commit(). I thought it turned out pretty confusing compared to the alloc/cancel approach.

The current flow_alloc_cancel() implementation is definitely simple and semantically clear. What worries me a bit is that you would have two different cases for free "blocks" of size one, depending on the order of the events. So if we want to debug something like that and we see a block with size one it might be a failed bind(), so a fake one, or also not: it might be an actual block with size one.

The idea is that they'll be merged back together on the next deferred scan, but as noted on a different subthread, that's not currently working and I'll have to fix it.

Yes yes that part is clear to me.
I don't find it very problematic, or in any way "wrong".

By the way, another possibility would be to just go ahead and call socket() and bind(), or accept() the connection from the socket, then if we fail to "allocate" a flow we'll close the socket. That doesn't solve the synchronisation problem entirely but it makes it simpler to handle. Now, a bound socket or an accepted connection is visible to the user which is (I guess) the reason why you want to avoid this, but if we can't open/accept new connections it's getting pretty bad anyway... so should we really care? Actually, calling accept() and then close() on a socket (peer gets RST) is probably a nicer behaviour than letting a peer hang because we ran out of slots.

Thinking of multithreading: defining flow_may_alloc() becomes more complicated because the caller can't just assume the "allocation" will succeed (as long as we don't cross an "epoll cycle" or something like that). But we'll probably need some form of locking or userspace RCU giving us barriers around the pair may_alloc() / alloc().

For multi-threaded, I really think we'd want alloc/free semantics, not may_alloc/alloc semantics. We have to change state in some way at "may alloc" time, or something else could try to allocate the same slot. "cancel" semantics also don't make sense here, because we can no longer be confident that the alloc we did above is still the most recent alloc. So a bunch of things would need to change for multi-threading.

By "failing alloc()" I mean the alloc/cancel semantics, as opposed to an allocation function that can't fail.
There, as you mentioned, we can no longer be confident that the canceled allocation would be the most recent one (which changes the meaning of flow_first_free).

If we stick to the failing alloc(), this part is simpler, but the interpretation of flow_first_free and block sizes becomes non-trivial.

Again, not sure what you're getting at.

> Well, on the other hand, it's all simple enough that we can change it > as needed (for example for multithreading). If we can hope that the new > scheme is reasonably low on bugs and we'll probably never have to guess > why a block has size one, I'm fine with the failing alloc() as well.

-- 
Stefano
On Sat, Jan 06, 2024 at 02:02:01PM +0100, Stefano Brivio wrote:On Sat, 6 Jan 2024 22:32:10 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Right, ok.On Fri, Jan 05, 2024 at 11:27:54AM +0100, Stefano Brivio wrote:Yes, I meant, if we pass the FLOW_MAX check, then we can't fail to allocate -- therefore flow_may_alloc() would contain just that check.On Fri, 5 Jan 2024 20:39:50 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Oh, sorry, I thought you were proposing open-coding the check against FLOW_MAX, like it is at the moment. See below for comments on a flow_may_alloc() or similar call.On Fri, Jan 05, 2024 at 09:33:35AM +0100, Stefano Brivio wrote: > On Thu, 4 Jan 2024 21:02:19 +1100 > David Gibson <david(a)gibson.dropbear.id.au> wrote: > > > On Tue, Jan 02, 2024 at 07:13:41PM +0100, Stefano Brivio wrote: > > > On Mon, 1 Jan 2024 23:01:17 +1100 > > > David Gibson <david(a)gibson.dropbear.id.au> wrote: > > > > > > > On Sat, Dec 30, 2023 at 11:33:04AM +0100, Stefano Brivio wrote: > > > > > On Thu, 28 Dec 2023 19:25:25 +0100 > > > > > Stefano Brivio <sbrivio(a)redhat.com> wrote: > > > > > > > > > > > > On Thu, 21 Dec 2023 17:15:49 +1100 > > > > > > > David Gibson <david(a)gibson.dropbear.id.au> wrote: > > > > > > > > > > > > > > [...] > > > > > > > > > > > > [...] > > > > > > > > > > > > I wonder if we really have to keep track of the number of (non-)entries > > > > > > in the free "block", and if we have to be explicit about the two cases. > > > > > > > > > > > > I'm trying to find out if we can simplify the whole thing with slightly > > > > > > different variants, for example: > > > > > > > > > > So... I think the version with (explicit) blocks has this fundamental > > > > > advantage, on deletion: > > > > > > > > > > > > + flow->f.type = FLOW_TYPE_NONE; > > > > > > > + /* Put it back in a length 1 free block, don't attempt to fully reverse > > > > > > > + * flow_alloc()s steps. 
This will get folded together the next time > > > > > > > + * flow_defer_handler runs anyway() */ > > > > > > > + flow->free.n = 1; > > > > > > > + flow->free.next = flow_first_free; > > > > > > > + flow_first_free = FLOW_IDX(flow); > > > > > > > > > > which is doable even without explicit blocks, but much harder to > > > > > follow. > > > > > > > > Remember this is not a general deletion, only a "cancel" of the most > > > > recent allocation. > > > > > > Oh, I thought that was only the case for this series and you would use > > > that as actual deletion in another pending series (which I haven't > > > finished reviewing yet). > > > > No. Not allowing deletion of any entry at any time is what I'm > > trading off to get both O(1) allocation and (effectively) O(1) > > deletion. > > > > > But now I'm not sure anymore why I was thinking this... > > > > > > Anyway... do we really need it, then? Can't we just mark the "failed" > > > flows as whatever means "closed" for a specific protocol, and clean > > > them up later, instead of calling cancel() right away? > > > > We could, but I'm not sure we want to. For starters, that requires > > protocol-specific behaviour whenever we need to back out an allocation > > like this. Not a big deal, since that's in protocol specific code > > already, but I think it's uglier than calling cancel. > > > > It also requires that the protocol specific deferred cleanup functions > > (e.g. tcp_flow_defer()) handle partially initialised entries. With > > 'cancel' we can back out just the initialisation steps we've already > > done (because we know where we've failed during init), then remove the > > entry. The deferred cleanup function only needs to deal with > > "complete" entries. Again, certainly possible, but IMO uglier than > > having 'cancel'. > > Okay, yes, I see now. 
> > Another doubt that comes to me now is: if you don't plan to use this > alloc_cancel() thing anywhere else, the only reason why you are adding > it is to replace the (flow_count >= FLOW_MAX) check with a flow_alloc() > version that can fail. > > But at this point, speaking of ugliness, couldn't we just have a > bool flow_may_alloc() { return flow_first_free < FLOW_MAX }; the caller > can use to decide to abort earlier? To me it looks so much simpler and > more robust. Well, we could, but there are a couple of reasons I don't love it. The first is abstraction: this returns explicit handling of the layout of the table to the protocol specific callers. It's not a huge deal right now, but once we have 4 or 5 protocols doing this, having to change all of them if we make any tiny change to the semantics of flow_first_free isn't great.Hmm, I don't get the difference in terms of abstraction between checking the return value of flow_alloc() and checking the return value of flow_may_alloc().In both cases the protocol handlers know that there's a table, a function to reserve entries, and that reserving entries might fail... and not much else.Uh.. we can't fail to allocate. We can fail for other reasons.The other issue is that to do this (without a bunch of fairly large and ugly temporaries) means we'd populate at least some of the fields in flow_common before we have officially "allocated" the entry. At that point it becomes a bit fuzzy as to when that allocation really occurs. Is it when we do the FLOW_MAX tesT?I would say yes -- after that we can't fail.Ok, I understand....I'm saying that we don't actually allocate memory, which means that we can have a simple test (on FLOW_MAX), and then we know that, at least within the handling of a single epoll event, we will be able to allocate memory, without possible races. We couldn't do that with an actual heap allocation because that's out of our control. 
This is one of the few aspects where using statically allocated memory only could make our lives easier (while in general we need to spend more effort for other things).

I mean, we work with rather constrained structures for a number of reasons, which comes with a number of hard problems... let's at least reap the benefits of it?

I'm not sure what you're getting at here.

I'm not talking about allocation failure, I'm talking about the other failures in setting up a new flow. We can fail between checking FLOW_MAX and "committing" to the allocation by updating flow_first_free (or whatever). Before the point we finally commit, we do want to write some fields of the flow. That's the obvious place to put the address from accept() for example. But there are some things we mustn't touch, because it will mess up how the slot is interpreted as a free slot (type, and anything in the protocol specific fields, because that will overlap struct flow_free_cluster). I don't like having this fragile intermediate state with subtle rules about what we can and can't safely do to the flow entry. With alloc cancel, we can do whatever we like as soon as we've alloc()ed, and on failure the cancel() will restore whatever free cluster information needs to be there.

But as of this patch flow_alloc() can only fail on the FLOW_MAX check...

Yes we can, that's kind of the whole point of this.

Is it when we write to f.type? Is it when we update flow_first_free? If we fail somewhere in the middle of that, what steps do we need to reverse?

We can't fail in the middle of it, at the moment.

Right, that was my feeling.

Okay, yes, not having contiguous free clusters is probably not a valuable invariant anyway.

Of course, if the "allocation" function changes, we might need to change the scheme. But is it really likely? And anyway it's just a few lines in your current version...

The cluster of size one from cancel is still a valid free cluster that satisfies all the usual invariants, it's not "fake".
It does mean that we could get two contiguous free clusters, which we wouldn't otherwise.

For those reasons I prefer the scheme presented. Fwiw, in an earlier draft I did this differently with a "flow_prealloc()", which was essentially the check against FLOW_MAX, then a later flow_alloc_commit(). I thought it turned out pretty confusing compared to the alloc/cancel approach.

The current flow_alloc_cancel() implementation is definitely simple and semantically clear. What worries me a bit is that you would have two different cases for free "blocks" of size one, depending on the order of the events. So if we want to debug something like that and we see a block with size one it might be a failed bind(), so a fake one, or also not: it might be an actual block with size one.

We could. But (a) the may-alloc check is much cheaper than the syscalls and (b) that means we need to store all information from those syscalls temporarily before copying it into the flow table, which is kind of nasty.

The idea is that they'll be merged back together on the next deferred scan, but as noted on a different subthread, that's not currently working and I'll have to fix it.

Yes yes that part is clear to me. I don't find it very problematic, or in any way "wrong".

By the way, another possibility would be to just go ahead and call socket() and bind(), or accept() the connection from the socket, then if we fail to "allocate" a flow we'll close the socket. That doesn't solve the synchronisation problem entirely but it makes it simpler to handle.

Thinking of multithreading: defining flow_may_alloc() becomes more complicated because the caller can't just assume the "allocation" will succeed (as long as we don't cross an "epoll cycle" or something like that). But we'll probably need some form of locking or userspace RCU giving us barriers around the pair may_alloc() / alloc().

For multi-threaded, I really think we'd want alloc/free semantics, not may_alloc/alloc semantics.
We have to change state in some way at "may alloc" time, or something else could try to allocate the same slot. "cancel" semantics also don't make sense here, because we can no longer be confident that the alloc we did above is still the most recent alloc. So a bunch of things would need to change for multi-threading.

Now, a bound socket or an accepted connection is visible to the user which is (I guess) the reason why you want to avoid this, but if we can't open/accept new connections it's getting pretty bad anyway... so should we really care?

Right, that's certainly not my concern about it.

Actually, calling accept() and then close() on a socket (peer gets RST) is probably a nicer behaviour than letting a peer hang because we ran out of slots.

Maybe, maybe not. If we are persistently overloaded, then yes. However if we just have a sudden spike of short lived connections, then we could just "bounce" against the flow limit, but we'll still process those other connection attempts in a little bit.

That bit I got.

By "failing alloc()" I mean the alloc/cancel semantics, as opposed to an allocation function that can't fail.
NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
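The patch reviewed below also leans on the free-cluster metadata to keep whole-table scans cheap: on hitting a free slot, the scan jumps over the entire cluster using the count stored in its first entry, so a pass costs roughly (active flows) + (free clusters) iterations rather than FLOW_MAX. A minimal sketch, again with simplified stand-in types rather than the passt structures, and with a trivial counter standing in for the real per-flow deferred handling:

```c
#include <assert.h>

#define FLOW_MAX	8
#define FLOW_TYPE_NONE	0

union flow {
	struct { int type; } f;
	struct { int type; unsigned n, next; } free;
};

static union flow flowtab[FLOW_MAX];

/* Visit each active entry once; jump over whole free clusters using
 * the count stored in the cluster's first slot.  Returns the number
 * of active entries visited. */
static unsigned flow_scan_active(void)
{
	unsigned idx = 0, active = 0;

	while (idx < FLOW_MAX) {
		const union flow *flow = &flowtab[idx];

		if (flow->f.type == FLOW_TYPE_NONE) {
			idx += flow->free.n;	/* skip the whole cluster */
			continue;
		}

		active++;	/* per-flow deferred handling would go here */
		idx++;
	}
	return active;
}
```

This is why a stray extra cluster from a cancel costs only one additional loop iteration per scan until it is merged away, rather than degrading the scan to per-slot stepping.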
On Thu, Dec 28, 2023 at 07:25:25PM +0100, Stefano Brivio wrote:On Thu, 21 Dec 2023 17:15:49 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:I'm pretty sure it already is..Currently we always keep the flow table maximally compact: that is all the active entries are contiguous at the start of the table. Doing this sometimes requires moving an entry when one is freed. That's kind of fiddly, and potentially expensive: it requires updating the hash table for the new location, and depending on flow type, it may require EPOLL_CTL_MOD, system calls to update epoll tags with the new location too. Implement a new way of managing the flow table that doesn't ever move entries. It attempts to maintain some compactness by always using the first free slot for a new connection, and mitigates the effect of non compactness by cheaply skipping over contiguous blocks of free entries. See the "theory of operation" comment in flow.c for details. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 177 +++++++++++++++++++++++++++++++++++++-------------- flow.h | 1 + flow_table.h | 16 ++++- passt.c | 2 + tcp.c | 23 ------- tcp_conn.h | 3 - tcp_splice.c | 11 ---- 7 files changed, 146 insertions(+), 87 deletions(-) diff --git a/flow.c b/flow.c index 7b99b58..421e6b5 100644 --- a/flow.c +++ b/flow.c @@ -25,7 +25,7 @@ static_assert(ARRAY_SIZE(flow_type_str) == FLOW_NUM_TYPES, "flow_type_str[] doesn't match enum flow_type"); /* Global Flow Table */ -unsigned flow_count; +unsigned flow_first_free;As mentioned in the comment to 10/13: this should be initialised to zero.Yeah, I'm not aware of this technique being used elsewhere, though I can't rule it out, of course.union flow flowtab[FLOW_MAX]; /* Last time the flow timers ran */ @@ -49,6 +49,52 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) 
logmsg(pri, "Flow %u (%s): %s", flow_idx(f), FLOW_TYPE(f), msg); } +/**As a general comment: interesting -- I wonder if this already has a name, but I couldn't find anything similar in the literature. Perhaps it's equivalent to a flattened search tree where free/busy slots can be found by following the links.Good idea, changed.+ * DOC: Theory of Operation - allocation and freeing of flow entries"allocating and freeing flow entries" ...flows better for me.I think this comment should be before the declaration of flowtab[] -- above here, with a function in between, is not exactly where I expected to find it as I started reading the next paragraph.Fair enough, moved.Much better, thanks+ * + * Each flow takes a single slot in flowtab[]. Moving entries in that tableYou jump right away to "moving entries", we know why, but I guess any other reader would be lost at this point. In general, it took me some effort to grasp this paragraph. What about: * Flows are entries in the flowtab[]. We need to routinely scan the whole table * to perform deferred bookkeeping tasks on active entries, and sparse empty * slots waste time and worsen data locality. * * But keeping the table compacted by moving entries on deletion is fiddly: it * requires updating hash tables, and the epoll references to the flows. Instead, * we implement the compromise described below. ?Fair point. I've changed everything to the term "free cluster".+ * (which we used to do) is fiddly and possibly expensive: it requires updating + * the hash table indexing flows, and may require updating epoll data which + * references the flow by index. However, we also want to keep the active + * entries in the table compact where possible, because otherwise scanning + * through the entire table becomes more expensive. This describes the + * compromise implemented below. 
+ * + * Free blocks + * A "free block" is a contiguous sequence of unused (FLOW_TYPE_NONE) entriesThis gave me the idea that blocks are some kind of "bucket" with a fixed size. I think we should either mention that the blocks have variable size, or call them sequences, or clusters (of free slots) perhaps?I haven't seen a way to avoid it, while maintaining the properties we want, but I'm open to suggestions.+ * in flowtab. The first entry in each block contains metadata, specifically + * the number of entries in the block, and the index of the next (non + * contiguous) free block (struct flow_free_block)....it doesn't waste any space, but I don't think we actually need to store the number of "entries" (slots?)... more details below.better yet, "...immediately skip past it,".+ * + * Free block list + * flow_first_free gives the index of the first entry of the first (lowest + * index) free block. Each free block has the index of the next free block, + * or MAX_FLOW if it is the last free block. Together these form a linked + * list of free blocks, in strictly increasing order of index. + * + * Allocation + * When allocating a new flow, we always use the first entry of the first + * free block, that is, at index flow_first_free. If the block has more than + * one entry, flow_first_free is updated to the next entry, which is updated + * to represent the new smaller free block. Otherwise the free block is + * eliminated and flow_first_free is updated to the next free block. + * + * Scanning the table + * Theoretically, scanning the table requires FLOW_MAX iterations. However, + * when we encounter the start of a free block, we can immediately skip to + * its end, meaning that in practice we only need (number of activeafter its endFixed.+ * connections) + (number of free blocks) iterations. + * + * Freeing + * We can only free entries when scanning the whole flow table in + * flow_defer_handler(). 
This is what lets us maintain the fee block list ins/fee/free/That's correct, except that it's flow->free.n == 1, not 0 (we're checking for strictly > 1 above). If it's not, that's a free cluster claiming it extends past the end of the table. I've added ASSERT(flow_first_free + flow->free.n <= FLOW_MAX); Above the if to try to clarify this.+ * index sorted order. As we scan we keep track of whether the previous + * entry was in a free block or not. If so when an entry is freed (its + * deferred handler returns 'true'), we add it to that free block. Otherwise + * we create a new free block for it and chain it to the last free block we + * scanned. + */ + /** * flow_alloc() - Allocate a new flow * @@ -56,10 +102,31 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) */ union flow *flow_alloc(void) { - if (flow_count >= FLOW_MAX) + union flow *flow = &flowtab[flow_first_free]; + + if (flow_first_free >= FLOW_MAX) return NULL; - return &flowtab[flow_count++]; + ASSERT(flow->f.type == FLOW_TYPE_NONE); + ASSERT(flow->free.n >= 1); + + if (flow->free.n > 1) { + /* Use one entry from the block */ + union flow *next = &flowtab[++flow_first_free];...but we just checked (flow_first_free < FLOW_MAX) above. Now you increment flow_first_free by one. 
I *guess* that if flow_first_free is FLOW_MAX - 1, then flow->free.n should be 0, so we shouldn't end up in this case -- but it's a bit confusing.

I suppose it could, but since it's not that long I'd prefer to inline it to emphasise the fact that for the tracking scheme to work, deletions *must* be done in the context of scanning the full table.

+
+		ASSERT(FLOW_IDX(next) < FLOW_MAX);
+		ASSERT(next->f.type == FLOW_TYPE_NONE);
+		ASSERT(next->free.n == 0);
+
+		next->free.n = flow->free.n - 1;
+		next->free.next = flow->free.next;
+	} else {
+		/* Use the entire block */
+		flow_first_free = flow->free.next;
+	}

I wonder if we really have to keep track of the number of (non-)entries in the free "block", and if we have to be explicit about the two cases. I'm trying to find out if we can simplify the whole thing with slightly different variants, for example:

	struct flow_free_block {
		[...]
		unsigned next_free;
		unsigned next_busy;
	};

or:

	struct flow_free_block {
		[...]
		unsigned next_free;
	};

or even some sort of double-linked list:

	struct flow_free_block {
		[...]
		unsigned prev_free;
		unsigned next_free;
	};

but I couldn't quite find a more elegant solution yet.

+
+	memset(flow, 0, sizeof(*flow));
+	return flow;
 }
 
 /**
@@ -70,48 +137,15 @@ union flow *flow_alloc(void)
  */
 void flow_alloc_cancel(union flow *flow)
 {
-	ASSERT(FLOW_IDX(flow) == flow_count - 1);
-	memset(flow, 0, sizeof(*flow));
-	flow_count--;
-}
-
-/**
- * flow_table_compact() - Perform compaction on flow table
- * @c:		Execution context
- * @hole:	Pointer to recently closed flow
- */
-static void flow_table_compact(const struct ctx *c, union flow *hole)
-{
-	union flow *from;
-
-	if (FLOW_IDX(hole) == --flow_count) {
-		debug("flow: table compaction: maximum index was %u (%p)",
-		      FLOW_IDX(hole), (void *)hole);
-		memset(hole, 0, sizeof(*hole));
-		return;
-	}
-
-	from = flowtab + flow_count;
-	memcpy(hole, from, sizeof(*hole));
-
-	switch (from->f.type) {
-	case FLOW_TCP:
-		tcp_tap_conn_update(c, &from->tcp, &hole->tcp);
-		break;
-	case FLOW_TCP_SPLICE:
-		tcp_splice_conn_update(c, &hole->tcp_splice);
-		break;
-	default:
-		die("Unexpected %s in tcp_table_compact()",
-		    FLOW_TYPE(&from->f));
-	}
-
-	debug("flow: table compaction (%s): old index %u, new index %u, "
-	      "from: %p, to: %p",
-	      FLOW_TYPE(&from->f), FLOW_IDX(from), FLOW_IDX(hole),
-	      (void *)from, (void *)hole);
-
-	memset(from, 0, sizeof(*from));
+	ASSERT(flow_first_free > FLOW_IDX(flow));
+
+	flow->f.type = FLOW_TYPE_NONE;
+	/* Put it back in a length 1 free block, don't attempt to fully reverse
+	 * flow_alloc()s steps.
	 * This will get folded together the next time
	 * flow_defer_handler() runs anyway */
+	flow->free.n = 1;
+	flow->free.next = flow_first_free;
+	flow_first_free = FLOW_IDX(flow);
 }
 
 /**
@@ -121,18 +155,35 @@ static void flow_table_compact(const struct ctx *c, union flow *hole)
  */
 void flow_defer_handler(const struct ctx *c, const struct timespec *now)
 {
+	struct flow_free_block *free_head = NULL;
+	unsigned *last_next = &flow_first_free;
 	bool timer = false;
-	union flow *flow;
+	unsigned idx;
 
 	if (timespec_diff_ms(now, &flow_timer_run) >= FLOW_TIMER_INTERVAL) {
 		timer = true;
 		flow_timer_run = *now;
 	}
 
-	for (flow = flowtab + flow_count - 1; flow >= flowtab; flow--) {
+	for (idx = 0; idx < FLOW_MAX; idx++) {
+		union flow *flow = &flowtab[idx];
 		bool closed = false;
 
+		if (flow->f.type == FLOW_TYPE_NONE) {
+			/* Start of a free block */
+			free_head = &flow->free;
+			*last_next = idx;
+			last_next = &free_head->next;
+			/* Skip the rest of the block */
+			idx += free_head->n - 1;
+			continue;
+		}
+
+

Stray tabs.

 		switch (flow->f.type) {
+		case FLOW_TYPE_NONE:
+			closed = true;
+			break;
 		case FLOW_TCP:
 			closed = tcp_flow_defer(flow);
 			break;
@@ -146,7 +197,35 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
 			;
 	}
 
-	if (closed)
-		flow_table_compact(c, flow);
+	if (closed) {

Should this be a flow_del() function perhaps, possibly taking flow and free_head as argument?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

+			flow->f.type = FLOW_TYPE_NONE;
+
+			if (free_head) {
+				/* Add slot to current free block */
+				ASSERT(idx == FLOW_IDX(free_head) + free_head->n);
+				free_head->n++;
+				flow->free.n = flow->free.next = 0;
+			} else {
+				/* Create new free block */
+				free_head = &flow->free;
+				free_head->n = 1;
+				*last_next = idx;
+				last_next = &free_head->next;
+			}
+		} else {
+			free_head = NULL;
+		}
 	}
+
+	*last_next = FLOW_MAX;
+}
+
+/**
+ * flow_init() - Initialise flow related data structures
+ */
+void flow_init(void)
+{
+	/* Initial state is a single free block containing the whole table */
+	flowtab[0].free.n = FLOW_MAX;
+	flowtab[0].free.next = FLOW_MAX;
 }
diff --git a/flow.h b/flow.h
index 8064f0e..48a0ab4 100644
--- a/flow.h
+++ b/flow.h
@@ -68,6 +68,7 @@ static inline bool flow_sidx_eq(flow_sidx_t a, flow_sidx_t b)
 
 union flow;
 
+void flow_init(void);
 void flow_defer_handler(const struct ctx *c, const struct timespec *now);
 
 void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
diff --git a/flow_table.h b/flow_table.h
index 2773a2b..34fc679 100644
--- a/flow_table.h
+++ b/flow_table.h
@@ -9,6 +9,19 @@
 
 #include "tcp_conn.h"
 
+/**
+ * struct flow_free_block - Information about a block of free entries
+ * @f:		Generic flow information
+ * @n:		Number of entries in this free block (including this one)
+ * @next:	Index of next free block
+ */
+struct flow_free_block {
+	/* Must be first element */
+	struct flow_common f;
+	unsigned n;
+	unsigned next;
+};
+
 /**
  * union flow - Descriptor for a logical packet flow (e.g.
connection)
  * @f:	Fields common between all variants
@@ -17,12 +30,13 @@
  */
 union flow {
 	struct flow_common f;
+	struct flow_free_block free;
 	struct tcp_tap_conn tcp;
 	struct tcp_splice_conn tcp_splice;
 };
 
 /* Global Flow Table */
-extern unsigned flow_count;
+extern unsigned flow_first_free;
 extern union flow flowtab[];
 
diff --git a/passt.c b/passt.c
index 71bea8f..d315438 100644
--- a/passt.c
+++ b/passt.c
@@ -285,6 +285,8 @@ int main(int argc, char **argv)
 
 	clock_gettime(CLOCK_MONOTONIC, &now);
 
+	flow_init();
+
 	if ((!c.no_udp && udp_init(&c)) || (!c.no_tcp && tcp_init(&c)))
 		exit(EXIT_FAILURE);
 
diff --git a/tcp.c b/tcp.c
index a38deb0..6aacb56 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1250,29 +1250,6 @@ static void tcp_hash_remove(const struct ctx *c,
 	tc_hash[b] = FLOW_SIDX_NONE;
 }
 
-/**
- * tcp_tap_conn_update() - Update tcp_tap_conn when being moved in the table
- * @c:		Execution context
- * @old:	Old location of tcp_tap_conn
- * @new:	New location of tcp_tap_conn
- */
-void tcp_tap_conn_update(const struct ctx *c, struct tcp_tap_conn *old,
-			 struct tcp_tap_conn *new)
-
-{
-	unsigned b = tcp_hash_probe(c, old);
-
-	if (!flow_at_sidx(tc_hash[b]))
-		return; /* Not in hash table, nothing to update */
-
-	tc_hash[b] = FLOW_SIDX(new, TAPSIDE);
-
-	debug("TCP: hash table update: old index %u, new index %u, sock %i, "
-	      "bucket: %u", FLOW_IDX(old), FLOW_IDX(new), new->sock, b);
-
-	tcp_epoll_ctl(c, new);
-}
-
 /**
  * tcp_hash_lookup() - Look up connection given remote address and ports
  * @c:		Execution context
diff --git a/tcp_conn.h b/tcp_conn.h
index 636224e..a5f5cfe 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -155,9 +155,6 @@ struct tcp_splice_conn {
 extern int init_sock_pool4	[TCP_SOCK_POOL_SIZE];
 extern int init_sock_pool6	[TCP_SOCK_POOL_SIZE];
 
-void tcp_tap_conn_update(const struct ctx *c, struct tcp_tap_conn *old,
-			 struct tcp_tap_conn *new);
-void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new);
 bool tcp_flow_defer(union flow *flow);
 bool
tcp_splice_flow_defer(union flow *flow);
 void tcp_splice_timer(const struct ctx *c, union flow *flow);
diff --git a/tcp_splice.c b/tcp_splice.c
index 0711b19..7aae623 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -231,17 +231,6 @@ static void conn_event_do(const struct ctx *c, struct tcp_splice_conn *conn,
 	} while (0)
 
-/**
- * tcp_splice_conn_update() - Update tcp_splice_conn when being moved in the table
- * @c:		Execution context
- * @new:	New location of tcp_splice_conn
- */
-void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new)
-{
-	if (tcp_splice_epoll_ctl(c, new))
-		conn_flag(c, new, CLOSING);
-}
-
 /**
  * tcp_splice_flow_defer() - Deferred per-flow handling (clean up closed)
  * @flow:	Flow table entry for this connection
On Mon, 1 Jan 2024 21:44:54 +1100
David Gibson <david(a)gibson.dropbear.id.au> wrote:

> On Thu, Dec 28, 2023 at 07:25:25PM +0100, Stefano Brivio wrote:
> > On Thu, 21 Dec 2023 17:15:49 +1100
> > David Gibson <david(a)gibson.dropbear.id.au> wrote:
> >
> > [...]
> >
> > > void flow_defer_handler(const struct ctx *c, const struct timespec *now)
> > > {
> > > +	struct flow_free_block *free_head = NULL;
> > > +	unsigned *last_next = &flow_first_free;
> > >  	bool timer = false;
> > > -	union flow *flow;
> > > +	unsigned idx;
> > >
> > >  	if (timespec_diff_ms(now, &flow_timer_run) >= FLOW_TIMER_INTERVAL) {
> > >  		timer = true;
> > >  		flow_timer_run = *now;
> > >  	}
> > >
> > > -	for (flow = flowtab + flow_count - 1; flow >= flowtab; flow--) {
> > > +	for (idx = 0; idx < FLOW_MAX; idx++) {
> > > +		union flow *flow = &flowtab[idx];
> > >  		bool closed = false;
> > >
> > > +		if (flow->f.type == FLOW_TYPE_NONE) {
> > > +			/* Start of a free block */
> > > +			free_head = &flow->free;
> > > +			*last_next = idx;
> > > +			last_next = &free_head->next;
> > > +			/* Skip the rest of the block */
> > > +			idx += free_head->n - 1;
> > > +			continue;
> > > +		}
> > > +
> > > +
> >
> > Stray tabs.
>
> > >  		switch (flow->f.type) {
> > > +		case FLOW_TYPE_NONE:
> > > +			closed = true;
> > > +			break;

...more important than stray tabs, I noticed only now: how would we ever hit this switch case, if you're checking for this in the if clause just before?

> > >  		case FLOW_TCP:
> > >  			closed = tcp_flow_defer(flow);
> > >  			break;

-- 
Stefano
On Tue, Jan 02, 2024 at 07:13:48PM +0100, Stefano Brivio wrote:
> On Mon, 1 Jan 2024 21:44:54 +1100
> David Gibson <david(a)gibson.dropbear.id.au> wrote:
>
> > On Thu, Dec 28, 2023 at 07:25:25PM +0100, Stefano Brivio wrote:
> > > On Thu, 21 Dec 2023 17:15:49 +1100
> > > David Gibson <david(a)gibson.dropbear.id.au> wrote:
> > >
> > > [...]
> > >
> > > > void flow_defer_handler(const struct ctx *c, const struct timespec *now)
> > > > {
> > > > +	struct flow_free_block *free_head = NULL;
> > > > +	unsigned *last_next = &flow_first_free;
> > > >  	bool timer = false;
> > > > -	union flow *flow;
> > > > +	unsigned idx;
> > > >
> > > >  	if (timespec_diff_ms(now, &flow_timer_run) >= FLOW_TIMER_INTERVAL) {
> > > >  		timer = true;
> > > >  		flow_timer_run = *now;
> > > >  	}
> > > >
> > > > -	for (flow = flowtab + flow_count - 1; flow >= flowtab; flow--) {
> > > > +	for (idx = 0; idx < FLOW_MAX; idx++) {
> > > > +		union flow *flow = &flowtab[idx];
> > > >  		bool closed = false;
> > > >
> > > > +		if (flow->f.type == FLOW_TYPE_NONE) {
> > > > +			/* Start of a free block */
> > > > +			free_head = &flow->free;
> > > > +			*last_next = idx;
> > > > +			last_next = &free_head->next;
> > > > +			/* Skip the rest of the block */
> > > > +			idx += free_head->n - 1;
> > > > +			continue;
> > > > +		}
> > > > +
> > > > +
> > >
> > > Stray tabs.

Fixed.

> > > >  		switch (flow->f.type) {
> > > > +		case FLOW_TYPE_NONE:
> > > > +			closed = true;
> > > > +			break;
>
> ...more important than stray tabs, I noticed only now: how would we ever
> hit this switch case, if you're checking for this in the if clause just
> before?

Ah.. it can't.. but fixing it is a bit more complex. The intention here is that if there are multiple contiguous free clusters we'll merge them together, which this currently won't do. I'll have to take another look at that.

> > > >  		case FLOW_TCP:
> > > >  			closed = tcp_flow_defer(flow);
> > > >  			break;

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson