[PATCH v3 00/14] Introduce forwarding table
This creates a new data structure for recording our port forwarding: a table of individual forwarding rules. This is intended to replace the existing forwarding mode, map and delta structures. For now, only the delta[] array is removed. map is still used, but only for automatic forwards. Mode is still used, but only has significance during configuration. There's still a lot that can be done for flexible forwarding, but this introduces the core data structure, and does enough to fix at least one concrete defect with the current logic (bug 187). I think it would be a good point to give it a solid review and maybe merge. I'm aware that this is a very large complex series, that will be difficult to review. I think that's more or less inevitable implementing a feature as broad and complex as "flexible forwarding". If it helps at all, I'm fine if it's not reviewed all at once - I won't assume review is complete until either you say so, or I get comments or R-b on each patch. This applies on top of all 3 of my outstanding series in this area: "Allow listen functions to return fds", "Clean ups to epoll references" (v2) and "Small cleanups to splice forwarding logic". David Gibson (14): inany: Extend inany_ntop() to treat NULL as a fully unspecified address conf, fwd: Keep a table of our port forwarding configuration conf: Accurately record ifname and address for outbound forwards conf, fwd: Record "auto" port forwards in forwarding table fwd: Make space to store listening sockets in forward table ip: Add ipproto_name() function fwd, tcp, udp: Set up listening sockets based on forward table tcp, udp: Remove old auto-forwarding socket arrays conf, fwd: Check forwarding table for conflicting rules fwd: Generate auto-forward exclusions from socket fd tables flow, fwd: Consult rules table when forwarding a new flow from socket fwd: Remap ports based directly on forwarding rule fwd, tcp, udp: Add forwarding rule to listening socket epoll references flow, fwd: Optimise forwarding rule lookup using epoll ref when possible conf.c | 118 +++++++++------ flow.c | 59 ++++++-- flow_table.h | 2 +- fwd.c | 379 ++++++++++++++++++++++++++++++++++++++++++++++--- fwd.h | 68 ++++++++- icmp.c | 2 +- inany.c | 28 +++- inany.h | 1 + ip.c | 26 ++++ ip.h | 2 + tcp.c | 154 ++------------------ tcp.h | 6 +- udp.c | 147 +++---------------- udp.h | 7 +- udp_flow.c | 14 +- udp_flow.h | 2 +- udp_internal.h | 4 +- 17 files changed, 641 insertions(+), 378 deletions(-) -- 2.52.0
At present, we set up forwarding as we parse the -t, and -u options, not
keeping a persistent data structure with all the details. We do have
some information in struct fwd_ports, but it doesn't capture all the nuance
that the options do.
As a first step to generalising our forwarding model, add a table of all
the forwarding rules to struct fwd_ports. For now it covers only explicit
forwards, not automatic, and we don't do anything with it other than print
some additional debug information. We'll do more with it in future
patches.
Signed-off-by: David Gibson
-T and -U options don't allow specifying a listening address. Usually this
will listen on *%lo in the guest. However on kernels without unprivileged
SO_BINDTODEVICE that's not possible so we instead listen separately on
127.0.0.1 and ::1.
Currently that's handled at the point we actually set up the listens,
we record both address and ifname as NULL in the forwarding table
entry. That will cause trouble for future extensions we want, so
update this to accurately create the forwarding table: either a single
rule with ifname == "lo" or two rules with addresses of 127.0.0.1 and
::1.
As a bonus, this gives the user a warning if they specify an explicit
outbound forwarding on a kernel without SO_BINDTODEVICE. The existing
warning for missing SO_BINDTODEVICE incorrectly covered only the case
of -T auto or -U auto.
Signed-off-by: David Gibson
Currently the forwarding table records details of explicit port forwards,
but nothing for -[tTuU] auto. That looks a little odd on the debug output,
and will be a problem for future changes.
Extend the forward table to have rules for auto-scanned forwards,
using a new FWD_SCAN flag. For now the mechanism of auto port
forwarding isn't updated, and we will only create a single FWD_SCAN
rule per protocol and direction. We'll better integrate auto scanning
with other forward table mechanics in future.
Signed-off-by: David Gibson
Add a function to get the name of an IP protocol from its number. Usually
this would be done by getprotobynumber(), but that requires access to
/etc/protocols and might allocate. We can't do either of those once we've
self-isolated.
Signed-off-by: David Gibson
Now that we've moved listening socket management to the new forwarding
table data structure, the existing arrays of socket fds are maintained,
but never consulted. Remove them.
Signed-off-by: David Gibson
Previously we created inbound listening sockets as we parsed the forwarding
options (-t, -u) whereas outbound listening sockets were created during
{tcp,udp}_init(). Now that we have a data structure recording the full
details of the listening options we can move all listening socket creation
to {tcp,udp}_init(). This means that errors for either direction are
detected and reported the same way.
Introduce fwd_listen_sync() which synchronizes the state of listening
sockets to the forwarding rules table, both for fixed and automatic
forwards.
This does cause a change in semantics for "exclude only" port
specifications. Previously an option like -t ~6000 wouldn't cause a
fatal error, as long as we could bind at least one port. Now, it
requires at least one port for each generated rule; that is for each
of the contiguous blocks of ports the specification resolves to. With
typical ephemeral ports settings that's one port each in 1..5999,
6001..32767 and 61000..65535.
Preserving the exact behaviour for this case would require a considerably
more complex data structure, so I'm hoping this is a sufficiently niche
case for the change to be acceptable.
Signed-off-by: David Gibson
When auto-forwarding based on port scans, we must exclude our own
listening ports, to avoid circular forwards. Currently we use the
(previous value of the) forwarding bitmaps for the reverse direction
to determine that.
Instead, generate it from the tables of listening sockets that we now
maintain. For now this seems like a lot more work to get to the same
place. However, it does mean we're basing our exclusions directly on the
relevant information: which of the scanned listens belong to us. More
importantly, it's a step towards removing the bitmaps entirely.
Signed-off-by: David Gibson
At present, we don't keep track of the fds for listening sockets (except
for "auto" ones). Since the fd is stored in the epoll reference, we didn't
need an alternative source of it for the various handlers.
However, we're intending to allow dynamic changes to forwarding
configuration in future. That means we need a way to enumerate sockets so
we can close them on removal of a forward.
Extend our forwarding table data structure to make space for all the
listening sockets. To avoid allocation, this imposes another limit:
we could run out of space for socket fds before we run out of slots
for forwarding rules.
We don't actually do anything with the allocate spaced yet. For
"auto" forwards it's redundant with existing arrays. We'll fix both
of those in later patches.
Signed-off-by: David Gibson
It's possible for a user to supply conflicting forwarding parameters, e.g.
$ pasta -t 80:8080 -t 127.0.0.1/80:8888
We give a warning in this case, but it's based on the legacy
forwarding bitmaps. This is too strict, because it will also warn on
cases that shouldn't conflict because they use different addresses,
e.g.
$ pasta -t 192.0.2.1/80:8080 127.0.0.1/80:8888
Theoretically, it's also too loose because it won't take into account
auto-scan forwarding rules. We can't hit that in practice now,
because we only ever have one auto-scan rule and nothing else, but we
want to remove that restriction in future.
Replace the bitmap based check with a check based on actually scanning
the forwarding rules for conflicts.
Signed-off-by: David Gibson
We now have a formal array of forwarding rules. However, we don't actually
consult it when we forward a new flow. Instead we rely on (a) implicit
information (we wouldn't be here if there wasn't a listening socket for the
rule) and (b) the legacy delta[] data structure.
Start addressing this, by searching for a matching forwarding rule when
attempting to forward a new flow. For now this is incomplete:
* We only do this for socket-initiated flows
* We make a potentially costly linear lookup
* We don't actually use the matching rule for anything yet
We'll address each of those in later patches.
Signed-off-by: David Gibson
Currently we remap port numbers based on the legacy delta[] array, which
is indexed only by original port number, not the listening address. Now
that we look up a forwarding rule entry in flow_target(), we can use this
entry to directly determine the correct remapped port. Implement this,
and remove the old delta[] array.
Link: https://bugs.passt.top/show_bug.cgi?id=187
Signed-off-by: David Gibson
Now that we have a table of all our forwarding rules, every listening
socket can be associated with a specific rule. Add an index allowing us to
locate that rule from the socket's epoll reference. We don't use it yet,
but we'll use it to optimise rule lookup when forwarding new flows.
Signed-off-by: David Gibson
Now that listening sockets include a reference to the forwarding rule which
created them we can, in many cases, avoid a linear search of the forwarding
table when we want to find the relevant rule. Instead we can take the
rule index from the socket's epoll reference, and use that to immediately
find the correct rule.
This is conceptually simple, but requires a moderate amount of plumbing to
get the index from the reference through to the rule lookup. We still
allow fall back to linear search if we don't have the index, and this may
(rarely) be used in the udp_flush_flow() case, where we could get packets
for one flow on a different flow's socket, rather than through a listening
socket as usual.
Signed-off-by: David Gibson
On 1/8/26 03:29, David Gibson wrote:
Add a function to get the name of an IP protocol from its number. Usually this would be done by getprotobynumber(), but that requires access to /etc/protocols and might allocate. We can't do either of those once we've self-isolated.
Signed-off-by: David Gibson
--- ip.c | 27 +++++++++++++++++++++++++++ ip.h | 2 ++ 2 files changed, 29 insertions(+) diff --git a/ip.c b/ip.c index 9a7f4c54..f1d224bd 100644 --- a/ip.c +++ b/ip.c @@ -67,3 +67,30 @@ found: *proto = nh; return true; } + +/** + * ipproto_name() - Get IP protocol name from number + * @proto: IP protocol number + * + * Return: pointer to name of protocol @proto + * + * Usually this would be done with getprotobynumber(3) but that reads + * /etc/protocols and might allocate, which isn't possible for us once + * self-isolated. + */ +/* cppcheck-suppress unusedFunction */ +const char *ipproto_name(uint8_t proto) +{ + switch (proto) { + case IPPROTO_ICMP: + return "ICMP"; + case IPPROTO_TCP: + return "TCP"; + case IPPROTO_UDP: + return "UDP"; + case IPPROTO_ICMPV6: + return "ICMPv6"; + default: + return "<unknown protocol>"; + } +} diff --git a/ip.h b/ip.h index 5830b923..42b417c5 100644 --- a/ip.h +++ b/ip.h @@ -116,6 +116,7 @@ static inline uint32_t ip6_get_flow_lbl(const struct ipv6hdr *ip6h) }
bool ipv6_l4hdr(struct iov_tail *data, uint8_t *proto, size_t *dlen); +const char *ipproto_name(uint8_t proto);
/* IPv6 link-local all-nodes multicast address, ff02::1 */ static const struct in6_addr in6addr_ll_all_nodes = { @@ -135,4 +136,5 @@ static const struct in_addr in4addr_broadcast = { 0xffffffff }; #define IPV6_MIN_MTU 1280 #endif
+
Extra blank line
#endif /* IP_H */
Reviewed-by: Laurent Vivier
On Thu, Jan 08, 2026 at 02:22:21PM +0100, Laurent Vivier wrote:
On 1/8/26 03:29, David Gibson wrote:
Add a function to get the name of an IP protocol from its number. Usually this would be done by getprotobynumber(), but that requires access to /etc/protocols and might allocate. We can't do either of those once we've self-isolated.
Signed-off-by: David Gibson
--- ip.c | 27 +++++++++++++++++++++++++++ ip.h | 2 ++ 2 files changed, 29 insertions(+) diff --git a/ip.c b/ip.c index 9a7f4c54..f1d224bd 100644 --- a/ip.c +++ b/ip.c @@ -67,3 +67,30 @@ found: *proto = nh; return true; } + +/** + * ipproto_name() - Get IP protocol name from number + * @proto: IP protocol number + * + * Return: pointer to name of protocol @proto + * + * Usually this would be done with getprotobynumber(3) but that reads + * /etc/protocols and might allocate, which isn't possible for us once + * self-isolated. + */ +/* cppcheck-suppress unusedFunction */ +const char *ipproto_name(uint8_t proto) +{ + switch (proto) { + case IPPROTO_ICMP: + return "ICMP"; + case IPPROTO_TCP: + return "TCP"; + case IPPROTO_UDP: + return "UDP"; + case IPPROTO_ICMPV6: + return "ICMPv6"; + default: + return "<unknown protocol>"; + } +} diff --git a/ip.h b/ip.h index 5830b923..42b417c5 100644 --- a/ip.h +++ b/ip.h @@ -116,6 +116,7 @@ static inline uint32_t ip6_get_flow_lbl(const struct ipv6hdr *ip6h) } bool ipv6_l4hdr(struct iov_tail *data, uint8_t *proto, size_t *dlen); +const char *ipproto_name(uint8_t proto); /* IPv6 link-local all-nodes multicast address, ff02::1 */ static const struct in6_addr in6addr_ll_all_nodes = { @@ -135,4 +136,5 @@ static const struct in_addr in4addr_broadcast = { 0xffffffff }; #define IPV6_MIN_MTU 1280 #endif +
Extra blank line
Oops, fixed.
#endif /* IP_H */
Reviewed-by: Laurent Vivier
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Thu, 8 Jan 2026 13:29:36 +1100
David Gibson
At present, we set up forwarding as we parse the -t, and -u options, not keeping a persistent data structure with all the details. We do have some information in struct fwd_ports, but it doesn't capture all the nuance that the options do.
As a first step to generalising our forwarding model, add a table of all the forwarding rules to struct fwd_ports. For now it covers only explicit forwards, not automatic, and we don't do anything with it other than print some additional debug information. We'll do more with it in future patches.
Signed-off-by: David Gibson
--- conf.c | 80 ++++++++++++++++++++++++++++++------------------- fwd.c | 95 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ fwd.h | 35 +++++++++++++++++++++- 3 files changed, 179 insertions(+), 31 deletions(-) diff --git a/conf.c b/conf.c index c936664b..127e69f5 100644 --- a/conf.c +++ b/conf.c @@ -137,7 +137,7 @@ static int parse_port_range(const char *s, char **endptr, * @last: Last port to forward * @exclude: Bitmap of ports to exclude * @to: Port to translate @first to when forwarding - * @weak: Ignore errors, as long as at least one port is mapped + * @flags: Flags for forwarding entries */ static void conf_ports_range_except(const struct ctx *c, char optname, const char *optarg, struct fwd_ports *fwd, @@ -145,10 +145,11 @@ static void conf_ports_range_except(const struct ctx *c, char optname, const char *ifname, uint16_t first, uint16_t last, const uint8_t *exclude, uint16_t to, - bool weak) + uint8_t flags) { + unsigned delta = to - first; bool bound_one = false; - unsigned i; + unsigned base, i; int fd;
if (first == 0) { @@ -172,37 +173,45 @@ static void conf_ports_range_except(const struct ctx *c, char optname, } }
- for (i = first; i <= last; i++) { - if (bitmap_isset(exclude, i)) + for (base = first; base <= last; base++) { + if (bitmap_isset(exclude, base)) continue;
- if (bitmap_isset(fwd->map, i)) { - warn( -"Altering mapping of already mapped port number: %s", optarg); - } + for (i = base; i <= last; i++) {
This nested loop is a bit convoluted, but I don't have better ideas and I suppose it goes away later (I just reached 10/14 so far).
+ if (bitmap_isset(exclude, i)) + break;
- bitmap_set(fwd->map, i); - fwd->delta[i] = to - first; + if (bitmap_isset(fwd->map, i)) { + warn( +"Altering mapping of already mapped port number: %s", optarg); + }
- if (optname == 't') - fd = tcp_listen(c, PIF_HOST, addr, ifname, i); - else if (optname == 'u') - fd = udp_listen(c, PIF_HOST, addr, ifname, i); - else - /* No way to check in advance for -T and -U */ - fd = 0; + if (optname == 't') + fd = tcp_listen(c, PIF_HOST, addr, ifname, i); + else if (optname == 'u') + fd = udp_listen(c, PIF_HOST, addr, ifname, i); + else + /* No way to check in advance for -T and -U */ + fd = 0; + + if (fd == -ENFILE || fd == -EMFILE) { + die( +"Can't open enough sockets for port specifier: %s", + optarg); + }
- if (fd == -ENFILE || fd == -EMFILE) { - die("Can't open enough sockets for port specifier: %s", - optarg); + if (fd >= 0) { + bound_one = true; + } else if (!(flags & FWD_WEAK)) { + die( +"Failed to bind port %u (%s) for option '-%c %s'", + i, strerror_(-fd), optname, optarg); + } }
- if (fd >= 0) { - bound_one = true; - } else if (!weak) { - die("Failed to bind port %u (%s) for option '-%c %s'", - i, strerror_(-fd), optname, optarg); - } + fwd_rule_add(fwd, flags, addr, ifname, base, i - 1, + base + delta); + base = i - 1; }
if (!bound_one) @@ -272,7 +281,7 @@ static void conf_ports(const struct ctx *c, char optname, const char *optarg, conf_ports_range_except(c, optname, optarg, fwd, NULL, NULL, 1, NUM_PORTS - 1, exclude, - 1, true); + 1, FWD_WEAK); return; }
@@ -357,7 +366,7 @@ static void conf_ports(const struct ctx *c, char optname, const char *optarg, conf_ports_range_except(c, optname, optarg, fwd, addr, ifname, 1, NUM_PORTS - 1, exclude, - 1, true); + 1, FWD_WEAK); return; }
@@ -390,7 +399,7 @@ static void conf_ports(const struct ctx *c, char optname, const char *optarg, addr, ifname, orig_range.first, orig_range.last, exclude, - mapped_range.first, false); + mapped_range.first, 0); } while ((p = next_chunk(p, ',')));
return; @@ -1225,6 +1234,17 @@ dns6: info(" %s", c->dns_search[i].n); } } + + info("Inbound TCP forwarding:"); + fwd_rules_print(&c->tcp.fwd_in); + info("Inbound UDP forwarding:"); + fwd_rules_print(&c->udp.fwd_in); + if (c->mode == MODE_PASTA) { + info("Outbound TCP forwarding:"); + fwd_rules_print(&c->tcp.fwd_out); + info("Outbound UDP forwarding:"); + fwd_rules_print(&c->udp.fwd_out); + } }
/** diff --git a/fwd.c b/fwd.c index 961c533f..ad398e54 100644 --- a/fwd.c +++ b/fwd.c @@ -13,6 +13,7 @@ * Author: David Gibson
*/ +#include
#include #include #include @@ -22,6 +23,8 @@ #include "util.h" #include "ip.h" +#include "siphash.h" +#include "inany.h" #include "fwd.h" #include "passt.h" #include "lineread.h" @@ -300,6 +303,19 @@ parse_err: warn("Unable to parse %s", PORT_RANGE_SYSCTL); }
+/** + * fwd_rule_addr() - Return match address for a rule + * @rule: Forwarding rule + * + * Return: Rule's match address, or NULL, if it matches any address
What about: matching address for rule, NULL if it matches all addresses
+ */ +static const union inany_addr *fwd_rule_addr(const struct fwd_rule *rule) +{ + if (rule->flags & FWD_DUAL_STACK_ANY) + return NULL;
Nit: usual newline here.
+ return &rule->addr; +} + /** * fwd_port_is_ephemeral() - Is port number ephemeral? * @port: Port number @@ -313,6 +329,85 @@ bool fwd_port_is_ephemeral(in_port_t port) return (port >= fwd_ephemeral_min) && (port <= fwd_ephemeral_max); }
+/** + * fwd_rule_add() - Add a rule to a forwarding table + * @fwd: Table to add to + * @flags: Flags for this entry + * @addr: Our address to forward (NULL for both 0.0.0.0 and ::) + * @ifname: Only forward from this interface name, if non-empty + * @first: First port number to forward + * @last: Last port number to forward + * @to: First port of target port range to map to + */ +void fwd_rule_add(struct fwd_ports *fwd, uint8_t flags, + const union inany_addr *addr, const char *ifname, + in_port_t first, in_port_t last, in_port_t to) +{ + /* Flags which can be set from the caller */ + const uint8_t allowed_flags = FWD_WEAK; + struct fwd_rule *new; + unsigned port; + + ASSERT(!(flags & ~allowed_flags)); + + if (fwd->count >= ARRAY_SIZE(fwd->rules)) + die("Too many port forwarding ranges"); + + new = &fwd->rules[fwd->count++]; + new->flags = flags; + + if (addr) { + new->addr = *addr; + } else { + new->addr = inany_any6; + new->flags |= FWD_DUAL_STACK_ANY; + } + + memset(new->ifname, 0, sizeof(new->ifname)); + if (ifname) + strncpy(new->ifname, ifname, sizeof(new->ifname));
Coverity Scan complains because we copy exactly up to sizeof(new->ifname) bytes, and the terminator might be missing: /home/sbrivio/passt/fwd.c:368:3: Type: Buffer not null terminated (BUFFER_SIZE) /home/sbrivio/passt/fwd.c:351:2: 1. path: Condition "!(flags & -3 /* ~allowed_flags */)", taking true branch. /home/sbrivio/passt/fwd.c:353:2: 2. path: Condition "fwd->count >= 128U /* (int)(sizeof (fwd->rules) / sizeof (fwd->rules[0])) */", taking false branch. /home/sbrivio/passt/fwd.c:359:2: 3. path: Condition "addr", taking true branch. /home/sbrivio/passt/fwd.c:361:2: 4. path: Falling through to end of if statement. /home/sbrivio/passt/fwd.c:367:2: 5. path: Condition "ifname", taking true branch. /home/sbrivio/passt/fwd.c:368:3: 6. buffer_size_warning: Calling "strncpy" with a maximum size argument of 16 bytes on destination array "new->ifname" of size 16 bytes might leave the destination string unterminated. /home/sbrivio/passt/fwd.c:370:2:
+ + ASSERT(first <= last); + new->first = first; + new->last = last; + + new->to = to; + + for (port = new->first; port <= new->last; port++) { + /* Fill in the legacy data structures to match the table */ + bitmap_set(fwd->map, port); + fwd->delta[port] = new->to - new->first; + } +} + +/** + * fwd_rules_print() - Print forwarding rules for debugging + * @fwd: Table to print + */ +void fwd_rules_print(const struct fwd_ports *fwd) +{ + unsigned i; + + for (i = 0; i < fwd->count; i++) { + const struct fwd_rule *rule = &fwd->rules[i]; + const char *weak = rule->flags & FWD_WEAK ? " WEAK" : "";
Should we print " might fail" or " can fail" instead of " WEAK"? This is for users.
+ const char *percent = *rule->ifname ? "%" : ""; + char addr[INANY_ADDRSTRLEN]; + + inany_ntop(fwd_rule_addr(rule), addr, sizeof(addr)); + + if (rule->first == rule->last) { + info(" [%s]%s%s:%hu => %hu %s", + addr, percent, rule->ifname, + rule->first, rule->to, weak); + } else { + info(" [%s]%s%s:%hu-%hu => %hu-%hu %s", + addr, percent, rule->ifname, rule->first, rule->last, + rule->to, rule->last - rule->first + rule->to, weak); + } + } +} + /* See enum in kernel's include/net/tcp_states.h */ #define UDP_LISTEN 0x07 #define TCP_LISTEN 0x0a diff --git a/fwd.h b/fwd.h index 934bab33..3dfc7959 100644 --- a/fwd.h +++ b/fwd.h @@ -16,6 +16,30 @@ struct flowside; void fwd_probe_ephemeral(void); bool fwd_port_is_ephemeral(in_port_t port);
+/** + * struct fwd_rule - Forwarding rule governing a range of ports + * @addr: Address to forward from + * @ifname: Interface to forward from + * @first: First port number to forward + * @last: Last port number to forward + * @to: Port number to forward port @first to. + * @flags: Flag mask + * FWD_DUAL_STACK_ANY - match any IPv4 or IPv6 address (@addr should be ::) + * FWD_WEAK - Don't give an error if binds fail for some forwards
Usually we document bit flags inline as we define them (see struct tcp_tap_conn), I guess it's a bit easier to follow, but this has the advantage of looking less cramped (unless we add comment lines before the #defines, which would also be reasonable I think). Not a strong preference either way.
+ * + * FIXME: @addr and @ifname currently ignored for outbound tables + */ +struct fwd_rule { + union inany_addr addr; + char ifname[IFNAMSIZ]; + in_port_t first, last, to;
Having one line for each helps (me at least) visualising the size of the struct. Not a strong preference.
+#define FWD_DUAL_STACK_ANY BIT(0) +#define FWD_WEAK BIT(1) + uint8_t flags; +}; + +#define MAX_FWD_RULES 128 + /** * union fwd_listen_ref - information about a single listening socket * @port: Bound port number of the socket @@ -44,6 +68,8 @@ enum fwd_ports_mode { * @mode: Overall forwarding mode (all, none, auto, specific ports) * @scan4: /proc/net fd to scan for IPv4 ports when in AUTO mode * @scan6: /proc/net fd to scan for IPv6 ports when in AUTO mode + * @count: Number of forwarding rules + * @rules: Array of forwarding rules * @map: Bitmap describing which ports are forwarded * @delta: Offset between the original destination and mapped port number */ @@ -51,14 +77,21 @@ struct fwd_ports { enum fwd_ports_mode mode; int scan4; int scan6; + unsigned count; + struct fwd_rule rules[MAX_FWD_RULES]; uint8_t map[PORT_BITMAP_SIZE]; in_port_t delta[NUM_PORTS]; };
#define FWD_PORT_SCAN_INTERVAL 1000 /* ms */
+void fwd_rule_add(struct fwd_ports *fwd, uint8_t flags, + const union inany_addr *addr, const char *ifname, + in_port_t first, in_port_t last, in_port_t to); +void fwd_rules_print(const struct fwd_ports *fwd); + void fwd_scan_ports_init(struct ctx *c); -void fwd_scan_ports_timer(struct ctx *c, const struct timespec *now); +void fwd_scan_ports_timer(struct ctx * c, const struct timespec *now);
bool nat_inbound(const struct ctx *c, const union inany_addr *addr, union inany_addr *translated);
-- Stefano
On Thu, 8 Jan 2026 13:29:38 +1100
David Gibson
Currently the forwarding table records details of explicit port forwards, but nothing for -[tTuU] auto. That looks a little odd on the debug output, and will be a problem for future changes.
Extend the forward table to have rules for auto-scanned forwards, using a new FWD_SCAN flag. For now the mechanism of auto port forwarding isn't updated, and we will only create a single FWD_SCAN rule per protocol and direction. We'll better integrate auto scanning with other forward table mechanics in future.
Signed-off-by: David Gibson
--- conf.c | 31 ++++++++++++++++++++++++++----- fwd.c | 18 +++++++++++------- fwd.h | 2 ++ 3 files changed, 39 insertions(+), 12 deletions(-) diff --git a/conf.c b/conf.c index b486fefe..0bcf80d7 100644 --- a/conf.c +++ b/conf.c @@ -135,7 +135,7 @@ static int parse_port_range(const char *s, char **endptr, * @ifname: Listening interface * @first: First port to forward * @last: Last port to forward - * @exclude: Bitmap of ports to exclude + * @exclude: Bitmap of ports to exclude (may be NULL) * @to: Port to translate @first to when forwarding * @flags: Flags for forwarding entries */ @@ -168,11 +168,11 @@ static void conf_ports_range_except(const struct ctx *c, char optname, }
for (base = first; base <= last; base++) { - if (bitmap_isset(exclude, base)) + if (exclude && bitmap_isset(exclude, base)) continue;
for (i = base; i <= last; i++) { - if (bitmap_isset(exclude, i)) + if (exclude && bitmap_isset(exclude, i)) break;
if (bitmap_isset(fwd->map, i)) { @@ -180,9 +180,9 @@ static void conf_ports_range_except(const struct ctx *c, char optname, "Altering mapping of already mapped port number: %s", optarg); }
- if (optname == 't') + if (!(flags & FWD_SCAN) && optname == 't') fd = tcp_listen(c, PIF_HOST, addr, ifname, i); - else if (optname == 'u') + else if (!(flags & FWD_SCAN) && optname == 'u') fd = udp_listen(c, PIF_HOST, addr, ifname, i); else /* No way to check in advance for -T and -U */ @@ -2202,6 +2202,27 @@ void conf(struct ctx *c, int argc, char **argv) if (!c->udp.fwd_out.mode) c->udp.fwd_out.mode = fwd_default;
+ if (c->tcp.fwd_in.mode == FWD_AUTO) { + conf_ports_range_except(c, 't', "auto", &c->tcp.fwd_in, + NULL, NULL, 1, NUM_PORTS - 1, + NULL, 1, FWD_SCAN); + } + if (c->tcp.fwd_out.mode == FWD_AUTO) { + conf_ports_range_except(c, 'T', "auto", &c->tcp.fwd_out, + NULL, "lo", 1, NUM_PORTS - 1, + NULL, 1, FWD_SCAN); + } + if (c->udp.fwd_in.mode == FWD_AUTO) { + conf_ports_range_except(c, 'u', "auto", &c->udp.fwd_in, + NULL, NULL, 1, NUM_PORTS - 1, + NULL, 1, FWD_SCAN); + } + if (c->udp.fwd_out.mode == FWD_AUTO) { + conf_ports_range_except(c, 'U', "auto", &c->udp.fwd_out, + NULL, "lo", 1, NUM_PORTS - 1, + NULL, 1, FWD_SCAN); + } + if (!c->quiet) conf_print(c); } diff --git a/fwd.c b/fwd.c index ad398e54..69aca441 100644 --- a/fwd.c +++ b/fwd.c @@ -344,7 +344,7 @@ void fwd_rule_add(struct fwd_ports *fwd, uint8_t flags, in_port_t first, in_port_t last, in_port_t to) { /* Flags which can be set from the caller */ - const uint8_t allowed_flags = FWD_WEAK; + const uint8_t allowed_flags = FWD_WEAK | FWD_SCAN; struct fwd_rule *new; unsigned port;
@@ -375,7 +375,8 @@ void fwd_rule_add(struct fwd_ports *fwd, uint8_t flags,
for (port = new->first; port <= new->last; port++) { /* Fill in the legacy data structures to match the table */ - bitmap_set(fwd->map, port); + if (!(new->flags & FWD_SCAN)) + bitmap_set(fwd->map, port); fwd->delta[port] = new->to - new->first; } } @@ -391,19 +392,22 @@ void fwd_rules_print(const struct fwd_ports *fwd) for (i = 0; i < fwd->count; i++) { const struct fwd_rule *rule = &fwd->rules[i]; const char *weak = rule->flags & FWD_WEAK ? " WEAK" : ""; + const char *scan = rule->flags & FWD_SCAN ? " AUTO" : ""; const char *percent = *rule->ifname ? "%" : ""; char addr[INANY_ADDRSTRLEN];
inany_ntop(fwd_rule_addr(rule), addr, sizeof(addr));
if (rule->first == rule->last) { - info(" [%s]%s%s:%hu => %hu %s", + info(" [%s]%s%s:%hu => %hu %s%s", addr, percent, rule->ifname, - rule->first, rule->to, weak); + rule->first, rule->to, weak, scan); } else { - info(" [%s]%s%s:%hu-%hu => %hu-%hu %s", - addr, percent, rule->ifname, rule->first, rule->last, - rule->to, rule->last - rule->first + rule->to, weak); + info(" [%s]%s%s:%hu-%hu => %hu-%hu %s%s", + addr, percent, rule->ifname, + rule->first, rule->last, + rule->to, rule->last - rule->first + rule->to, + weak, scan); } } } diff --git a/fwd.h b/fwd.h index 3dfc7959..94869c2a 100644 --- a/fwd.h +++ b/fwd.h @@ -26,6 +26,7 @@ bool fwd_port_is_ephemeral(in_port_t port); * @flags: Flag mask * FWD_DUAL_STACK_ANY - match any IPv4 or IPv6 address (@addr should be ::) * FWD_WEAK - Don't give an error if binds fail for some forwards + * FWD_SCAN - Only forward if we scan a listener on the target
Nit: I guess this could be slightly more descriptive than "scan a listener", mentioning for example "if the corresponding port is bound on the target".
* * FIXME: @addr and @ifname currently ignored for outbound tables */ @@ -35,6 +36,7 @@ struct fwd_rule { in_port_t first, last, to; #define FWD_DUAL_STACK_ANY BIT(0) #define FWD_WEAK BIT(1) +#define FWD_SCAN BIT(2) uint8_t flags; };
-- Stefano
On Thu, 8 Jan 2026 13:29:39 +1100
David Gibson
At present, we don't keep track of the fds for listening sockets (except for "auto" ones). Since the fd is stored in the epoll reference, we didn't need an alternative source of it for the various handlers.
However, we're intending to allow dynamic changes to forwarding configuration in future. That means we need a way to enumerate sockets so we can close them on removal of a forward.
Extend our forwarding table data structure to make space for all the listening sockets. To avoid allocation, this imposes another limit: we could run out of space for socket fds before we run out of slots for forwarding rules.
We don't actually do anything with the allocate spaced yet. For "auto" forwards it's redundant with existing arrays. We'll fix both of those in later patches.
Signed-off-by: David Gibson
--- fwd.c | 10 +++++++++- fwd.h | 13 +++++++++++++ 2 files changed, 22 insertions(+), 1 deletion(-) diff --git a/fwd.c b/fwd.c index 69aca441..f27a4220 100644 --- a/fwd.c +++ b/fwd.c @@ -345,6 +345,7 @@ void fwd_rule_add(struct fwd_ports *fwd, uint8_t flags, { /* Flags which can be set from the caller */ const uint8_t allowed_flags = FWD_WEAK | FWD_SCAN; + unsigned num = (unsigned)last - first + 1; struct fwd_rule *new; unsigned port;
@@ -352,6 +353,8 @@ void fwd_rule_add(struct fwd_ports *fwd, uint8_t flags,
if (fwd->count >= ARRAY_SIZE(fwd->rules)) die("Too many port forwarding ranges"); + if ((fwd->listen_sock_count + num) > ARRAY_SIZE(fwd->listen_socks)) + die("Too many listening sockets");
Here, and above: we plan to trigger this from a client at runtime, and if there are too many listening sockets, or too many rules/ranges, we should fail and report failure (ENOBUFS I guess) instead of quitting.
new = &fwd->rules[fwd->count++]; new->flags = flags; @@ -373,8 +376,13 @@ void fwd_rule_add(struct fwd_ports *fwd, uint8_t flags,
new->to = to;
+ new->socks = &fwd->listen_socks[fwd->listen_sock_count]; + fwd->listen_sock_count += num; + for (port = new->first; port <= new->last; port++) { - /* Fill in the legacy data structures to match the table */ + new->socks[port - new->first] = -1;
It's probably saner to initialise these, but just to confirm my understanding: this is not needed as (unlike what you have for rules) the array is always compacted and its used length described, right?
+ + /* Fill in the legacy forwarding data structures to match the table */ if (!(new->flags & FWD_SCAN)) bitmap_set(fwd->map, port); fwd->delta[port] = new->to - new->first; diff --git a/fwd.h b/fwd.h index 94869c2a..3ddcb91d 100644 --- a/fwd.h +++ b/fwd.h @@ -23,6 +23,7 @@ bool fwd_port_is_ephemeral(in_port_t port); * @first: First port number to forward * @last: Last port number to forward * @to: Port number to forward port @first to. + * @socks: Array of listening sockets for this entry * @flags: Flag mask * FWD_DUAL_STACK_ANY - match any IPv4 or IPv6 address (@addr should be ::) * FWD_WEAK - Don't give an error if binds fail for some forwards @@ -34,6 +35,7 @@ struct fwd_rule { union inany_addr addr; char ifname[IFNAMSIZ]; in_port_t first, last, to; + int *socks; #define FWD_DUAL_STACK_ANY BIT(0) #define FWD_WEAK BIT(1) #define FWD_SCAN BIT(2) @@ -65,6 +67,13 @@ enum fwd_ports_mode {
#define PORT_BITMAP_SIZE DIV_ROUND_UP(NUM_PORTS, 8)
+/* Maximum number of listening sockets (per pif & protocol) + * + * Rationale: This lets us listen on every port for two addresses (which we need + * for -T auto without SO_BINDTODEVICE), plus a comfortable number of extras. + */ +#define MAX_LISTEN_SOCKS (NUM_PORTS * 3) + /** * fwd_ports() - Describes port forwarding for one protocol and direction * @mode: Overall forwarding mode (all, none, auto, specific ports) @@ -74,6 +83,8 @@ enum fwd_ports_mode { * @rules: Array of forwarding rules * @map: Bitmap describing which ports are forwarded * @delta: Offset between the original destination and mapped port number + * @listen_sock_count: Number of entries used in @listen_socks + * @listen_socks: Listening sockets for forwarding
To keep those aligned: /** * fwd_ports() - Describes port forwarding for one protocol and direction * @mode: Overall forwarding mode (all, none, auto, some ports) * @scan4: /proc/net fd to scan for IPv4 ports when in AUTO mode * @scan6: /proc/net fd to scan for IPv6 ports when in AUTO mode * @count: Number of forwarding rules * @rules: Array of forwarding rules * @map: Bitmap describing which ports are forwarded * @delta: Offset between original and mapped destination port * @listen_sock_count: Number of entries used in @listen_socks * @listen_socks: Listening sockets for forwarding */
*/ struct fwd_ports { enum fwd_ports_mode mode; @@ -83,6 +94,8 @@ struct fwd_ports { struct fwd_rule rules[MAX_FWD_RULES]; uint8_t map[PORT_BITMAP_SIZE]; in_port_t delta[NUM_PORTS]; + unsigned listen_sock_count; + int listen_socks[MAX_LISTEN_SOCKS]; };
#define FWD_PORT_SCAN_INTERVAL 1000 /* ms */
-- Stefano
On Thu, 8 Jan 2026 13:29:41 +1100
David Gibson
Previously we created inbound listening sockets as we parsed the forwarding options (-t, -u) whereas outbound listening sockets were created during {tcp,udp}_init(). Now that we have a data structure recording the full details of the listening options we can move all listening socket creation to {tcp,udp}_init(). This means that errors for either direction are detected and reported the same way.
Introduce fwd_listen_sync() which synchronizes the state of listening sockets to the forwarding rules table, both for fixed and automatic forwards.
This does cause a change in semantics for "exclude only" port specifications. Previously an option like -t ~6000 wouldn't cause a fatal error, as long as we could bind at least one port. Now, it requires at least one port for each generated rule; that is for each of the contiguous blocks of ports the specification resolves to. With typical ephemeral ports settings that's one port each in 1..5999, 6001..32767 and 61000..65535.
Preserving the exact behaviour for this case would require a considerably more complex data structure, so I'm hoping this is a sufficiently niche case for the change to be acceptable.
I guess so too, I wouldn't really worry. Well, I'm not sure if it works, but one relatively simple idea could be to have a "with_prev" bit in the rule struct representing the fact that the current rule was derived from the same port specification as the previous rule, which implies they would need to be deleted all together (but we can happily enforce that). Then, in the fwd_listen_sync_() loop, before reporting failure, you would check the next entry: if the "with_prev" bit is set, report failure only if we fail (keeping a local boolean flag) for all the entries up to the first one with "with_prev" unset. I would be inclined to say it's worth it if it's that simple, but I haven't tried, so I might be very well missing something.
Signed-off-by: David Gibson
--- conf.c | 27 ---------- fwd.c | 167 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-- fwd.h | 3 ++ ip.c | 1 - tcp.c | 122 ++--------------------------------------- tcp.h | 1 - udp.c | 99 +++------------------------------- udp.h | 1 - 8 files changed, 177 insertions(+), 244 deletions(-) diff --git a/conf.c b/conf.c index 0bcf80d7..57693b3f 100644 --- a/conf.c +++ b/conf.c @@ -148,9 +148,7 @@ static void conf_ports_range_except(const struct ctx *c, char optname, uint8_t flags) { unsigned delta = to - first; - bool bound_one = false; unsigned base, i; - int fd;
if (first == 0) { die("Can't forward port 0 for option '-%c %s'", @@ -179,28 +177,6 @@ static void conf_ports_range_except(const struct ctx *c, char optname, warn( "Altering mapping of already mapped port number: %s", optarg); } - - if (!(flags & FWD_SCAN) && optname == 't') - fd = tcp_listen(c, PIF_HOST, addr, ifname, i); - else if (!(flags & FWD_SCAN) && optname == 'u') - fd = udp_listen(c, PIF_HOST, addr, ifname, i); - else - /* No way to check in advance for -T and -U */ - fd = 0; - - if (fd == -ENFILE || fd == -EMFILE) { - die( -"Can't open enough sockets for port specifier: %s", - optarg); - } - - if (fd >= 0) { - bound_one = true; - } else if (!(flags & FWD_WEAK)) { - die( -"Failed to bind port %u (%s) for option '-%c %s'", - i, strerror_(-fd), optname, optarg); - } }
if ((optname == 'T' || optname == 'U') && c->no_bindtodevice) { @@ -226,9 +202,6 @@ static void conf_ports_range_except(const struct ctx *c, char optname, } base = i - 1; } - - if (!bound_one) - die("Failed to bind any port for '-%c %s'", optname, optarg); }
/** diff --git a/fwd.c b/fwd.c index f27a4220..70ef73a3 100644 --- a/fwd.c +++ b/fwd.c @@ -22,6 +22,7 @@ #include
#include "util.h" +#include "epoll_ctl.h" #include "ip.h" #include "siphash.h" #include "inany.h" @@ -420,6 +421,160 @@ void fwd_rules_print(const struct fwd_ports *fwd) } }
+/** fwd_sync_one() - Create or remove listening sockets for a forward entry + * @c: Execution context + * @rule: Forwarding rule + * @pif: Interface to create listening sockets for + * @proto: Protocol to listen for + * @scanmap: Bitmap of ports to listen for on FWD_SCAN entries + * + * Return: 0 on success, -1 on failure + */ +static int fwd_sync_one(const struct ctx *c, const struct fwd_rule *rule, + uint8_t pif, uint8_t proto, const uint8_t *scanmap) +{ + const union inany_addr *addr = fwd_rule_addr(rule); + const char *ifname = rule->ifname; + bool bound_one = false; + unsigned port; + + ASSERT(pif_is_socket(pif)); + + if (!*ifname) + ifname = NULL; + + for (port = rule->first; port <= rule->last; port++) { + int fd = rule->socks[port - rule->first]; + + if ((rule->flags & FWD_SCAN) && !bitmap_isset(scanmap, port)) { + /* We don't want to listen on this port */ + if (fd >= 0) { + /* We already are, so stop */ + epoll_del(c->epollfd, fd); + close(fd); + rule->socks[port - rule->first] = -1; + } + continue; + } + + if (fd >= 0) /* Already listening, nothing to do */ { + bound_one = true; + continue; + } + + if (proto == IPPROTO_TCP) + fd = tcp_listen(c, pif, addr, ifname, port); + else if (proto == IPPROTO_UDP) + fd = udp_listen(c, pif, addr, ifname, port); + else + ASSERT(0); + + if (fd < 0) { + char astr[INANY_ADDRSTRLEN] = "";
Should we perhaps make this "*" for consistency with fwd_rules_print()?
+ + if (addr) + inany_ntop(addr, astr, sizeof(astr)); + + warn("Listen failed for %s %s port %s%s%s%s%u: %s", + pif_name(pif), ipproto_name(proto), + astr, ifname ? "%" : "", ifname ? ifname : "", + addr || ifname ? "/" : "", port, strerror_(-fd)); + + if (!(rule->flags & FWD_WEAK)) + return -1; + + continue; + } + + rule->socks[port - rule->first] = fd; + bound_one = true; + } + + if (!bound_one && !(rule->flags & FWD_SCAN)) { + char astr[INANY_ADDRSTRLEN] = "";
Same here.
+ + if (addr) + inany_ntop(addr, astr, sizeof(astr)); + + warn("All listens failed for %s %s %s%s%s%s%u-%u", + pif_name(pif), ipproto_name(proto), + astr, ifname ? "%" : "", ifname ? ifname : "", + addr || ifname ? "/" : "", rule->first, rule->last); + return -1; + } + + return 0; +} + +/** struct fwd_listen_args - arguments for fwd_listen_init_() + * @c: Execution context + * @fwd: Forwarding information + * @scanmap: Bitmap of ports to auto-forward + * @pif: Interface to create listening sockets for + * @proto: Protocol + * @ret: Return code + */ +struct fwd_listen_args { + const struct ctx *c; + const struct fwd_ports *fwd; + const uint8_t *scanmap; + uint8_t pif; + uint8_t proto; + int ret; +}; + +/** fwd_listen_sync_() - Update listening sockets to match forwards + * @arg: struct fwd_listen_args with arguments + * + * Returns: zero + */ +static int fwd_listen_sync_(void *arg) +{ + struct fwd_listen_args *a = arg; + unsigned i; + + if (a->pif == PIF_SPLICE) + ns_enter(a->c); + + for (i = 0; i < a->fwd->count; i++) { + a->ret = fwd_sync_one(a->c, &a->fwd->rules[i], + a->pif, a->proto, a->fwd->map); + if (a->ret < 0) + break; + } + + return 0; +} + +/** fwd_listen_sync() - Update listening sockets to match forwards
This has the same description as fwd_listen_sync_() and it might be quite hard to understand the difference if one is not used to spot the "void *arg" argument. What about: /** fwd_listen_sync() - Call fwd_listen_sync_() in the intended namespace ?
+ * @c: Execution context + * @fwd: Forwarding information + * @pif: Interface to create listening sockets for + * @proto: Protocol + * + * Return: 0 on success, -1 on failure + */ +int fwd_listen_sync(const struct ctx *c, const struct fwd_ports *fwd, + uint8_t pif, uint8_t proto) +{ + struct fwd_listen_args a = { + .c = c, .fwd = fwd, .pif = pif, .proto = proto, + }; + + if (pif == PIF_SPLICE) + NS_CALL(fwd_listen_sync_, &a); + else + fwd_listen_sync_(&a); + + if (a.ret < 0) { + err("Couldn't listen on requested %s ports", + ipproto_name(proto)); + return -1; + } + + return 0; +} + /* See enum in kernel's include/net/tcp_states.h */ #define UDP_LISTEN 0x07 #define TCP_LISTEN 0x0a @@ -578,10 +733,14 @@ void fwd_scan_ports_timer(struct ctx *c, const struct timespec *now)
fwd_scan_ports(c);
- if (!c->no_tcp) - tcp_port_rebind_all(c); - if (!c->no_udp) - udp_port_rebind_all(c); + if (!c->no_tcp) { + fwd_listen_sync(c, &c->tcp.fwd_in, PIF_HOST, IPPROTO_TCP); + fwd_listen_sync(c, &c->tcp.fwd_out, PIF_SPLICE, IPPROTO_TCP); + } + if (!c->no_udp) { + fwd_listen_sync(c, &c->udp.fwd_in, PIF_HOST, IPPROTO_UDP); + fwd_listen_sync(c, &c->udp.fwd_out, PIF_SPLICE, IPPROTO_UDP); + } }
/** diff --git a/fwd.h b/fwd.h index 3ddcb91d..f84e7c01 100644 --- a/fwd.h +++ b/fwd.h @@ -108,6 +108,9 @@ void fwd_rules_print(const struct fwd_ports *fwd); void fwd_scan_ports_init(struct ctx *c); void fwd_scan_ports_timer(struct ctx * c, const struct timespec *now);
+int fwd_listen_sync(const struct ctx *c, const struct fwd_ports *fwd, + uint8_t pif, uint8_t proto); + bool nat_inbound(const struct ctx *c, const union inany_addr *addr, union inany_addr *translated); uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto, diff --git a/ip.c b/ip.c index f1d224bd..fc26dab2 100644 --- a/ip.c +++ b/ip.c @@ -78,7 +78,6 @@ found: * /etc/protocols and might allocate, which isn't possible for us once * self-isolated. */ -/* cppcheck-suppress unusedFunction */ const char *ipproto_name(uint8_t proto) { switch (proto) { diff --git a/tcp.c b/tcp.c index 57faed4b..976f0ab7 100644 --- a/tcp.c +++ b/tcp.c @@ -2732,50 +2732,6 @@ int tcp_listen(const struct ctx *c, uint8_t pif, return s; }
-/** - * tcp_ns_listen() - Init socket to listen for spliced outbound connections - * @c: Execution context - * @port: Port, host order - */ -static void tcp_ns_listen(const struct ctx *c, in_port_t port) -{ - ASSERT(!c->no_tcp); - - if (!c->no_bindtodevice) { - tcp_listen(c, PIF_SPLICE, NULL, "lo", port); - return; - } - - if (c->ifi4) - tcp_listen(c, PIF_SPLICE, &inany_loopback4, NULL, port); - if (c->ifi6) - tcp_listen(c, PIF_SPLICE, &inany_loopback6, NULL, port); -} - -/** - * tcp_ns_socks_init() - Bind sockets in namespace for outbound connections - * @arg: Execution context - * - * Return: 0 - */ -/* cppcheck-suppress [constParameterCallback, unmatchedSuppression] */ -static int tcp_ns_socks_init(void *arg) -{ - const struct ctx *c = (const struct ctx *)arg; - unsigned port; - - ns_enter(c); - - for (port = 0; port < NUM_PORTS; port++) { - if (!bitmap_isset(c->tcp.fwd_out.map, port)) - continue; - - tcp_ns_listen(c, port); - } - - return 0; -} - /** * tcp_sock_refill_pool() - Refill one pool of pre-opened sockets * @pool: Pool of sockets to refill @@ -2919,10 +2875,13 @@ int tcp_init(struct ctx *c)
tcp_sock_refill_init(c);
+ if (fwd_listen_sync(c, &c->tcp.fwd_in, PIF_HOST, IPPROTO_TCP) < 0) + return -1;
This needs an update to the function comment (which currently says "Return: 0, doesn't return on failure"). * Return: 0, doesn't return on failure
if (c->mode == MODE_PASTA) { tcp_splice_init(c); - - NS_CALL(tcp_ns_socks_init, c); + if (fwd_listen_sync(c, &c->tcp.fwd_out, + PIF_SPLICE, IPPROTO_TCP) < 0) + return -1; }
peek_offset_cap = (!c->ifi4 || tcp_probe_peek_offset_cap(AF_INET)) && @@ -2941,77 +2900,6 @@ int tcp_init(struct ctx *c) return 0; }
-/** - * tcp_port_rebind() - Rebind ports to match forward maps - * @c: Execution context - * @outbound: True to remap outbound forwards, otherwise inbound - * - * Must be called in namespace context if @outbound is true. - */ -static void tcp_port_rebind(struct ctx *c, bool outbound) -{ - const uint8_t *fmap = outbound ? c->tcp.fwd_out.map : c->tcp.fwd_in.map; - int (*socks)[IP_VERSIONS] = outbound ? tcp_sock_ns : tcp_sock_init_ext; - unsigned port; - - for (port = 0; port < NUM_PORTS; port++) { - if (!bitmap_isset(fmap, port)) { - if (socks[port][V4] >= 0) { - close(socks[port][V4]); - socks[port][V4] = -1; - } - - if (socks[port][V6] >= 0) { - close(socks[port][V6]); - socks[port][V6] = -1; - } - - continue; - } - - if ((c->ifi4 && socks[port][V4] == -1) || - (c->ifi6 && socks[port][V6] == -1)) { - if (outbound) - tcp_ns_listen(c, port); - else - tcp_listen(c, PIF_HOST, NULL, NULL, port); - } - } -} - -/** - * tcp_port_rebind_outbound() - Rebind ports in namespace - * @arg: Execution context - * - * Called with NS_CALL() - * - * Return: 0 - */ -static int tcp_port_rebind_outbound(void *arg) -{ - struct ctx *c = (struct ctx *)arg; - - ns_enter(c); - tcp_port_rebind(c, true); - - return 0; -} - -/** - * tcp_port_rebind_all() - Rebind ports to match forward maps (in host & ns) - * @c: Execution context - */ -void tcp_port_rebind_all(struct ctx *c) -{ - ASSERT(c->mode == MODE_PASTA && !c->no_tcp); - - if (c->tcp.fwd_out.mode == FWD_AUTO) - NS_CALL(tcp_port_rebind_outbound, c); - - if (c->tcp.fwd_in.mode == FWD_AUTO) - tcp_port_rebind(c, false); -} - /** * tcp_timer() - Periodic tasks: port detection, closed connections, pool refill * @c: Execution context diff --git a/tcp.h b/tcp.h index ef1e3544..45f97d93 100644 --- a/tcp.h +++ b/tcp.h @@ -22,7 +22,6 @@ int tcp_listen(const struct ctx *c, uint8_t pif, const union inany_addr *addr, const char *ifname, in_port_t port); int tcp_init(struct ctx *c); -void tcp_port_rebind_all(struct ctx *c); void tcp_timer(const struct ctx *c, const struct timespec *now); void tcp_defer_handler(struct ctx *c);
diff --git a/udp.c b/udp.c index d7dcb1d2..7c5546df 100644 --- a/udp.c +++ b/udp.c @@ -1203,98 +1203,6 @@ static void udp_splice_iov_init(void) } }
-/** - * udp_ns_listen() - Init socket to listen for spliced outbound connections - * @c: Execution context - * @port: Port, host order - */ -static void udp_ns_listen(const struct ctx *c, in_port_t port) -{ - ASSERT(!c->no_udp); - - if (!c->no_bindtodevice) { - udp_listen(c, PIF_SPLICE, NULL, "lo", port); - return; - } - - if (c->ifi4) - udp_listen(c, PIF_SPLICE, &inany_loopback4, NULL, port); - if (c->ifi6) - udp_listen(c, PIF_SPLICE, &inany_loopback6, NULL, port); -} - -/** - * udp_port_rebind() - Rebind ports to match forward maps - * @c: Execution context - * @outbound: True to remap outbound forwards, otherwise inbound - * - * Must be called in namespace context if @outbound is true. - */ -static void udp_port_rebind(struct ctx *c, bool outbound) -{ - int (*socks)[NUM_PORTS] = outbound ? udp_splice_ns : udp_splice_init; - const uint8_t *fmap - = outbound ? c->udp.fwd_out.map : c->udp.fwd_in.map; - unsigned port; - - for (port = 0; port < NUM_PORTS; port++) { - if (!bitmap_isset(fmap, port)) { - if (socks[V4][port] >= 0) { - close(socks[V4][port]); - socks[V4][port] = -1; - } - - if (socks[V6][port] >= 0) { - close(socks[V6][port]); - socks[V6][port] = -1; - } - - continue; - } - - if ((c->ifi4 && socks[V4][port] == -1) || - (c->ifi6 && socks[V6][port] == -1)) { - if (outbound) - udp_ns_listen(c, port); - else - udp_listen(c, PIF_HOST, NULL, NULL, port); - } - } -} - -/** - * udp_port_rebind_outbound() - Rebind ports in namespace - * @arg: Execution context - * - * Called with NS_CALL() - * - * Return: 0 - */ -static int udp_port_rebind_outbound(void *arg) -{ - struct ctx *c = (struct ctx *)arg; - - ns_enter(c); - udp_port_rebind(c, true); - - return 0; -} - -/** - * udp_port_rebind_all() - Rebind ports to match forward maps (in host & ns) - * @c: Execution context - */ -void udp_port_rebind_all(struct ctx *c) -{ - ASSERT(c->mode == MODE_PASTA && !c->no_udp); - - if (c->udp.fwd_out.mode == FWD_AUTO) - NS_CALL(udp_port_rebind_outbound, c); - - if (c->udp.fwd_in.mode == FWD_AUTO) - udp_port_rebind(c, false); -} - /** * udp_init() - Initialise per-socket data, and sockets in namespace * @c: Execution context @@ -1307,9 +1215,14 @@ int udp_init(struct ctx *c)
udp_iov_init(c);
+ if (fwd_listen_sync(c, &c->udp.fwd_in, PIF_HOST, IPPROTO_UDP) < 0) + return -1;
Same here, update to the function comment needed.
+ if (c->mode == MODE_PASTA) { udp_splice_iov_init(); - NS_CALL(udp_port_rebind_outbound, c); + if (fwd_listen_sync(c, &c->udp.fwd_out, + PIF_SPLICE, IPPROTO_UDP) < 0) + return -1; }
return 0; diff --git a/udp.h b/udp.h index 94c698e2..73efe036 100644 --- a/udp.h +++ b/udp.h @@ -19,7 +19,6 @@ int udp_listen(const struct ctx *c, uint8_t pif, const union inany_addr *addr, const char *ifname, in_port_t port); int udp_init(struct ctx *c); -void udp_port_rebind_all(struct ctx *c); void udp_update_l2_buf(const unsigned char *eth_d);
/**
-- Stefano
On Thu, 8 Jan 2026 13:29:43 +1100
David Gibson
It's possible for a user to supply conflicting forwarding parameters, e.g. $ pasta -t 80:8080 -t 127.0.0.1/80:8888
We give a warning in this case, but it's based on the legacy forwarding bitmaps. This is too strict, because it will also warn on cases that shouldn't conflict because they use different addresses, e.g. $ pasta -t 192.0.2.1/80:8080 127.0.0.1/80:8888
Theoretically, it's also too loose because it won't take into account auto-scan forwarding rules. We can't hit that in practice now, because we only ever have one auto-scan rule and nothing else, but we want to remove that restriction in future.
Replace the bitmap based check with a check based on actually scanning the forwarding rules for conflicts.
Signed-off-by: David Gibson
--- conf.c | 5 ----- fwd.c | 21 ++++++++++++++++++++- inany.c | 19 +++++++++++++++++++ inany.h | 1 + 4 files changed, 40 insertions(+), 6 deletions(-) diff --git a/conf.c b/conf.c index 725cf88b..e8d6d5d9 100644 --- a/conf.c +++ b/conf.c @@ -172,11 +172,6 @@ static void conf_ports_range_except(const struct ctx *c, char optname, for (i = base; i <= last; i++) { if (exclude && bitmap_isset(exclude, i)) break; - - if (bitmap_isset(fwd->map, i)) { - warn( -"Altering mapping of already mapped port number: %s", optarg); - } }
if ((optname == 'T' || optname == 'U') && c->no_bindtodevice) { diff --git a/fwd.c b/fwd.c index 70ef73a3..5208155b 100644 --- a/fwd.c +++ b/fwd.c @@ -348,7 +348,7 @@ void fwd_rule_add(struct fwd_ports *fwd, uint8_t flags, const uint8_t allowed_flags = FWD_WEAK | FWD_SCAN; unsigned num = (unsigned)last - first + 1; struct fwd_rule *new; - unsigned port; + unsigned i, port;
ASSERT(!(flags & ~allowed_flags));
@@ -357,6 +357,25 @@ void fwd_rule_add(struct fwd_ports *fwd, uint8_t flags, if ((fwd->listen_sock_count + num) > ARRAY_SIZE(fwd->listen_socks)) die("Too many listening sockets");
+ /* Check for any conflicting entries */ + for (i = 0; i < fwd->count; i++) { + char newstr[INANY_ADDRSTRLEN], rulestr[INANY_ADDRSTRLEN]; + struct fwd_rule *rule = &fwd->rules[i]; + + if (!inany_matches(addr, fwd_rule_addr(rule))) + /* Non-conflicting addresses */ + continue; + + if (last < rule->first || rule->last < first) + /* Port ranges don't overlap */ + continue; + + die("Forwarding configuration conflict: %s/%u-%u versus %s/%u-%u", + inany_ntop(addr, newstr, sizeof(newstr)), first, last, + inany_ntop(fwd_rule_addr(rule), rulestr, sizeof(rulestr)), + rule->first, rule->last);
Same as comments to earlier places in fwd_rule_add(): we'll eventually trigger this from a client so we should eventually report failure rather than quitting.
+ } + new = &fwd->rules[fwd->count++]; new->flags = flags;
diff --git a/inany.c b/inany.c index 87a4d8b6..a8c44237 100644 --- a/inany.c +++ b/inany.c @@ -21,6 +21,25 @@ const union inany_addr inany_loopback4 = INANY_INIT4(IN4ADDR_LOOPBACK_INIT); const union inany_addr inany_any4 = INANY_INIT4(IN4ADDR_ANY_INIT);
+/** inany_matches - Do two addresses match
Nit: "Do [...] match?"
+ * @a, @b: IPv[46] addresses (NULL for 0.0.0.0 & ::) + * + * Return: true if they match, false otherwise + * + * Addresses match themselves, but also with unspecified addresses of the same + * family. + */ +bool inany_matches(const union inany_addr *a, const union inany_addr *b) +{ + if (!a || !b) + return true; + + if (inany_is_unspecified(a) || inany_is_unspecified(b)) + return !!inany_v4(a) == !!inany_v4(b); + + return inany_equals(a, b); +} + /** inany_ntop - Convert an IPv[46] address to text format * @src: IPv[46] address (NULL for unspecified) * @dst: output buffer, minimum INANY_ADDRSTRLEN bytes diff --git a/inany.h b/inany.h index 61b36fb4..b02c2891 100644 --- a/inany.h +++ b/inany.h @@ -293,6 +293,7 @@ static inline void inany_siphash_feed(struct siphash_state *state,
#define INANY_ADDRSTRLEN MAX(INET_ADDRSTRLEN, INET6_ADDRSTRLEN)
+bool inany_matches(const union inany_addr *a, const union inany_addr *b); const char *inany_ntop(const union inany_addr *src, char *dst, socklen_t size); int inany_pton(const char *src, union inany_addr *dst);
I reviewed up to 10/14 only (I have no comments on that one), I still need a bit of time for the remaining four patches. -- Stefano
On Tue, Jan 13, 2026 at 12:26:10AM +0100, Stefano Brivio wrote:
On Thu, 8 Jan 2026 13:29:36 +1100 David Gibson
wrote: At present, we set up forwarding as we parse the -t, and -u options, not keeping a persistent data structure with all the details. We do have some information in struct fwd_ports, but it doesn't capture all the nuance that the options do.
As a first step to generalising our forwarding model, add a table of all the forwarding rules to struct fwd_ports. For now it covers only explicit forwards, not automatic, and we don't do anything with it other than print some additional debug information. We'll do more with it in future patches.
Signed-off-by: David Gibson
--- conf.c | 80 ++++++++++++++++++++++++++++++------------------- fwd.c | 95 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ fwd.h | 35 +++++++++++++++++++++- 3 files changed, 179 insertions(+), 31 deletions(-) diff --git a/conf.c b/conf.c index c936664b..127e69f5 100644 --- a/conf.c +++ b/conf.c @@ -137,7 +137,7 @@ static int parse_port_range(const char *s, char **endptr, * @last: Last port to forward * @exclude: Bitmap of ports to exclude * @to: Port to translate @first to when forwarding - * @weak: Ignore errors, as long as at least one port is mapped + * @flags: Flags for forwarding entries */ static void conf_ports_range_except(const struct ctx *c, char optname, const char *optarg, struct fwd_ports *fwd, @@ -145,10 +145,11 @@ static void conf_ports_range_except(const struct ctx *c, char optname, const char *ifname, uint16_t first, uint16_t last, const uint8_t *exclude, uint16_t to, - bool weak) + uint8_t flags) { + unsigned delta = to - first; bool bound_one = false; - unsigned i; + unsigned base, i; int fd;
if (first == 0) { @@ -172,37 +173,45 @@ static void conf_ports_range_except(const struct ctx *c, char optname, } }
- for (i = first; i <= last; i++) { - if (bitmap_isset(exclude, i)) + for (base = first; base <= last; base++) { + if (bitmap_isset(exclude, base)) continue;
- if (bitmap_isset(fwd->map, i)) { - warn( -"Altering mapping of already mapped port number: %s", optarg); - } + for (i = base; i <= last; i++) {
This nested loop is a bit convoluted, but I don't have better ideas and I suppose it goes away later (I just reached 10/14 so far).
The nested loop remains, so far. A lot of the rest of the loop body does go away, so the overall complexity does reduce.
+ if (bitmap_isset(exclude, i)) + break;
- bitmap_set(fwd->map, i); - fwd->delta[i] = to - first; + if (bitmap_isset(fwd->map, i)) { + warn( +"Altering mapping of already mapped port number: %s", optarg); + }
- if (optname == 't') - fd = tcp_listen(c, PIF_HOST, addr, ifname, i); - else if (optname == 'u') - fd = udp_listen(c, PIF_HOST, addr, ifname, i); - else - /* No way to check in advance for -T and -U */ - fd = 0; + if (optname == 't') + fd = tcp_listen(c, PIF_HOST, addr, ifname, i); + else if (optname == 'u') + fd = udp_listen(c, PIF_HOST, addr, ifname, i); + else + /* No way to check in advance for -T and -U */ + fd = 0; + + if (fd == -ENFILE || fd == -EMFILE) { + die( +"Can't open enough sockets for port specifier: %s", + optarg); + }
- if (fd == -ENFILE || fd == -EMFILE) { - die("Can't open enough sockets for port specifier: %s", - optarg); + if (fd >= 0) { + bound_one = true; + } else if (!(flags & FWD_WEAK)) { + die( +"Failed to bind port %u (%s) for option '-%c %s'", + i, strerror_(-fd), optname, optarg); + } }
- if (fd >= 0) { - bound_one = true; - } else if (!weak) { - die("Failed to bind port %u (%s) for option '-%c %s'", - i, strerror_(-fd), optname, optarg); - } + fwd_rule_add(fwd, flags, addr, ifname, base, i - 1, + base + delta); + base = i - 1; }
if (!bound_one) @@ -272,7 +281,7 @@ static void conf_ports(const struct ctx *c, char optname, const char *optarg, conf_ports_range_except(c, optname, optarg, fwd, NULL, NULL, 1, NUM_PORTS - 1, exclude, - 1, true); + 1, FWD_WEAK); return; }
@@ -357,7 +366,7 @@ static void conf_ports(const struct ctx *c, char optname, const char *optarg, conf_ports_range_except(c, optname, optarg, fwd, addr, ifname, 1, NUM_PORTS - 1, exclude, - 1, true); + 1, FWD_WEAK); return; }
@@ -390,7 +399,7 @@ static void conf_ports(const struct ctx *c, char optname, const char *optarg, addr, ifname, orig_range.first, orig_range.last, exclude, - mapped_range.first, false); + mapped_range.first, 0); } while ((p = next_chunk(p, ',')));
return; @@ -1225,6 +1234,17 @@ dns6: info(" %s", c->dns_search[i].n); } } + + info("Inbound TCP forwarding:"); + fwd_rules_print(&c->tcp.fwd_in); + info("Inbound UDP forwarding:"); + fwd_rules_print(&c->udp.fwd_in); + if (c->mode == MODE_PASTA) { + info("Outbound TCP forwarding:"); + fwd_rules_print(&c->tcp.fwd_out); + info("Outbound UDP forwarding:"); + fwd_rules_print(&c->udp.fwd_out); + } }
/** diff --git a/fwd.c b/fwd.c index 961c533f..ad398e54 100644 --- a/fwd.c +++ b/fwd.c @@ -13,6 +13,7 @@ * Author: David Gibson
*/ +#include
#include #include #include @@ -22,6 +23,8 @@ #include "util.h" #include "ip.h" +#include "siphash.h" +#include "inany.h" #include "fwd.h" #include "passt.h" #include "lineread.h" @@ -300,6 +303,19 @@ parse_err: warn("Unable to parse %s", PORT_RANGE_SYSCTL); }
+/** + * fwd_rule_addr() - Return match address for a rule + * @rule: Forwarding rule + * + * Return: Rule's match address, or NULL, if it matches any address
What about:
matching address for rule, NULL if it matches all addresses
+ */ +static const union inany_addr *fwd_rule_addr(const struct fwd_rule *rule) +{ + if (rule->flags & FWD_DUAL_STACK_ANY) + return NULL;
Nit: usual newline here.
Fixed.
+ return &rule->addr; +} + /** * fwd_port_is_ephemeral() - Is port number ephemeral? * @port: Port number @@ -313,6 +329,85 @@ bool fwd_port_is_ephemeral(in_port_t port) return (port >= fwd_ephemeral_min) && (port <= fwd_ephemeral_max); }
+/** + * fwd_rule_add() - Add a rule to a forwarding table + * @fwd: Table to add to + * @flags: Flags for this entry + * @addr: Our address to forward (NULL for both 0.0.0.0 and ::) + * @ifname: Only forward from this interface name, if non-empty + * @first: First port number to forward + * @last: Last port number to forward + * @to: First port of target port range to map to + */ +void fwd_rule_add(struct fwd_ports *fwd, uint8_t flags, + const union inany_addr *addr, const char *ifname, + in_port_t first, in_port_t last, in_port_t to) +{ + /* Flags which can be set from the caller */ + const uint8_t allowed_flags = FWD_WEAK; + struct fwd_rule *new; + unsigned port; + + ASSERT(!(flags & ~allowed_flags)); + + if (fwd->count >= ARRAY_SIZE(fwd->rules)) + die("Too many port forwarding ranges"); + + new = &fwd->rules[fwd->count++]; + new->flags = flags; + + if (addr) { + new->addr = *addr; + } else { + new->addr = inany_any6; + new->flags |= FWD_DUAL_STACK_ANY; + } + + memset(new->ifname, 0, sizeof(new->ifname)); + if (ifname) + strncpy(new->ifname, ifname, sizeof(new->ifname));
Coverity Scan complains because we copy exactly up to sizeof(new->ifname) bytes, and the terminator might be missing:
/home/sbrivio/passt/fwd.c:368:3: Type: Buffer not null terminated (BUFFER_SIZE)
/home/sbrivio/passt/fwd.c:351:2: 1. path: Condition "!(flags & -3 /* ~allowed_flags */)", taking true branch. /home/sbrivio/passt/fwd.c:353:2: 2. path: Condition "fwd->count >= 128U /* (int)(sizeof (fwd->rules) / sizeof (fwd->rules[0])) */", taking false branch. /home/sbrivio/passt/fwd.c:359:2: 3. path: Condition "addr", taking true branch. /home/sbrivio/passt/fwd.c:361:2: 4. path: Falling through to end of if statement. /home/sbrivio/passt/fwd.c:367:2: 5. path: Condition "ifname", taking true branch. /home/sbrivio/passt/fwd.c:368:3: 6. buffer_size_warning: Calling "strncpy" with a maximum size argument of 16 bytes on destination array "new->ifname" of size 16 bytes might leave the destination string unterminated. /home/sbrivio/passt/fwd.c:370:2:
Ah, good point. I added a check.
+ + ASSERT(first <= last); + new->first = first; + new->last = last; + + new->to = to; + + for (port = new->first; port <= new->last; port++) { + /* Fill in the legacy data structures to match the table */ + bitmap_set(fwd->map, port); + fwd->delta[port] = new->to - new->first; + } +} + +/** + * fwd_rules_print() - Print forwarding rules for debugging + * @fwd: Table to print + */ +void fwd_rules_print(const struct fwd_ports *fwd) +{ + unsigned i; + + for (i = 0; i < fwd->count; i++) { + const struct fwd_rule *rule = &fwd->rules[i]; + const char *weak = rule->flags & FWD_WEAK ? " WEAK" : "";
Should we print " might fail" or " can fail" instead of " WEAK"? This is for users.
Good point. THough I'm not sure "might fail" or "can fail" is terribly clear in context either. I've gone with " (best effort)" for now.
+ const char *percent = *rule->ifname ? "%" : ""; + char addr[INANY_ADDRSTRLEN]; + + inany_ntop(fwd_rule_addr(rule), addr, sizeof(addr)); + + if (rule->first == rule->last) { + info(" [%s]%s%s:%hu => %hu %s", + addr, percent, rule->ifname, + rule->first, rule->to, weak); + } else { + info(" [%s]%s%s:%hu-%hu => %hu-%hu %s", + addr, percent, rule->ifname, rule->first, rule->last, + rule->to, rule->last - rule->first + rule->to, weak); + } + } +} + /* See enum in kernel's include/net/tcp_states.h */ #define UDP_LISTEN 0x07 #define TCP_LISTEN 0x0a diff --git a/fwd.h b/fwd.h index 934bab33..3dfc7959 100644 --- a/fwd.h +++ b/fwd.h @@ -16,6 +16,30 @@ struct flowside; void fwd_probe_ephemeral(void); bool fwd_port_is_ephemeral(in_port_t port);
+/** + * struct fwd_rule - Forwarding rule governing a range of ports + * @addr: Address to forward from + * @ifname: Interface to forward from + * @first: First port number to forward + * @last: Last port number to forward + * @to: Port number to forward port @first to. + * @flags: Flag mask + * FWD_DUAL_STACK_ANY - match any IPv4 or IPv6 address (@addr should be ::) + * FWD_WEAK - Don't give an error if binds fail for some forwards
Usually we document bit flags inline as we define them (see struct tcp_tap_conn), I guess it's a bit easier to follow, but this has the advantage of looking less cramped (unless we add comment lines before the #defines, which would also be reasonable I think).
Not a strong preference either way.
Keeping it as-is, for now, since some of the descriptions are quite long so they'd wrap if put on the same line as the #define.
+ * + * FIXME: @addr and @ifname currently ignored for outbound tables + */ +struct fwd_rule { + union inany_addr addr; + char ifname[IFNAMSIZ]; + in_port_t first, last, to;
Having one line for each helps (me at least) visualising the size of the struct. Not a strong preference.
Fair point, altered.
+#define FWD_DUAL_STACK_ANY BIT(0) +#define FWD_WEAK BIT(1) + uint8_t flags; +}; + +#define MAX_FWD_RULES 128 + /** * union fwd_listen_ref - information about a single listening socket * @port: Bound port number of the socket @@ -44,6 +68,8 @@ enum fwd_ports_mode { * @mode: Overall forwarding mode (all, none, auto, specific ports) * @scan4: /proc/net fd to scan for IPv4 ports when in AUTO mode * @scan6: /proc/net fd to scan for IPv6 ports when in AUTO mode + * @count: Number of forwarding rules + * @rules: Array of forwarding rules * @map: Bitmap describing which ports are forwarded * @delta: Offset between the original destination and mapped port number */ @@ -51,14 +77,21 @@ struct fwd_ports { enum fwd_ports_mode mode; int scan4; int scan6; + unsigned count; + struct fwd_rule rules[MAX_FWD_RULES]; uint8_t map[PORT_BITMAP_SIZE]; in_port_t delta[NUM_PORTS]; };
#define FWD_PORT_SCAN_INTERVAL 1000 /* ms */
+void fwd_rule_add(struct fwd_ports *fwd, uint8_t flags, + const union inany_addr *addr, const char *ifname, + in_port_t first, in_port_t last, in_port_t to); +void fwd_rules_print(const struct fwd_ports *fwd); + void fwd_scan_ports_init(struct ctx *c); -void fwd_scan_ports_timer(struct ctx *c, const struct timespec *now); +void fwd_scan_ports_timer(struct ctx * c, const struct timespec *now);
bool nat_inbound(const struct ctx *c, const union inany_addr *addr, union inany_addr *translated);
-- Stefano
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Tue, Jan 13, 2026 at 12:26:15AM +0100, Stefano Brivio wrote:
On Thu, 8 Jan 2026 13:29:38 +1100 David Gibson
wrote: Currently the forwarding table records details of explicit port forwards, but nothing for -[tTuU] auto. That looks a little odd on the debug output, and will be a problem for future changes.
Extend the forward table to have rules for auto-scanned forwards, using a new FWD_SCAN flag. For now the mechanism of auto port forwarding isn't updated, and we will only create a single FWD_SCAN rule per protocol and direction. We'll better integrate auto scanning with other forward table mechanics in future.
Signed-off-by: David Gibson
--- conf.c | 31 ++++++++++++++++++++++++++----- fwd.c | 18 +++++++++++------- fwd.h | 2 ++ 3 files changed, 39 insertions(+), 12 deletions(-) diff --git a/conf.c b/conf.c index b486fefe..0bcf80d7 100644 --- a/conf.c +++ b/conf.c @@ -135,7 +135,7 @@ static int parse_port_range(const char *s, char **endptr, * @ifname: Listening interface * @first: First port to forward * @last: Last port to forward - * @exclude: Bitmap of ports to exclude + * @exclude: Bitmap of ports to exclude (may be NULL) * @to: Port to translate @first to when forwarding * @flags: Flags for forwarding entries */ @@ -168,11 +168,11 @@ static void conf_ports_range_except(const struct ctx *c, char optname, }
for (base = first; base <= last; base++) { - if (bitmap_isset(exclude, base)) + if (exclude && bitmap_isset(exclude, base)) continue;
for (i = base; i <= last; i++) { - if (bitmap_isset(exclude, i)) + if (exclude && bitmap_isset(exclude, i)) break;
if (bitmap_isset(fwd->map, i)) { @@ -180,9 +180,9 @@ static void conf_ports_range_except(const struct ctx *c, char optname, "Altering mapping of already mapped port number: %s", optarg); }
- if (optname == 't') + if (!(flags & FWD_SCAN) && optname == 't') fd = tcp_listen(c, PIF_HOST, addr, ifname, i); - else if (optname == 'u') + else if (!(flags & FWD_SCAN) && optname == 'u') fd = udp_listen(c, PIF_HOST, addr, ifname, i); else /* No way to check in advance for -T and -U */ @@ -2202,6 +2202,27 @@ void conf(struct ctx *c, int argc, char **argv) if (!c->udp.fwd_out.mode) c->udp.fwd_out.mode = fwd_default;
+ if (c->tcp.fwd_in.mode == FWD_AUTO) { + conf_ports_range_except(c, 't', "auto", &c->tcp.fwd_in, + NULL, NULL, 1, NUM_PORTS - 1, + NULL, 1, FWD_SCAN); + } + if (c->tcp.fwd_out.mode == FWD_AUTO) { + conf_ports_range_except(c, 'T', "auto", &c->tcp.fwd_out, + NULL, "lo", 1, NUM_PORTS - 1, + NULL, 1, FWD_SCAN); + } + if (c->udp.fwd_in.mode == FWD_AUTO) { + conf_ports_range_except(c, 'u', "auto", &c->udp.fwd_in, + NULL, NULL, 1, NUM_PORTS - 1, + NULL, 1, FWD_SCAN); + } + if (c->udp.fwd_out.mode == FWD_AUTO) { + conf_ports_range_except(c, 'U', "auto", &c->udp.fwd_out, + NULL, "lo", 1, NUM_PORTS - 1, + NULL, 1, FWD_SCAN); + } + if (!c->quiet) conf_print(c); } diff --git a/fwd.c b/fwd.c index ad398e54..69aca441 100644 --- a/fwd.c +++ b/fwd.c @@ -344,7 +344,7 @@ void fwd_rule_add(struct fwd_ports *fwd, uint8_t flags, in_port_t first, in_port_t last, in_port_t to) { /* Flags which can be set from the caller */ - const uint8_t allowed_flags = FWD_WEAK; + const uint8_t allowed_flags = FWD_WEAK | FWD_SCAN; struct fwd_rule *new; unsigned port;
@@ -375,7 +375,8 @@ void fwd_rule_add(struct fwd_ports *fwd, uint8_t flags,
for (port = new->first; port <= new->last; port++) { /* Fill in the legacy data structures to match the table */ - bitmap_set(fwd->map, port); + if (!(new->flags & FWD_SCAN)) + bitmap_set(fwd->map, port); fwd->delta[port] = new->to - new->first; } } @@ -391,19 +392,22 @@ void fwd_rules_print(const struct fwd_ports *fwd) for (i = 0; i < fwd->count; i++) { const struct fwd_rule *rule = &fwd->rules[i]; const char *weak = rule->flags & FWD_WEAK ? " WEAK" : ""; + const char *scan = rule->flags & FWD_SCAN ? " AUTO" : ""; const char *percent = *rule->ifname ? "%" : ""; char addr[INANY_ADDRSTRLEN];
inany_ntop(fwd_rule_addr(rule), addr, sizeof(addr));
if (rule->first == rule->last) { - info(" [%s]%s%s:%hu => %hu %s", + info(" [%s]%s%s:%hu => %hu %s%s", addr, percent, rule->ifname, - rule->first, rule->to, weak); + rule->first, rule->to, weak, scan); } else { - info(" [%s]%s%s:%hu-%hu => %hu-%hu %s", - addr, percent, rule->ifname, rule->first, rule->last, - rule->to, rule->last - rule->first + rule->to, weak); + info(" [%s]%s%s:%hu-%hu => %hu-%hu %s%s", + addr, percent, rule->ifname, + rule->first, rule->last, + rule->to, rule->last - rule->first + rule->to, + weak, scan); } } } diff --git a/fwd.h b/fwd.h index 3dfc7959..94869c2a 100644 --- a/fwd.h +++ b/fwd.h @@ -26,6 +26,7 @@ bool fwd_port_is_ephemeral(in_port_t port); * @flags: Flag mask * FWD_DUAL_STACK_ANY - match any IPv4 or IPv6 address (@addr should be ::) * FWD_WEAK - Don't give an error if binds fail for some forwards + * FWD_SCAN - Only forward if we scan a listener on the target
Nit: I guess this could be slightly more descriptive than "scan a listener", mentioning for example "if the corresponding port is bound on the target".
Done.
* * FIXME: @addr and @ifname currently ignored for outbound tables */ @@ -35,6 +36,7 @@ struct fwd_rule { in_port_t first, last, to; #define FWD_DUAL_STACK_ANY BIT(0) #define FWD_WEAK BIT(1) +#define FWD_SCAN BIT(2) uint8_t flags; };
-- Stefano
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Tue, Jan 13, 2026 at 12:26:22AM +0100, Stefano Brivio wrote:
On Thu, 8 Jan 2026 13:29:39 +1100 David Gibson
wrote: At present, we don't keep track of the fds for listening sockets (except for "auto" ones). Since the fd is stored in the epoll reference, we didn't need an alternative source of it for the various handlers.
However, we're intending to allow dynamic changes to forwarding configuration in future. That means we need a way to enumerate sockets so we can close them on removal of a forward.
Extend our forwarding table data structure to make space for all the listening sockets. To avoid allocation, this imposes another limit: we could run out of space for socket fds before we run out of slots for forwarding rules.
We don't actually do anything with the allocate spaced yet. For "auto" forwards it's redundant with existing arrays. We'll fix both of those in later patches.
Signed-off-by: David Gibson
--- fwd.c | 10 +++++++++- fwd.h | 13 +++++++++++++ 2 files changed, 22 insertions(+), 1 deletion(-) diff --git a/fwd.c b/fwd.c index 69aca441..f27a4220 100644 --- a/fwd.c +++ b/fwd.c @@ -345,6 +345,7 @@ void fwd_rule_add(struct fwd_ports *fwd, uint8_t flags, { /* Flags which can be set from the caller */ const uint8_t allowed_flags = FWD_WEAK | FWD_SCAN; + unsigned num = (unsigned)last - first + 1; struct fwd_rule *new; unsigned port;
@@ -352,6 +353,8 @@ void fwd_rule_add(struct fwd_ports *fwd, uint8_t flags,
if (fwd->count >= ARRAY_SIZE(fwd->rules)) die("Too many port forwarding ranges"); + if ((fwd->listen_sock_count + num) > ARRAY_SIZE(fwd->listen_socks)) + die("Too many listening sockets");
Here, and above: we plan to trigger this from a client at runtime, and if there are too many listening sockets, or too many rules/ranges, we should fail and report failure (ENOBUFS I guess) instead of quitting.
I'm aware, but for now, these errors are always fatal, so I was going to defer the error plumbing until later. It's not really any harder, and this keeps things a bit simpler while we establish the basic structure of the table.
new = &fwd->rules[fwd->count++]; new->flags = flags; @@ -373,8 +376,13 @@ void fwd_rule_add(struct fwd_ports *fwd, uint8_t flags,
new->to = to;
+ new->socks = &fwd->listen_socks[fwd->listen_sock_count]; + fwd->listen_sock_count += num; + for (port = new->first; port <= new->last; port++) { - /* Fill in the legacy data structures to match the table */ + new->socks[port - new->first] = -1;
It's probably saner to initialise these, but just to confirm my understanding: this is not needed as (unlike what you have for rules) the array is always compacted and its used length described, right?
It is needed. Later patches use the fd array to determine if a specific port is _actually_ listening, or just maybe listening. This matters for FWD_WEAK and FWD_SCAN entries. It's nicer to keep it consistent for regular mappings too - that allows fwd_sync_one() to know if it has to do anything or not. As for compaction: I do intend to keep the subarrays used for each rule compacted together, although we don't get allow rules to be deleted, so there's nothing to be done for that yet. I'm not compacting within the subarray, and have no plans to do so at present.
+ + /* Fill in the legacy forwarding data structures to match the table */ if (!(new->flags & FWD_SCAN)) bitmap_set(fwd->map, port); fwd->delta[port] = new->to - new->first; diff --git a/fwd.h b/fwd.h index 94869c2a..3ddcb91d 100644 --- a/fwd.h +++ b/fwd.h @@ -23,6 +23,7 @@ bool fwd_port_is_ephemeral(in_port_t port); * @first: First port number to forward * @last: Last port number to forward * @to: Port number to forward port @first to. + * @socks: Array of listening sockets for this entry * @flags: Flag mask * FWD_DUAL_STACK_ANY - match any IPv4 or IPv6 address (@addr should be ::) * FWD_WEAK - Don't give an error if binds fail for some forwards @@ -34,6 +35,7 @@ struct fwd_rule { union inany_addr addr; char ifname[IFNAMSIZ]; in_port_t first, last, to; + int *socks; #define FWD_DUAL_STACK_ANY BIT(0) #define FWD_WEAK BIT(1) #define FWD_SCAN BIT(2) @@ -65,6 +67,13 @@ enum fwd_ports_mode {
#define PORT_BITMAP_SIZE DIV_ROUND_UP(NUM_PORTS, 8)
+/* Maximum number of listening sockets (per pif & protocol) + * + * Rationale: This lets us listen on every port for two addresses (which we need + * for -T auto without SO_BINDTODEVICE), plus a comfortable number of extras. + */ +#define MAX_LISTEN_SOCKS (NUM_PORTS * 3) + /** * fwd_ports() - Describes port forwarding for one protocol and direction * @mode: Overall forwarding mode (all, none, auto, specific ports) @@ -74,6 +83,8 @@ enum fwd_ports_mode { * @rules: Array of forwarding rules * @map: Bitmap describing which ports are forwarded * @delta: Offset between the original destination and mapped port number + * @listen_sock_count: Number of entries used in @listen_socks + * @listen_socks: Listening sockets for forwarding
To keep those aligned:
/** * fwd_ports() - Describes port forwarding for one protocol and direction * @mode: Overall forwarding mode (all, none, auto, some ports) * @scan4: /proc/net fd to scan for IPv4 ports when in AUTO mode * @scan6: /proc/net fd to scan for IPv6 ports when in AUTO mode * @count: Number of forwarding rules * @rules: Array of forwarding rules * @map: Bitmap describing which ports are forwarded * @delta: Offset between original and mapped destination port * @listen_sock_count: Number of entries used in @listen_socks * @listen_socks: Listening sockets for forwarding */
Done. I'd also be very open to more succinct names for these fields, but they haven't occurred to me yet.
*/ struct fwd_ports { enum fwd_ports_mode mode; @@ -83,6 +94,8 @@ struct fwd_ports { struct fwd_rule rules[MAX_FWD_RULES]; uint8_t map[PORT_BITMAP_SIZE]; in_port_t delta[NUM_PORTS]; + unsigned listen_sock_count; + int listen_socks[MAX_LISTEN_SOCKS]; };
#define FWD_PORT_SCAN_INTERVAL 1000 /* ms */
-- Stefano
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Tue, Jan 13, 2026 at 12:26:46AM +0100, Stefano Brivio wrote:
On Thu, 8 Jan 2026 13:29:43 +1100 David Gibson
wrote: It's possible for a user to supply conflicting forwarding parameters, e.g. $ pasta -t 80:8080 -t 127.0.0.1/80:8888
We give a warning in this case, but it's based on the legacy forwarding bitmaps. This is too strict, because it will also warn on cases that shouldn't conflict because they use different addresses, e.g. $ pasta -t 192.0.2.1/80:8080 127.0.0.1/80:8888
Theoretically, it's also too loose because it won't take into account auto-scan forwarding rules. We can't hit that in practice now, because we only ever have one auto-scan rule and nothing else, but we want to remove that restriction in future.
Replace the bitmap based check with a check based on actually scanning the forwarding rules for conflicts.
Signed-off-by: David Gibson
--- conf.c | 5 ----- fwd.c | 21 ++++++++++++++++++++- inany.c | 19 +++++++++++++++++++ inany.h | 1 + 4 files changed, 40 insertions(+), 6 deletions(-) diff --git a/conf.c b/conf.c index 725cf88b..e8d6d5d9 100644 --- a/conf.c +++ b/conf.c @@ -172,11 +172,6 @@ static void conf_ports_range_except(const struct ctx *c, char optname, for (i = base; i <= last; i++) { if (exclude && bitmap_isset(exclude, i)) break; - - if (bitmap_isset(fwd->map, i)) { - warn( -"Altering mapping of already mapped port number: %s", optarg); - } }
if ((optname == 'T' || optname == 'U') && c->no_bindtodevice) { diff --git a/fwd.c b/fwd.c index 70ef73a3..5208155b 100644 --- a/fwd.c +++ b/fwd.c @@ -348,7 +348,7 @@ void fwd_rule_add(struct fwd_ports *fwd, uint8_t flags, const uint8_t allowed_flags = FWD_WEAK | FWD_SCAN; unsigned num = (unsigned)last - first + 1; struct fwd_rule *new; - unsigned port; + unsigned i, port;
ASSERT(!(flags & ~allowed_flags));
@@ -357,6 +357,25 @@ void fwd_rule_add(struct fwd_ports *fwd, uint8_t flags, if ((fwd->listen_sock_count + num) > ARRAY_SIZE(fwd->listen_socks)) die("Too many listening sockets");
+ /* Check for any conflicting entries */ + for (i = 0; i < fwd->count; i++) { + char newstr[INANY_ADDRSTRLEN], rulestr[INANY_ADDRSTRLEN]; + struct fwd_rule *rule = &fwd->rules[i]; + + if (!inany_matches(addr, fwd_rule_addr(rule))) + /* Non-conflicting addresses */ + continue; + + if (last < rule->first || rule->last < first) + /* Port ranges don't overlap */ + continue; + + die("Forwarding configuration conflict: %s/%u-%u versus %s/%u-%u", + inany_ntop(addr, newstr, sizeof(newstr)), first, last, + inany_ntop(fwd_rule_addr(rule), rulestr, sizeof(rulestr)), + rule->first, rule->last);
Same as comments to earlier places in fwd_rule_add(): we'll eventually trigger this from a client so we should eventually report failure rather than quitting.
Right, and in the same way I'm planning to change all the die()s to return errors in a later patch.
+ } + new = &fwd->rules[fwd->count++]; new->flags = flags;
diff --git a/inany.c b/inany.c index 87a4d8b6..a8c44237 100644 --- a/inany.c +++ b/inany.c @@ -21,6 +21,25 @@ const union inany_addr inany_loopback4 = INANY_INIT4(IN4ADDR_LOOPBACK_INIT); const union inany_addr inany_any4 = INANY_INIT4(IN4ADDR_ANY_INIT);
+/** inany_matches - Do two addresses match
Nit: "Do [...] match?"
Fixed.
+ * @a, @b: IPv[46] addresses (NULL for 0.0.0.0 & ::) + * + * Return: true if they match, false otherwise + * + * Addresses match themselves, but also with unspecified addresses of the same + * family. + */ +bool inany_matches(const union inany_addr *a, const union inany_addr *b) +{ + if (!a || !b) + return true; + + if (inany_is_unspecified(a) || inany_is_unspecified(b)) + return !!inany_v4(a) == !!inany_v4(b); + + return inany_equals(a, b); +} + /** inany_ntop - Convert an IPv[46] address to text format * @src: IPv[46] address (NULL for unspecified) * @dst: output buffer, minimum INANY_ADDRSTRLEN bytes diff --git a/inany.h b/inany.h index 61b36fb4..b02c2891 100644 --- a/inany.h +++ b/inany.h @@ -293,6 +293,7 @@ static inline void inany_siphash_feed(struct siphash_state *state,
#define INANY_ADDRSTRLEN MAX(INET_ADDRSTRLEN, INET6_ADDRSTRLEN)
+bool inany_matches(const union inany_addr *a, const union inany_addr *b); const char *inany_ntop(const union inany_addr *src, char *dst, socklen_t size); int inany_pton(const char *src, union inany_addr *dst);
I reviewed up to 10/14 only (I have no comments on that one), I still need a bit of time for the remaining four patches.
-- Stefano
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Tue, Jan 13, 2026 at 12:26:36AM +0100, Stefano Brivio wrote:
On Thu, 8 Jan 2026 13:29:41 +1100 David Gibson
wrote: Previously we created inbound listening sockets as we parsed the forwarding options (-t, -u) whereas outbound listening sockets were created during {tcp,udp}_init(). Now that we have a data structure recording the full details of the listening options we can move all listening socket creation to {tcp,udp}_init(). This means that errors for either direction are detected and reported the same way.
Introduce fwd_listen_sync() which synchronizes the state of listening sockets to the forwarding rules table, both for fixed and automatic forwards.
This does cause a change in semantics for "exclude only" port specifications. Previously an option like -t ~6000 wouldn't cause a fatal error, as long as we could bind at least one port. Now, it requires at least one port for each generated rule; that is for each of the contiguous blocks of ports the specification resolves to. With typical ephemeral ports settings that's one port each in 1..5999, 6001..32767 and 61000..65535.
Preserving the exact behaviour for this case would require a considerably more complex data structure, so I'm hoping this is a sufficiently niche case for the change to be acceptable.
I guess so too, I wouldn't really worry.
Well, I'm not sure if it works, but one relatively simple idea could be to have a "with_prev" bit in the rule struct representing the fact that the current rule was derived from the same port specification as the previous rule, which implies they would need to be deleted all together (but we can happily enforce that).
Then, in the fwd_listen_sync_() loop, before reporting failure, you would check the next entry: if the "with_prev" bit is set, report failure only if we fail (keeping a local boolean flag) for all the entries up to the first one with "with_prev" unset.
I'll keep that approach in mind if it seems like we need it.
I would be inclined to say it's worth it if it's that simple, but I haven't tried, so I might be very well missing something.
I also considered making WEAK mean we'd always continue on listen failures, even if all of them fail. Maybe that's a bit unexpected? But it would allow an option to "forward single port X, if you can" which seems like it might be useful.
Signed-off-by: David Gibson
--- conf.c | 27 ---------- fwd.c | 167 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-- fwd.h | 3 ++ ip.c | 1 - tcp.c | 122 ++--------------------------------------- tcp.h | 1 - udp.c | 99 +++------------------------------- udp.h | 1 - 8 files changed, 177 insertions(+), 244 deletions(-) diff --git a/conf.c b/conf.c index 0bcf80d7..57693b3f 100644 --- a/conf.c +++ b/conf.c @@ -148,9 +148,7 @@ static void conf_ports_range_except(const struct ctx *c, char optname, uint8_t flags) { unsigned delta = to - first; - bool bound_one = false; unsigned base, i; - int fd;
if (first == 0) { die("Can't forward port 0 for option '-%c %s'", @@ -179,28 +177,6 @@ static void conf_ports_range_except(const struct ctx *c, char optname, warn( "Altering mapping of already mapped port number: %s", optarg); } - - if (!(flags & FWD_SCAN) && optname == 't') - fd = tcp_listen(c, PIF_HOST, addr, ifname, i); - else if (!(flags & FWD_SCAN) && optname == 'u') - fd = udp_listen(c, PIF_HOST, addr, ifname, i); - else - /* No way to check in advance for -T and -U */ - fd = 0; - - if (fd == -ENFILE || fd == -EMFILE) { - die( -"Can't open enough sockets for port specifier: %s", - optarg); - } - - if (fd >= 0) { - bound_one = true; - } else if (!(flags & FWD_WEAK)) { - die( -"Failed to bind port %u (%s) for option '-%c %s'", - i, strerror_(-fd), optname, optarg); - } }
if ((optname == 'T' || optname == 'U') && c->no_bindtodevice) { @@ -226,9 +202,6 @@ static void conf_ports_range_except(const struct ctx *c, char optname, } base = i - 1; } - - if (!bound_one) - die("Failed to bind any port for '-%c %s'", optname, optarg); }
/** diff --git a/fwd.c b/fwd.c index f27a4220..70ef73a3 100644 --- a/fwd.c +++ b/fwd.c @@ -22,6 +22,7 @@ #include
#include "util.h" +#include "epoll_ctl.h" #include "ip.h" #include "siphash.h" #include "inany.h" @@ -420,6 +421,160 @@ void fwd_rules_print(const struct fwd_ports *fwd) } }
+/** fwd_sync_one() - Create or remove listening sockets for a forward entry + * @c: Execution context + * @rule: Forwarding rule + * @pif: Interface to create listening sockets for + * @proto: Protocol to listen for + * @scanmap: Bitmap of ports to listen for on FWD_SCAN entries + * + * Return: 0 on success, -1 on failure + */ +static int fwd_sync_one(const struct ctx *c, const struct fwd_rule *rule, + uint8_t pif, uint8_t proto, const uint8_t *scanmap) +{ + const union inany_addr *addr = fwd_rule_addr(rule); + const char *ifname = rule->ifname; + bool bound_one = false; + unsigned port; + + ASSERT(pif_is_socket(pif)); + + if (!*ifname) + ifname = NULL; + + for (port = rule->first; port <= rule->last; port++) { + int fd = rule->socks[port - rule->first]; + + if ((rule->flags & FWD_SCAN) && !bitmap_isset(scanmap, port)) { + /* We don't want to listen on this port */ + if (fd >= 0) { + /* We already are, so stop */ + epoll_del(c->epollfd, fd); + close(fd); + rule->socks[port - rule->first] = -1; + } + continue; + } + + if (fd >= 0) /* Already listening, nothing to do */ { + bound_one = true; + continue; + } + + if (proto == IPPROTO_TCP) + fd = tcp_listen(c, pif, addr, ifname, port); + else if (proto == IPPROTO_UDP) + fd = udp_listen(c, pif, addr, ifname, port); + else + ASSERT(0); + + if (fd < 0) { + char astr[INANY_ADDRSTRLEN] = "";
Should we perhaps make this "*" for consistency with fwd_rules_print()?
Good idea, that simplifies things a bit too. This code predates my extension to inany_ntop(), and I forgot to rework it to take advantage.
+ + if (addr) + inany_ntop(addr, astr, sizeof(astr)); + + warn("Listen failed for %s %s port %s%s%s%s%u: %s", + pif_name(pif), ipproto_name(proto), + astr, ifname ? "%" : "", ifname ? ifname : "", + addr || ifname ? "/" : "", port, strerror_(-fd)); + + if (!(rule->flags & FWD_WEAK)) + return -1; + + continue; + } + + rule->socks[port - rule->first] = fd; + bound_one = true; + } + + if (!bound_one && !(rule->flags & FWD_SCAN)) { + char astr[INANY_ADDRSTRLEN] = "";
Same here.
Done.
+ + if (addr) + inany_ntop(addr, astr, sizeof(astr)); + + warn("All listens failed for %s %s %s%s%s%s%u-%u", + pif_name(pif), ipproto_name(proto), + astr, ifname ? "%" : "", ifname ? ifname : "", + addr || ifname ? "/" : "", rule->first, rule->last); + return -1; + } + + return 0; +} + +/** struct fwd_listen_args - arguments for fwd_listen_init_() + * @c: Execution context + * @fwd: Forwarding information + * @scanmap: Bitmap of ports to auto-forward + * @pif: Interface to create listening sockets for + * @proto: Protocol + * @ret: Return code + */ +struct fwd_listen_args { + const struct ctx *c; + const struct fwd_ports *fwd; + const uint8_t *scanmap; + uint8_t pif; + uint8_t proto; + int ret; +}; + +/** fwd_listen_sync_() - Update listening sockets to match forwards + * @arg: struct fwd_listen_args with arguments + * + * Returns: zero + */ +static int fwd_listen_sync_(void *arg) +{ + struct fwd_listen_args *a = arg; + unsigned i; + + if (a->pif == PIF_SPLICE) + ns_enter(a->c); + + for (i = 0; i < a->fwd->count; i++) { + a->ret = fwd_sync_one(a->c, &a->fwd->rules[i], + a->pif, a->proto, a->fwd->map); + if (a->ret < 0) + break; + } + + return 0; +} + +/** fwd_listen_sync() - Update listening sockets to match forwards
This has the same description as fwd_listen_sync_() and it might be quite hard to understand the difference if one is not used to spot the "void *arg" argument. What about:
/** fwd_listen_sync() - Call fwd_listen_sync_() in the intended namespace
?
Fair point, done.
+ * @c: Execution context + * @fwd: Forwarding information + * @pif: Interface to create listening sockets for + * @proto: Protocol + * + * Return: 0 on success, -1 on failure + */ +int fwd_listen_sync(const struct ctx *c, const struct fwd_ports *fwd, + uint8_t pif, uint8_t proto) +{ + struct fwd_listen_args a = { + .c = c, .fwd = fwd, .pif = pif, .proto = proto, + }; + + if (pif == PIF_SPLICE) + NS_CALL(fwd_listen_sync_, &a); + else + fwd_listen_sync_(&a); + + if (a.ret < 0) { + err("Couldn't listen on requested %s ports", + ipproto_name(proto)); + return -1; + } + + return 0; +} + /* See enum in kernel's include/net/tcp_states.h */ #define UDP_LISTEN 0x07 #define TCP_LISTEN 0x0a @@ -578,10 +733,14 @@ void fwd_scan_ports_timer(struct ctx *c, const struct timespec *now)
fwd_scan_ports(c);
- if (!c->no_tcp) - tcp_port_rebind_all(c); - if (!c->no_udp) - udp_port_rebind_all(c); + if (!c->no_tcp) { + fwd_listen_sync(c, &c->tcp.fwd_in, PIF_HOST, IPPROTO_TCP); + fwd_listen_sync(c, &c->tcp.fwd_out, PIF_SPLICE, IPPROTO_TCP); + } + if (!c->no_udp) { + fwd_listen_sync(c, &c->udp.fwd_in, PIF_HOST, IPPROTO_UDP); + fwd_listen_sync(c, &c->udp.fwd_out, PIF_SPLICE, IPPROTO_UDP); + } }
/** diff --git a/fwd.h b/fwd.h index 3ddcb91d..f84e7c01 100644 --- a/fwd.h +++ b/fwd.h @@ -108,6 +108,9 @@ void fwd_rules_print(const struct fwd_ports *fwd); void fwd_scan_ports_init(struct ctx *c); void fwd_scan_ports_timer(struct ctx * c, const struct timespec *now);
+int fwd_listen_sync(const struct ctx *c, const struct fwd_ports *fwd, + uint8_t pif, uint8_t proto); + bool nat_inbound(const struct ctx *c, const union inany_addr *addr, union inany_addr *translated); uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto, diff --git a/ip.c b/ip.c index f1d224bd..fc26dab2 100644 --- a/ip.c +++ b/ip.c @@ -78,7 +78,6 @@ found: * /etc/protocols and might allocate, which isn't possible for us once * self-isolated. */ -/* cppcheck-suppress unusedFunction */ const char *ipproto_name(uint8_t proto) { switch (proto) { diff --git a/tcp.c b/tcp.c index 57faed4b..976f0ab7 100644 --- a/tcp.c +++ b/tcp.c @@ -2732,50 +2732,6 @@ int tcp_listen(const struct ctx *c, uint8_t pif, return s; }
-/** - * tcp_ns_listen() - Init socket to listen for spliced outbound connections - * @c: Execution context - * @port: Port, host order - */ -static void tcp_ns_listen(const struct ctx *c, in_port_t port) -{ - ASSERT(!c->no_tcp); - - if (!c->no_bindtodevice) { - tcp_listen(c, PIF_SPLICE, NULL, "lo", port); - return; - } - - if (c->ifi4) - tcp_listen(c, PIF_SPLICE, &inany_loopback4, NULL, port); - if (c->ifi6) - tcp_listen(c, PIF_SPLICE, &inany_loopback6, NULL, port); -} - -/** - * tcp_ns_socks_init() - Bind sockets in namespace for outbound connections - * @arg: Execution context - * - * Return: 0 - */ -/* cppcheck-suppress [constParameterCallback, unmatchedSuppression] */ -static int tcp_ns_socks_init(void *arg) -{ - const struct ctx *c = (const struct ctx *)arg; - unsigned port; - - ns_enter(c); - - for (port = 0; port < NUM_PORTS; port++) { - if (!bitmap_isset(c->tcp.fwd_out.map, port)) - continue; - - tcp_ns_listen(c, port); - } - - return 0; -} - /** * tcp_sock_refill_pool() - Refill one pool of pre-opened sockets * @pool: Pool of sockets to refill @@ -2919,10 +2875,13 @@ int tcp_init(struct ctx *c)
tcp_sock_refill_init(c);
+ if (fwd_listen_sync(c, &c->tcp.fwd_in, PIF_HOST, IPPROTO_TCP) < 0) + return -1;
This needs an update to the function comment (which currently says "Return: 0, doesn't return on failure").
* Return: 0, doesn't return on failure
Fixed.
if (c->mode == MODE_PASTA) { tcp_splice_init(c); - - NS_CALL(tcp_ns_socks_init, c); + if (fwd_listen_sync(c, &c->tcp.fwd_out, + PIF_SPLICE, IPPROTO_TCP) < 0) + return -1; }
peek_offset_cap = (!c->ifi4 || tcp_probe_peek_offset_cap(AF_INET)) && @@ -2941,77 +2900,6 @@ int tcp_init(struct ctx *c) return 0; }
-/** - * tcp_port_rebind() - Rebind ports to match forward maps - * @c: Execution context - * @outbound: True to remap outbound forwards, otherwise inbound - * - * Must be called in namespace context if @outbound is true. - */ -static void tcp_port_rebind(struct ctx *c, bool outbound) -{ - const uint8_t *fmap = outbound ? c->tcp.fwd_out.map : c->tcp.fwd_in.map; - int (*socks)[IP_VERSIONS] = outbound ? tcp_sock_ns : tcp_sock_init_ext; - unsigned port; - - for (port = 0; port < NUM_PORTS; port++) { - if (!bitmap_isset(fmap, port)) { - if (socks[port][V4] >= 0) { - close(socks[port][V4]); - socks[port][V4] = -1; - } - - if (socks[port][V6] >= 0) { - close(socks[port][V6]); - socks[port][V6] = -1; - } - - continue; - } - - if ((c->ifi4 && socks[port][V4] == -1) || - (c->ifi6 && socks[port][V6] == -1)) { - if (outbound) - tcp_ns_listen(c, port); - else - tcp_listen(c, PIF_HOST, NULL, NULL, port); - } - } -} - -/** - * tcp_port_rebind_outbound() - Rebind ports in namespace - * @arg: Execution context - * - * Called with NS_CALL() - * - * Return: 0 - */ -static int tcp_port_rebind_outbound(void *arg) -{ - struct ctx *c = (struct ctx *)arg; - - ns_enter(c); - tcp_port_rebind(c, true); - - return 0; -} - -/** - * tcp_port_rebind_all() - Rebind ports to match forward maps (in host & ns) - * @c: Execution context - */ -void tcp_port_rebind_all(struct ctx *c) -{ - ASSERT(c->mode == MODE_PASTA && !c->no_tcp); - - if (c->tcp.fwd_out.mode == FWD_AUTO) - NS_CALL(tcp_port_rebind_outbound, c); - - if (c->tcp.fwd_in.mode == FWD_AUTO) - tcp_port_rebind(c, false); -} - /** * tcp_timer() - Periodic tasks: port detection, closed connections, pool refill * @c: Execution context diff --git a/tcp.h b/tcp.h index ef1e3544..45f97d93 100644 --- a/tcp.h +++ b/tcp.h @@ -22,7 +22,6 @@ int tcp_listen(const struct ctx *c, uint8_t pif, const union inany_addr *addr, const char *ifname, in_port_t port); int tcp_init(struct ctx *c); -void tcp_port_rebind_all(struct ctx *c); void tcp_timer(const struct ctx *c, const struct timespec *now); void tcp_defer_handler(struct ctx *c);
diff --git a/udp.c b/udp.c index d7dcb1d2..7c5546df 100644 --- a/udp.c +++ b/udp.c @@ -1203,98 +1203,6 @@ static void udp_splice_iov_init(void) } }
-/** - * udp_ns_listen() - Init socket to listen for spliced outbound connections - * @c: Execution context - * @port: Port, host order - */ -static void udp_ns_listen(const struct ctx *c, in_port_t port) -{ - ASSERT(!c->no_udp); - - if (!c->no_bindtodevice) { - udp_listen(c, PIF_SPLICE, NULL, "lo", port); - return; - } - - if (c->ifi4) - udp_listen(c, PIF_SPLICE, &inany_loopback4, NULL, port); - if (c->ifi6) - udp_listen(c, PIF_SPLICE, &inany_loopback6, NULL, port); -} - -/** - * udp_port_rebind() - Rebind ports to match forward maps - * @c: Execution context - * @outbound: True to remap outbound forwards, otherwise inbound - * - * Must be called in namespace context if @outbound is true. - */ -static void udp_port_rebind(struct ctx *c, bool outbound) -{ - int (*socks)[NUM_PORTS] = outbound ? udp_splice_ns : udp_splice_init; - const uint8_t *fmap - = outbound ? c->udp.fwd_out.map : c->udp.fwd_in.map; - unsigned port; - - for (port = 0; port < NUM_PORTS; port++) { - if (!bitmap_isset(fmap, port)) { - if (socks[V4][port] >= 0) { - close(socks[V4][port]); - socks[V4][port] = -1; - } - - if (socks[V6][port] >= 0) { - close(socks[V6][port]); - socks[V6][port] = -1; - } - - continue; - } - - if ((c->ifi4 && socks[V4][port] == -1) || - (c->ifi6 && socks[V6][port] == -1)) { - if (outbound) - udp_ns_listen(c, port); - else - udp_listen(c, PIF_HOST, NULL, NULL, port); - } - } -} - -/** - * udp_port_rebind_outbound() - Rebind ports in namespace - * @arg: Execution context - * - * Called with NS_CALL() - * - * Return: 0 - */ -static int udp_port_rebind_outbound(void *arg) -{ - struct ctx *c = (struct ctx *)arg; - - ns_enter(c); - udp_port_rebind(c, true); - - return 0; -} - -/** - * udp_port_rebind_all() - Rebind ports to match forward maps (in host & ns) - * @c: Execution context - */ -void udp_port_rebind_all(struct ctx *c) -{ - ASSERT(c->mode == MODE_PASTA && !c->no_udp); - - if (c->udp.fwd_out.mode == FWD_AUTO) - NS_CALL(udp_port_rebind_outbound, c); - - if (c->udp.fwd_in.mode == FWD_AUTO) - udp_port_rebind(c, false); -} - /** * udp_init() - Initialise per-socket data, and sockets in namespace * @c: Execution context @@ -1307,9 +1215,14 @@ int udp_init(struct ctx *c)
udp_iov_init(c);
+ if (fwd_listen_sync(c, &c->udp.fwd_in, PIF_HOST, IPPROTO_UDP) < 0) + return -1;
Same here, update to the function comment needed.
Fixed.
+ if (c->mode == MODE_PASTA) { udp_splice_iov_init(); - NS_CALL(udp_port_rebind_outbound, c); + if (fwd_listen_sync(c, &c->udp.fwd_out, + PIF_SPLICE, IPPROTO_UDP) < 0) + return -1; }
return 0; diff --git a/udp.h b/udp.h index 94c698e2..73efe036 100644 --- a/udp.h +++ b/udp.h @@ -19,7 +19,6 @@ int udp_listen(const struct ctx *c, uint8_t pif, const union inany_addr *addr, const char *ifname, in_port_t port); int udp_init(struct ctx *c); -void udp_port_rebind_all(struct ctx *c); void udp_update_l2_buf(const unsigned char *eth_d);
/**
-- Stefano
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Thu, 8 Jan 2026 13:29:46 +1100
David Gibson
Currently we remap port numbers based on the legacy delta[] array, which is indexed only by original port number, not the listening address. Now that we look up a forwarding rule entry in flow_target(), we can use this entry to directly determine the correct remapped port. Implement this, and remove the old delta[] array.
Link: https://bugs.passt.top/show_bug.cgi?id=187
Signed-off-by: David Gibson
--- flow.c | 7 ++++--- fwd.c | 21 +++++++-------------- fwd.h | 7 +++---- 3 files changed, 14 insertions(+), 21 deletions(-) diff --git a/flow.c b/flow.c index 045e9712..38f88732 100644 --- a/flow.c +++ b/flow.c @@ -493,6 +493,7 @@ struct flowside *flow_target(const struct ctx *c, union flow *flow, struct flow_common *f = &flow->f; const struct flowside *ini = &f->side[INISIDE]; struct flowside *tgt = &f->side[TGTSIDE]; + const struct fwd_rule *rule = NULL;
Coverity Scan complains about two cases where this NULL rule would be passed to NAT functions dereferencing it. They should be both false positives because the rule is always set on pif_is_socket(...), but I wonder how fragile it is. Regardless, it would be nice to avoid adding further warnings, if it's cheap (ASSERT() or check on rule in fwd_nat_from*() functions?). Anyway, here you go: /home/sbrivio/passt/flow.c:535:3: Type: Explicit null dereferenced (FORWARD_NULL) /home/sbrivio/passt/flow.c:496:2: 1. assign_zero: Assigning: "rule" = "NULL". /home/sbrivio/passt/flow.c:499:2: 2. path: Condition "flow_new_entry == flow", taking true branch. /home/sbrivio/passt/flow.c:499:2: 3. path: Condition "f->state == FLOW_STATE_INI", taking true branch. /home/sbrivio/passt/flow.c:500:2: 4. path: Condition "f->type == FLOW_TYPE_NONE", taking true branch. /home/sbrivio/passt/flow.c:501:2: 5. path: Condition "f->pif[0] != PIF_NONE", taking true branch. /home/sbrivio/passt/flow.c:501:2: 6. path: Condition "f->pif[1] == PIF_NONE", taking true branch. /home/sbrivio/passt/flow.c:502:2: 7. path: Condition "flow->f.state == FLOW_STATE_INI", taking true branch. /home/sbrivio/passt/flow.c:504:2: 8. path: Condition "pif_is_socket(f->pif[0])", taking false branch. /home/sbrivio/passt/flow.c:528:2: 9. path: Switch case value "PIF_SPLICE". /home/sbrivio/passt/flow.c:535:3: 10. var_deref_model: Passing null pointer "rule" to "fwd_nat_from_splice", which dereferences it. /home/sbrivio/passt/fwd.c:991:2: 10.1. path: Condition "!inany_is_loopback(&ini->eaddr)", taking false branch. /home/sbrivio/passt/fwd.c:991:2: 10.2. path: Condition "!inany_is_loopback(&ini->oaddr)", taking false branch. /home/sbrivio/passt/fwd.c:1008:2: 10.3. path: Condition "proto == IPPROTO_UDP", taking true branch. /home/sbrivio/passt/fwd.c:1012:2: 10.4. dereference: Dereferencing pointer "rule". /home/sbrivio/passt/flow.c:539:3: Type: Explicit null dereferenced (FORWARD_NULL) /home/sbrivio/passt/flow.c:496:2: 1. assign_zero: Assigning: "rule" = "NULL". /home/sbrivio/passt/flow.c:499:2: 2. path: Condition "flow_new_entry == flow", taking true branch. /home/sbrivio/passt/flow.c:499:2: 3. path: Condition "f->state == FLOW_STATE_INI", taking true branch. /home/sbrivio/passt/flow.c:500:2: 4. path: Condition "f->type == FLOW_TYPE_NONE", taking true branch. /home/sbrivio/passt/flow.c:501:2: 5. path: Condition "f->pif[0] != PIF_NONE", taking true branch. /home/sbrivio/passt/flow.c:501:2: 6. path: Condition "f->pif[1] == PIF_NONE", taking true branch. /home/sbrivio/passt/flow.c:502:2: 7. path: Condition "flow->f.state == FLOW_STATE_INI", taking true branch. /home/sbrivio/passt/flow.c:504:2: 8. path: Condition "pif_is_socket(f->pif[0])", taking false branch. /home/sbrivio/passt/flow.c:528:2: 9. path: Switch case value "PIF_HOST". /home/sbrivio/passt/flow.c:539:3: 10. var_deref_model: Passing null pointer "rule" to "fwd_nat_from_host", which dereferences it. /home/sbrivio/passt/fwd.c:1069:2: 10.1. dereference: Dereferencing pointer "rule".
uint8_t tgtpif = PIF_NONE;
ASSERT(flow_new_entry == flow && f->state == FLOW_STATE_INI); @@ -514,7 +515,7 @@ struct flowside *flow_target(const struct ctx *c, union flow *flow, else goto nofwd;
- if (!fwd_rule_search(fwd, ini)) { + if (!(rule = fwd_rule_search(fwd, ini))) { /* This shouldn't happen, because if there's no rule for * it we should have no listening socket that would let * us get here @@ -531,11 +532,11 @@ struct flowside *flow_target(const struct ctx *c, union flow *flow, break;
case PIF_SPLICE: - tgtpif = fwd_nat_from_splice(c, proto, ini, tgt); + tgtpif = fwd_nat_from_splice(rule, proto, ini, tgt); break;
case PIF_HOST: - tgtpif = fwd_nat_from_host(c, proto, ini, tgt); + tgtpif = fwd_nat_from_host(c, rule, proto, ini, tgt); fwd_neigh_mac_get(c, &tgt->oaddr, f->tap_omac); break; default: diff --git a/fwd.c b/fwd.c index 633ba5db..7c4575ff 100644 --- a/fwd.c +++ b/fwd.c @@ -405,7 +405,6 @@ void fwd_rule_add(struct fwd_ports *fwd, uint8_t flags, /* Fill in the legacy forwarding data structures to match the table */ if (!(new->flags & FWD_SCAN)) bitmap_set(fwd->map, port); - fwd->delta[port] = new->to - new->first; } }
@@ -978,7 +977,7 @@ uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto,
/** * fwd_nat_from_splice() - Determine to forward a flow from the splice interface - * @c: Execution context + * @rule: Forwarding rule to apply * @proto: Protocol (IP L4 protocol number) * @ini: Flow address information of the initiating side * @tgt: Flow address information on the target side (updated) @@ -986,7 +985,7 @@ uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto, * Return: pif of the target interface to forward the flow to, PIF_NONE if the * flow cannot or should not be forwarded at all. */ -uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto, +uint8_t fwd_nat_from_splice(const struct fwd_rule *rule, uint8_t proto, const struct flowside *ini, struct flowside *tgt) { if (!inany_is_loopback(&ini->eaddr) || @@ -1010,11 +1009,7 @@ uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto, /* But for UDP preserve the source port */ tgt->oport = ini->eport;
- tgt->eport = ini->oport; - if (proto == IPPROTO_TCP) - tgt->eport += c->tcp.fwd_out.delta[tgt->eport]; - else if (proto == IPPROTO_UDP) - tgt->eport += c->udp.fwd_out.delta[tgt->eport]; + tgt->eport = ini->oport - rule->first + rule->to;
This made me notice that the current documentation to struct fwd_rule doesn't really make it clear that n : 1 port mapping rules are not a thing, so, maybe, back to 2/14, instead of: * @to: Port number to forward port @first to. we could have perhaps: * @to: Target port for @first, port n maps to @to + (n - @first) ? By the way, I find this notation a bit complicated to figure out, I think that: rule->to + (ini->oport - rule->first) (redundant parentheses included) is actually clearer. Or maybe even with a temporary 'delta' variable. In both cases: the subtraction now happens in in_port_t, so I guess we should explicitly cast rule->first to int to be strictly correct.
return PIF_HOST; } @@ -1058,6 +1053,7 @@ bool nat_inbound(const struct ctx *c, const union inany_addr *addr, /** * fwd_nat_from_host() - Determine to forward a flow from the host interface * @c: Execution context + * @rule: Forwarding rule to apply * @proto: Protocol (IP L4 protocol number) * @ini: Flow address information of the initiating side * @tgt: Flow address information on the target side (updated) @@ -1065,15 +1061,12 @@ bool nat_inbound(const struct ctx *c, const union inany_addr *addr, * Return: pif of the target interface to forward the flow to, PIF_NONE if the * flow cannot or should not be forwarded at all. */ -uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, +uint8_t fwd_nat_from_host(const struct ctx *c, + const struct fwd_rule *rule, uint8_t proto, const struct flowside *ini, struct flowside *tgt) { /* Common for spliced and non-spliced cases */ - tgt->eport = ini->oport; - if (proto == IPPROTO_TCP) - tgt->eport += c->tcp.fwd_in.delta[tgt->eport]; - else if (proto == IPPROTO_UDP) - tgt->eport += c->udp.fwd_in.delta[tgt->eport]; + tgt->eport = ini->oport - rule->first + rule->to;
Same as above.
if (!c->no_splice && inany_is_loopback(&ini->eaddr) && (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { diff --git a/fwd.h b/fwd.h index a10bdbb4..cfe9ed46 100644 --- a/fwd.h +++ b/fwd.h @@ -82,7 +82,6 @@ enum fwd_ports_mode { * @count: Number of forwarding rules * @rules: Array of forwarding rules * @map: Bitmap describing which ports are forwarded - * @delta: Offset between the original destination and mapped port number * @listen_sock_count: Number of entries used in @listen_socks * @listen_socks: Listening sockets for forwarding */ @@ -93,7 +92,6 @@ struct fwd_ports { unsigned count; struct fwd_rule rules[MAX_FWD_RULES]; uint8_t map[PORT_BITMAP_SIZE]; - in_port_t delta[NUM_PORTS]; unsigned listen_sock_count; int listen_socks[MAX_LISTEN_SOCKS]; }; @@ -117,9 +115,10 @@ bool nat_inbound(const struct ctx *c, const union inany_addr *addr, union inany_addr *translated); uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto, const struct flowside *ini, struct flowside *tgt); -uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto, +uint8_t fwd_nat_from_splice(const struct fwd_rule *rule, uint8_t proto, const struct flowside *ini, struct flowside *tgt); -uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, +uint8_t fwd_nat_from_host(const struct ctx *c, + const struct fwd_rule *rule, uint8_t proto, const struct flowside *ini, struct flowside *tgt); void fwd_neigh_table_update(const struct ctx *c, const union inany_addr *addr, const uint8_t *mac, bool permanent);
-- Stefano
Silly nits only and a couple of remarks that will probably be clarified
at a later point:
On Thu, 8 Jan 2026 13:29:45 +1100
David Gibson
We now have a formal array of forwarding rules. However, we don't actually consult it when we forward a new flow. Instead we rely on (a) implicit information (we wouldn't be here if there wasn't a listening socket for the rule) and (b) the legacy delta[] data structure.
Start addressing this, by searching for a matching forwarding rule when attempting to forward a new flow. For now this is incomplete: * We only do this for socket-initiated flows * We make a potentially costly linear lookup * We don't actually use the matching rule for anything yet
We'll address each of those in later patches.
Signed-off-by: David Gibson
--- flow.c | 43 ++++++++++++++++++++++++++++++++++--------- fwd.c | 33 +++++++++++++++++++++++++++++++++ fwd.h | 2 ++ 3 files changed, 69 insertions(+), 9 deletions(-) diff --git a/flow.c b/flow.c index 4f534865..045e9712 100644 --- a/flow.c +++ b/flow.c @@ -489,7 +489,7 @@ struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif, struct flowside *flow_target(const struct ctx *c, union flow *flow, uint8_t proto) { - char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN]; + char estr[INANY_ADDRSTRLEN], ostr[INANY_ADDRSTRLEN]; struct flow_common *f = &flow->f; const struct flowside *ini = &f->side[INISIDE]; struct flowside *tgt = &f->side[TGTSIDE]; @@ -500,6 +500,30 @@ struct flowside *flow_target(const struct ctx *c, union flow *flow, ASSERT(f->pif[INISIDE] != PIF_NONE && f->pif[TGTSIDE] == PIF_NONE); ASSERT(flow->f.state == FLOW_STATE_INI);
+ if (pif_is_socket(f->pif[INISIDE])) { + const struct fwd_ports *fwd; + + if (f->pif[INISIDE] == PIF_HOST && proto == IPPROTO_TCP) + fwd = &c->tcp.fwd_in; + else if (f->pif[INISIDE] == PIF_HOST && proto == IPPROTO_UDP) + fwd = &c->udp.fwd_in; + else if (f->pif[INISIDE] == PIF_SPLICE && proto == IPPROTO_TCP) + fwd = &c->tcp.fwd_out; + else if (f->pif[INISIDE] == PIF_SPLICE && proto == IPPROTO_UDP) + fwd = &c->udp.fwd_out; + else + goto nofwd; + + if (!fwd_rule_search(fwd, ini)) { + /* This shouldn't happen, because if there's no rule for + * it we should have no listening socket that would let + * us get here + */ + flow_dbg(flow, "Unexpected missing forward rule"); + goto nofwd; + } + } + switch (f->pif[INISIDE]) { case PIF_TAP: memcpy(f->tap_omac, MAC_UNDEF, ETH_ALEN); @@ -514,22 +538,23 @@ struct flowside *flow_target(const struct ctx *c, union flow *flow, tgtpif = fwd_nat_from_host(c, proto, ini, tgt); fwd_neigh_mac_get(c, &tgt->oaddr, f->tap_omac); break; - default: - flow_err(flow, "No rules to forward %s [%s]:%hu -> [%s]:%hu", - pif_name(f->pif[INISIDE]), - inany_ntop(&ini->eaddr, estr, sizeof(estr)), - ini->eport, - inany_ntop(&ini->oaddr, fstr, sizeof(fstr)), - ini->oport); + goto nofwd; }
if (tgtpif == PIF_NONE) - return NULL; + goto nofwd;
f->pif[TGTSIDE] = tgtpif; flow_set_state(f, FLOW_STATE_TGT); return tgt; + +nofwd: + flow_err(flow, "No rules to forward %s %s [%s]:%hu -> [%s]:%hu", + pif_name(f->pif[INISIDE]), ipproto_name(proto), + inany_ntop(&ini->eaddr, estr, sizeof(estr)), ini->eport, + inany_ntop(&ini->oaddr, ostr, sizeof(ostr)), ini->oport);
This assumes we're still using this function for inbound forwards only (eaddr / eport -> oaddr / oport), perhaps we'll want a macro once it starts being used for the other way around as well (if at all). By the way, for rules, earlier in this series, you used "=>" to separate source and target of the forward, here it's still "->", I guess we should settle on one version (it just occurred to me while testing stuff: it might be useful to grep in logs).
+ return NULL; }
/** diff --git a/fwd.c b/fwd.c index 3f088fd2..633ba5db 100644 --- a/fwd.c +++ b/fwd.c @@ -409,6 +409,39 @@ void fwd_rule_add(struct fwd_ports *fwd, uint8_t flags, } }
+/** + * fwd_rule_match() - Does a prospective flow match a given forwarding rule
Does [...]?
+ * @rule: Forwarding rule + * @ini: Initiating side flow information + * + * Returns: true if the rule applies to the flow, false otherwise + */ +static bool fwd_rule_match(const struct fwd_rule *rule, + const struct flowside *ini) +{ + return inany_matches(&ini->oaddr, fwd_rule_addr(rule)) && + ini->oport >= rule->first && ini->oport <= rule->last;
The usual alignment is: return inany_matches(&ini->oaddr, fwd_rule_addr(rule)) && ini->oport >= rule->first && ini->oport <= rule->last;
+} + +/** + * fwd_rule_search() - Find a rule which matches a prospective flow + * @fwd: Forwarding table + * @ini: Initiating side flow information + * + * Returns: first matching rule, or NULL if there is none
I guess that this will eventually need to become a function matching the most specific rule first, tie breakers could be: 1. specific address given vs. wildcard 2. specific interface given vs. no interface 3. the day we support/need it: specific port/range vs. no port 4. smallest port range 5. the day we support/need something like this: longest prefix length and after this we should actually have an error on insertion (already guaranteed I think).
+ */ +const struct fwd_rule *fwd_rule_search(const struct fwd_ports *fwd, + const struct flowside *ini) +{ + unsigned i; + + for (i = 0; i < fwd->count; i++) { + if (fwd_rule_match(&fwd->rules[i], ini)) + return &fwd->rules[i]; + }
Extra newline here to clearly separate the two outcomes.
+ return NULL; +} + /** * fwd_rules_print() - Print forwarding rules for debugging * @fwd: Table to print diff --git a/fwd.h b/fwd.h index f84e7c01..a10bdbb4 100644 --- a/fwd.h +++ b/fwd.h @@ -103,6 +103,8 @@ struct fwd_ports { void fwd_rule_add(struct fwd_ports *fwd, uint8_t flags, const union inany_addr *addr, const char *ifname, in_port_t first, in_port_t last, in_port_t to); +const struct fwd_rule *fwd_rule_search(const struct fwd_ports *fwd, + const struct flowside *ini); void fwd_rules_print(const struct fwd_ports *fwd);
void fwd_scan_ports_init(struct ctx *c);
-- Stefano
On Thu, 8 Jan 2026 13:29:47 +1100
David Gibson
Now that we have a table of all our forwarding rules, every listening socket can be associated with a specific rule. Add an index allowing us to locate that rule from the socket's epoll reference. We don't use it yet, but we'll use it to optimise rule lookup when forwarding new flows.
Signed-off-by: David Gibson
--- fwd.c | 15 ++++++++++----- fwd.h | 5 +++++ tcp.c | 4 +++- tcp.h | 5 ++--- udp.c | 4 +++- udp.h | 5 ++--- 6 files changed, 25 insertions(+), 13 deletions(-) diff --git a/fwd.c b/fwd.c index 7c4575ff..6727d26f 100644 --- a/fwd.c +++ b/fwd.c @@ -474,6 +474,7 @@ void fwd_rules_print(const struct fwd_ports *fwd)
/** fwd_sync_one() - Create or remove listening sockets for a forward entry * @c: Execution context + * @fwd: Forwarding table * @rule: Forwarding rule * @pif: Interface to create listening sockets for * @proto: Protocol to listen for @@ -481,19 +482,23 @@ void fwd_rules_print(const struct fwd_ports *fwd) * * Return: 0 on success, -1 on failure */ -static int fwd_sync_one(const struct ctx *c, const struct fwd_rule *rule, +static int fwd_sync_one(const struct ctx *c, + const struct fwd_ports *fwd, const struct fwd_rule *rule, uint8_t pif, uint8_t proto, const uint8_t *scanmap) { const union inany_addr *addr = fwd_rule_addr(rule); const char *ifname = rule->ifname; bool bound_one = false; - unsigned port; + unsigned port, idx;
ASSERT(pif_is_socket(pif));
if (!*ifname) ifname = NULL;
+ idx = rule - fwd->rules; + ASSERT(idx < MAX_FWD_RULES); + for (port = rule->first; port <= rule->last; port++) { int fd = rule->socks[port - rule->first];
@@ -514,9 +519,9 @@ static int fwd_sync_one(const struct ctx *c, const struct fwd_rule *rule, }
if (proto == IPPROTO_TCP) - fd = tcp_listen(c, pif, addr, ifname, port); + fd = tcp_listen(c, pif, idx, addr, ifname, port); else if (proto == IPPROTO_UDP) - fd = udp_listen(c, pif, addr, ifname, port); + fd = udp_listen(c, pif, idx, addr, ifname, port); else ASSERT(0);
@@ -588,7 +593,7 @@ static int fwd_listen_sync_(void *arg) ns_enter(a->c);
for (i = 0; i < a->fwd->count; i++) { - a->ret = fwd_sync_one(a->c, &a->fwd->rules[i], + a->ret = fwd_sync_one(a->c, a->fwd, &a->fwd->rules[i], a->pif, a->proto, a->fwd->map); if (a->ret < 0) break; diff --git a/fwd.h b/fwd.h index cfe9ed46..435f422a 100644 --- a/fwd.h +++ b/fwd.h @@ -48,14 +48,19 @@ struct fwd_rule { * union fwd_listen_ref - information about a single listening socket * @port: Bound port number of the socket * @pif: pif in which the socket is listening + * @rule: Index of forwarding rule */ union fwd_listen_ref { struct { in_port_t port; uint8_t pif; +#define FWD_RULE_BITS 8 + unsigned rule :FWD_RULE_BITS; }; uint32_t u32; }; +static_assert(sizeof(union fwd_listen_ref) == sizeof(uint32_t));
Why do we need this, specifically?
+static_assert(MAX_FWD_RULES <= (1U << FWD_RULE_BITS));
I start wondering if instead of having a 'rule' field supporting 256 rules, with 128 as maximum number of rules, we could just have 256 as maximum number of rules and use the usual MAX_FROM_BITS() macro to keep things simpler. After all, it's not really rules[] taking space: struct fwd_ports { enum fwd_ports_mode mode; /* 0 4 */ int scan4; /* 4 4 */ int scan6; /* 8 4 */ unsigned int count; /* 12 4 */ struct fwd_rule rules[128]; /* 16 7168 */ /* --- cacheline 112 boundary (7168 bytes) was 16 bytes ago --- */ uint8_t map[8192]; /* 7184 8192 */ /* --- cacheline 240 boundary (15360 bytes) was 16 bytes ago --- */ unsigned int listen_sock_count; /* 15376 4 */ int listen_socks[196608]; /* 15380 786432 */ /* size: 801816, cachelines: 12529, members: 8 */ /* padding: 4 */ /* last cacheline: 24 bytes */ };
enum fwd_ports_mode { FWD_UNSET = 0, diff --git a/tcp.c b/tcp.c index e9b440da..fc03e38f 100644 --- a/tcp.c +++ b/tcp.c @@ -2672,18 +2672,20 @@ void tcp_sock_handler(const struct ctx *c, union epoll_ref ref, * tcp_listen() - Create listening socket * @c: Execution context * @pif: Interface to open the socket for (PIF_HOST or PIF_SPLICE) + * @rule: Index of relevant forwarding rule * @addr: Pointer to address for binding, NULL for any * @ifname: Name of interface to bind to, NULL for any * @port: Port, host order * * Return: Socket fd on success, negative error code on failure */ -int tcp_listen(const struct ctx *c, uint8_t pif, +int tcp_listen(const struct ctx *c, uint8_t pif, unsigned rule, const union inany_addr *addr, const char *ifname, in_port_t port) { union fwd_listen_ref ref = { .port = port, .pif = pif, + .rule = rule, }; int s;
diff --git a/tcp.h b/tcp.h index 45f97d93..24b90870 100644 --- a/tcp.h +++ b/tcp.h @@ -18,9 +18,8 @@ void tcp_sock_handler(const struct ctx *c, union epoll_ref ref, int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, const void *saddr, const void *daddr, uint32_t flow_lbl, const struct pool *p, int idx, const struct timespec *now); -int tcp_listen(const struct ctx *c, uint8_t pif, - const union inany_addr *addr, const char *ifname, - in_port_t port); +int tcp_listen(const struct ctx *c, uint8_t pif, unsigned rule, + const union inany_addr *addr, const char *ifname, in_port_t port); int tcp_init(struct ctx *c); void tcp_timer(const struct ctx *c, const struct timespec *now); void tcp_defer_handler(struct ctx *c); diff --git a/udp.c b/udp.c index 92a87198..761221f6 100644 --- a/udp.c +++ b/udp.c @@ -1115,18 +1115,20 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif, * udp_listen() - Initialise listening socket for a given port * @c: Execution context * @pif: Interface to open the socket for (PIF_HOST or PIF_SPLICE) + * @rule: Index of relevant forwarding rule * @addr: Pointer to address for binding, NULL if not configured * @ifname: Name of interface to bind to, NULL if not configured * @port: Port, host order * * Return: Socket fd on success, negative error code on failure */ -int udp_listen(const struct ctx *c, uint8_t pif, +int udp_listen(const struct ctx *c, uint8_t pif, unsigned rule, const union inany_addr *addr, const char *ifname, in_port_t port) { union fwd_listen_ref ref = { .pif = pif, .port = port, + .rule = rule, }; int s;
diff --git a/udp.h b/udp.h index 3c6f90a9..2b91d728 100644 --- a/udp.h +++ b/udp.h @@ -14,9 +14,8 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, const void *saddr, const void *daddr, uint8_t ttl, const struct pool *p, int idx, const struct timespec *now); -int udp_listen(const struct ctx *c, uint8_t pif, - const union inany_addr *addr, const char *ifname, - in_port_t port); +int udp_listen(const struct ctx *c, uint8_t pif, unsigned rule, + const union inany_addr *addr, const char *ifname, in_port_t port); int udp_init(struct ctx *c); void udp_update_l2_buf(const unsigned char *eth_d);
-- Stefano
On Thu, 8 Jan 2026 13:29:48 +1100
David Gibson
Now that listening sockets include a reference to the forwarding rule which created them we can, in many cases, avoid a linear search of the forwarding table when we want to find the relevant rule. Instead we can take the rule index from the socket's epoll reference, and use that to immediately find the correct rule.
This is conceptually simple, but requires a moderate amount of plumbing to get the index from the reference through to the rule lookup. We still allow fall back to linear search if we don't have the index, and this may (rarely) be used in the udp_flush_flow() case, where we could get packets for one flow on a different flow's socket, rather than through a listening socket as usual.
Signed-off-by: David Gibson
--- flow.c | 13 +++++++++++-- flow_table.h | 2 +- fwd.c | 3 +-- fwd.h | 1 + icmp.c | 2 +- tcp.c | 4 ++-- udp.c | 14 +++++++++----- udp_flow.c | 14 ++++++++------ udp_flow.h | 2 +- udp_internal.h | 4 ++-- 10 files changed, 37 insertions(+), 22 deletions(-) diff --git a/flow.c b/flow.c index 38f88732..85759970 100644 --- a/flow.c +++ b/flow.c @@ -482,12 +482,13 @@ struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif, * flow_target() - Determine where flow should forward to, and move to TGT * @c: Execution context * @flow: Flow to forward + * @rule_hint: Index of relevant forwarding rule, or -1 if unknown * @proto: Protocol * * Return: pointer to the target flowside information */ struct flowside *flow_target(const struct ctx *c, union flow *flow, - uint8_t proto) + int rule_hint, uint8_t proto) { char estr[INANY_ADDRSTRLEN], ostr[INANY_ADDRSTRLEN]; struct flow_common *f = &flow->f; @@ -515,7 +516,15 @@ struct flowside *flow_target(const struct ctx *c, union flow *flow, else goto nofwd;
- if (!(rule = fwd_rule_search(fwd, ini))) { + if (rule_hint >= 0) { + ASSERT((unsigned)rule_hint < fwd->count); + rule = &fwd->rules[rule_hint]; + if (!fwd_rule_match(rule, ini)) { + flow_dbg(flow, + "Unexpected mismatching forward rule"); + goto nofwd; + } + } else if (!(rule = fwd_rule_search(fwd, ini))) {
*If* we need this fwd_rule_search() / fwd_rule_match() / else if fwd_rule_search() pattern in more places, it might be convenient to hide the complexity by taking 'rule_hint' as argument for fwd_rule_search() perhaps, and then do something like this in fwd_rule_search(): for (i = hint == -1 ? 0 : hint; i < fwd->count; i++) { if (fwd_rule_match(&fwd->rules[i], ini)) return &fwd->rules[i]; if (hint) break; } return NULL; but I haven't really thought this through. It makes fwd_rule_search() a bit ugly.
/* This shouldn't happen, because if there's no rule for * it we should have no listening socket that would let * us get here diff --git a/flow_table.h b/flow_table.h index 5ee13acc..73de13ba 100644 --- a/flow_table.h +++ b/flow_table.h @@ -207,7 +207,7 @@ const struct flowside *flow_target_af(union flow *flow, uint8_t pif, const void *saddr, in_port_t sport, const void *daddr, in_port_t dport); struct flowside *flow_target(const struct ctx *c, union flow *flow, - uint8_t proto); + int rule_hint, uint8_t proto);
union flow *flow_set_type(union flow *flow, enum flow_type type); #define FLOW_SET_TYPE(flow_, t_, var_) (&flow_set_type((flow_), (t_))->var_) diff --git a/fwd.c b/fwd.c index 6727d26f..250b470c 100644 --- a/fwd.c +++ b/fwd.c @@ -415,8 +415,7 @@ void fwd_rule_add(struct fwd_ports *fwd, uint8_t flags, * * Returns: true if the rule applies to the flow, false otherwise */ -static bool fwd_rule_match(const struct fwd_rule *rule, - const struct flowside *ini) +bool fwd_rule_match(const struct fwd_rule *rule, const struct flowside *ini) { return inany_matches(&ini->oaddr, fwd_rule_addr(rule)) && ini->oport >= rule->first && ini->oport <= rule->last; diff --git a/fwd.h b/fwd.h index 435f422a..2af7791d 100644 --- a/fwd.h +++ b/fwd.h @@ -106,6 +106,7 @@ struct fwd_ports { void fwd_rule_add(struct fwd_ports *fwd, uint8_t flags, const union inany_addr *addr, const char *ifname, in_port_t first, in_port_t last, in_port_t to); +bool fwd_rule_match(const struct fwd_rule *rule, const struct flowside *ini); const struct fwd_rule *fwd_rule_search(const struct fwd_ports *fwd, const struct flowside *ini); void fwd_rules_print(const struct fwd_ports *fwd); diff --git a/icmp.c b/icmp.c index 9564c496..0f4d48bb 100644 --- a/icmp.c +++ b/icmp.c @@ -183,7 +183,7 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, return NULL;
flow_initiate_af(flow, PIF_TAP, af, saddr, id, daddr, id); - if (!(tgt = flow_target(c, flow, proto))) + if (!(tgt = flow_target(c, flow, -1, proto))) goto cancel;
if (flow->f.pif[TGTSIDE] != PIF_HOST) { diff --git a/tcp.c b/tcp.c index fc03e38f..89b22a51 100644 --- a/tcp.c +++ b/tcp.c @@ -1657,7 +1657,7 @@ static void tcp_conn_from_tap(const struct ctx *c, sa_family_t af, ini = flow_initiate_af(flow, PIF_TAP, af, saddr, srcport, daddr, dstport);
- if (!(tgt = flow_target(c, flow, IPPROTO_TCP))) + if (!(tgt = flow_target(c, flow, -1, IPPROTO_TCP))) goto cancel;
if (flow->f.pif[TGTSIDE] != PIF_HOST) { @@ -2496,7 +2496,7 @@ void tcp_listen_handler(const struct ctx *c, union epoll_ref ref, goto cancel; }
- if (!flow_target(c, flow, IPPROTO_TCP)) + if (!flow_target(c, flow, ref.listen.rule, IPPROTO_TCP)) goto cancel;
switch (flow->f.pif[TGTSIDE]) { diff --git a/udp.c b/udp.c index 761221f6..d7e47e52 100644 --- a/udp.c +++ b/udp.c @@ -838,12 +838,13 @@ static void udp_buf_sock_to_tap(const struct ctx *c, int s, int n, * udp_sock_fwd() - Forward datagrams from a possibly unconnected socket * @c: Execution context * @s: Socket to forward from + * @rule_hint: Forwarding rule to use, or -1 if unknown * @frompif: Interface to which @s belongs * @port: Our (local) port number of @s * @now: Current timestamp */ -void udp_sock_fwd(const struct ctx *c, int s, uint8_t frompif, - in_port_t port, const struct timespec *now) +void udp_sock_fwd(const struct ctx *c, int s, int rule_hint, + uint8_t frompif, in_port_t port, const struct timespec *now) { union sockaddr_inany src; union inany_addr dst; @@ -868,7 +869,8 @@ void udp_sock_fwd(const struct ctx *c, int s, uint8_t frompif, continue; }
- tosidx = udp_flow_from_sock(c, frompif, &dst, port, &src, now); + tosidx = udp_flow_from_sock(c, frompif, &dst, port, &src, + rule_hint, now); topif = pif_at_sidx(tosidx);
if (pif_is_socket(topif)) { @@ -910,8 +912,10 @@ void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events, const struct timespec *now) { - if (events & (EPOLLERR | EPOLLIN)) - udp_sock_fwd(c, ref.fd, ref.listen.pif, ref.listen.port, now); + if (events & (EPOLLERR | EPOLLIN)) { + udp_sock_fwd(c, ref.fd, ref.listen.rule, + ref.listen.pif, ref.listen.port, now); + } }
/** diff --git a/udp_flow.c b/udp_flow.c index 8907f2f7..b82c6c06 100644 --- a/udp_flow.c +++ b/udp_flow.c @@ -139,6 +139,7 @@ static int udp_flow_sock(const struct ctx *c, * udp_flow_new() - Common setup for a new UDP flow * @c: Execution context * @flow: Initiated flow + * @rule_hint: Index of forwarding rule, or -1 if unknown * @now: Timestamp * * Return: sidx for the target side of the new UDP flow, or FLOW_SIDX_NONE @@ -147,13 +148,13 @@ static int udp_flow_sock(const struct ctx *c, * #syscalls getsockname */ static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow, - const struct timespec *now) + int rule_hint, const struct timespec *now) { struct udp_flow *uflow = NULL; const struct flowside *tgt; unsigned sidei;
- if (!(tgt = flow_target(c, flow, IPPROTO_UDP))) + if (!(tgt = flow_target(c, flow, rule_hint, IPPROTO_UDP))) goto cancel;
uflow = FLOW_SET_TYPE(flow, FLOW_UDP, udp); @@ -216,6 +217,7 @@ cancel: * @dst: Our (local) address to which the datagram is arriving * @port: Our (local) port number to which the datagram is arriving * @s_in: Source socket address, filled in by recvmmsg() + * @rule_hint: Index of forwarding rule, or -1 if unknown * @now: Timestamp * * #syscalls fcntl arm:fcntl64 ppc64:fcntl64|fcntl i686:fcntl64 @@ -226,7 +228,7 @@ cancel: flow_sidx_t udp_flow_from_sock(const struct ctx *c, uint8_t pif, const union inany_addr *dst, in_port_t port, const union sockaddr_inany *s_in, - const struct timespec *now) + int rule_hint, const struct timespec *now) { const struct flowside *ini; struct udp_flow *uflow; @@ -260,7 +262,7 @@ flow_sidx_t udp_flow_from_sock(const struct ctx *c, uint8_t pif, return FLOW_SIDX_NONE; }
- return udp_flow_new(c, flow, now); + return udp_flow_new(c, flow, rule_hint, now); }
/** @@ -316,7 +318,7 @@ flow_sidx_t udp_flow_from_tap(const struct ctx *c, return FLOW_SIDX_NONE; }
- return udp_flow_new(c, flow, now); + return udp_flow_new(c, flow, -1, now); }
/** @@ -332,7 +334,7 @@ static void udp_flush_flow(const struct ctx *c, { /* We don't know exactly where the datagrams will come from, but we know * they'll have an interface and oport matching this flow */ - udp_sock_fwd(c, uflow->s[sidei], uflow->f.pif[sidei], + udp_sock_fwd(c, -1, uflow->s[sidei], uflow->f.pif[sidei],
I guess this is fixed on your local branch because this patch (and only this one) applied with some fuzz, but in case it's not... this passes -1 as socket ('s' argument), not as 'rule_hint'. If I swap it with uflow->s[sidei] it all works as expected.
uflow->f.side[sidei].oport, now); }
diff --git a/udp_flow.h b/udp_flow.h index 4c528e95..14e0f920 100644 --- a/udp_flow.h +++ b/udp_flow.h @@ -35,7 +35,7 @@ struct udp_flow *udp_at_sidx(flow_sidx_t sidx); flow_sidx_t udp_flow_from_sock(const struct ctx *c, uint8_t pif, const union inany_addr *dst, in_port_t port, const union sockaddr_inany *s_in, - const struct timespec *now); + int rule_hint, const struct timespec *now); flow_sidx_t udp_flow_from_tap(const struct ctx *c, uint8_t pif, sa_family_t af, const void *saddr, const void *daddr, diff --git a/udp_internal.h b/udp_internal.h index 96d11cff..f0ce8f14 100644 --- a/udp_internal.h +++ b/udp_internal.h @@ -28,7 +28,7 @@ size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp, size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp, const struct flowside *toside, size_t dlen, bool no_udp_csum); -void udp_sock_fwd(const struct ctx *c, int s, uint8_t frompif, - in_port_t port, const struct timespec *now); +void udp_sock_fwd(const struct ctx *c, int s, int rule_hint, + uint8_t frompif, in_port_t port, const struct timespec *now);
#endif /* UDP_INTERNAL_H */
So, uh, having reached this point, I was asking myself "is this really everything"? :) Well, it's not exactly everything as you mentioned in the cover letter, perhaps the biggest missing part (other than client / interface) is outbound forwarding, but still, reaching this point with just 14 relatively small patches is quite impressive. I guess it's also merit of how generic / well structured the flow table and especially the flow "stages" are. -- Stefano
On Tue, 13 Jan 2026 16:12:02 +1100
David Gibson
On Tue, Jan 13, 2026 at 12:26:10AM +0100, Stefano Brivio wrote:
On Thu, 8 Jan 2026 13:29:36 +1100 David Gibson
wrote: +/** + * fwd_rules_print() - Print forwarding rules for debugging + * @fwd: Table to print + */ +void fwd_rules_print(const struct fwd_ports *fwd) +{ + unsigned i; + + for (i = 0; i < fwd->count; i++) { + const struct fwd_rule *rule = &fwd->rules[i]; + const char *weak = rule->flags & FWD_WEAK ? " WEAK" : "";
Should we print " might fail" or " can fail" instead of " WEAK"? This is for users.
Good point. THough I'm not sure "might fail" or "can fail" is terribly clear in context either. I've gone with " (best effort)" for now.
Or maybe " (if available)"? I always find "best effort" a bit ambiguous because, well, it's a pretty good effort, being it's the best one, but it actually means we'll just give it a quick try, once. No strong preference though, "best effort" is rather idiomatic anyway. -- Stefano
On Tue, 13 Jan 2026 16:28:27 +1100
David Gibson
On Tue, Jan 13, 2026 at 12:26:22AM +0100, Stefano Brivio wrote:
On Thu, 8 Jan 2026 13:29:39 +1100 David Gibson
wrote: @@ -74,6 +83,8 @@ enum fwd_ports_mode { * @rules: Array of forwarding rules * @map: Bitmap describing which ports are forwarded * @delta: Offset between the original destination and mapped port number + * @listen_sock_count: Number of entries used in @listen_socks + * @listen_socks: Listening sockets for forwarding
To keep those aligned:
/** * fwd_ports() - Describes port forwarding for one protocol and direction * @mode: Overall forwarding mode (all, none, auto, some ports) * @scan4: /proc/net fd to scan for IPv4 ports when in AUTO mode * @scan6: /proc/net fd to scan for IPv6 ports when in AUTO mode * @count: Number of forwarding rules * @rules: Array of forwarding rules * @map: Bitmap describing which ports are forwarded * @delta: Offset between original and mapped destination port * @listen_sock_count: Number of entries used in @listen_socks * @listen_socks: Listening sockets for forwarding */
Done. I'd also be very open to more succinct names for these fields, but they haven't occurred to me yet.
@sockets: Array of listening sockets for inbound port forwarding @sockets_count: Count of used @sockets ? I mean, they're obviously listening sockets. -- Stefano
On Tue, 13 Jan 2026 16:38:58 +1100
David Gibson
On Tue, Jan 13, 2026 at 12:26:36AM +0100, Stefano Brivio wrote:
On Thu, 8 Jan 2026 13:29:41 +1100 David Gibson
wrote: Previously we created inbound listening sockets as we parsed the forwarding options (-t, -u) whereas outbound listening sockets were created during {tcp,udp}_init(). Now that we have a data structure recording the full details of the listening options we can move all listening socket creation to {tcp,udp}_init(). This means that errors for either direction are detected and reported the same way.
Introduce fwd_listen_sync() which synchronizes the state of listening sockets to the forwarding rules table, both for fixed and automatic forwards.
This does cause a change in semantics for "exclude only" port specifications. Previously an option like -t ~6000 wouldn't cause a fatal error, as long as we could bind at least one port. Now, it requires at least one port for each generated rule; that is for each of the contiguous blocks of ports the specification resolves to. With typical ephemeral ports settings that's one port each in 1..5999, 6001..32767 and 61000..65535.
Preserving the exact behaviour for this case would require a considerably more complex data structure, so I'm hoping this is a sufficiently niche case for the change to be acceptable.
I guess so too, I wouldn't really worry.
Well, I'm not sure if it works, but one relatively simple idea could be to have a "with_prev" bit in the rule struct representing the fact that the current rule was derived from the same port specification as the previous rule, which implies they would need to be deleted all together (but we can happily enforce that).
Then, in the fwd_listen_sync_() loop, before reporting failure, you would check the next entry: if the "with_prev" bit is set, report failure only if we fail (keeping a local boolean flag) for all the entries up to the first one with "with_prev" unset.
I'll keep that approach in mind if it seems like we need it.
I would be inclined to say it's worth it if it's that simple, but I haven't tried, so I might be very well missing something.
I also considered making WEAK mean we'd always continue on listen failures, even if all of them fail. Maybe that's a bit unexpected? But it would allow an option to "forward single port X, if you can" which seems like it might be useful.
I think that's a different feature, perhaps it would need its own flag. But I don't think we should have it on by default with the current command-line interface. Perhaps the client should have a different command/specifier to add a set of ranges which can all fail entirely. But I guess it risks being a bit too much for the command line. By the way, that should make more sense once we add the possibility of specifying ports or ranges for automatic port forwarding, or for https://bugs.passt.top/show_bug.cgi?id=131, because at that point it becomes "when you can"... and we would already have that implemented, actually. -- Stefano
On Tue, Jan 13, 2026 at 11:13:31PM +0100, Stefano Brivio wrote:
On Tue, 13 Jan 2026 16:28:27 +1100 David Gibson
wrote: On Tue, Jan 13, 2026 at 12:26:22AM +0100, Stefano Brivio wrote:
On Thu, 8 Jan 2026 13:29:39 +1100 David Gibson
wrote: @@ -74,6 +83,8 @@ enum fwd_ports_mode { * @rules: Array of forwarding rules * @map: Bitmap describing which ports are forwarded * @delta: Offset between the original destination and mapped port number + * @listen_sock_count: Number of entries used in @listen_socks + * @listen_socks: Listening sockets for forwarding
To keep those aligned:
/** * fwd_ports() - Describes port forwarding for one protocol and direction * @mode: Overall forwarding mode (all, none, auto, some ports) * @scan4: /proc/net fd to scan for IPv4 ports when in AUTO mode * @scan6: /proc/net fd to scan for IPv6 ports when in AUTO mode * @count: Number of forwarding rules * @rules: Array of forwarding rules * @map: Bitmap describing which ports are forwarded * @delta: Offset between original and mapped destination port * @listen_sock_count: Number of entries used in @listen_socks * @listen_socks: Listening sockets for forwarding */
Done. I'd also be very open to more succinct names for these fields, but they haven't occurred to me yet.
@sockets: Array of listening sockets for inbound port forwarding @sockets_count: Count of used @sockets
? I mean, they're obviously listening sockets.
I wasn't sure about that. But that fact it's obvious to you is evidence in that direction. So, I've gone with 'sock_count' and 'socks'. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Tue, Jan 13, 2026 at 11:12:01PM +0100, Stefano Brivio wrote:
Silly nits only and a couple of remarks that will probably be clarified at a later point:
On Thu, 8 Jan 2026 13:29:45 +1100 David Gibson
wrote: We now have a formal array of forwarding rules. However, we don't actually consult it when we forward a new flow. Instead we rely on (a) implicit information (we wouldn't be here if there wasn't a listening socket for the rule) and (b) the legacy delta[] data structure.
Start addressing this, by searching for a matching forwarding rule when attempting to forward a new flow. For now this is incomplete: * We only do this for socket-initiated flows * We make a potentially costly linear lookup * We don't actually use the matching rule for anything yet
We'll address each of those in later patches.
Signed-off-by: David Gibson
--- flow.c | 43 ++++++++++++++++++++++++++++++++++--------- fwd.c | 33 +++++++++++++++++++++++++++++++++ fwd.h | 2 ++ 3 files changed, 69 insertions(+), 9 deletions(-) diff --git a/flow.c b/flow.c index 4f534865..045e9712 100644 --- a/flow.c +++ b/flow.c @@ -489,7 +489,7 @@ struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif, struct flowside *flow_target(const struct ctx *c, union flow *flow, uint8_t proto) { - char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN]; + char estr[INANY_ADDRSTRLEN], ostr[INANY_ADDRSTRLEN]; struct flow_common *f = &flow->f; const struct flowside *ini = &f->side[INISIDE]; struct flowside *tgt = &f->side[TGTSIDE]; @@ -500,6 +500,30 @@ struct flowside *flow_target(const struct ctx *c, union flow *flow, ASSERT(f->pif[INISIDE] != PIF_NONE && f->pif[TGTSIDE] == PIF_NONE); ASSERT(flow->f.state == FLOW_STATE_INI);
+ if (pif_is_socket(f->pif[INISIDE])) { + const struct fwd_ports *fwd; + + if (f->pif[INISIDE] == PIF_HOST && proto == IPPROTO_TCP) + fwd = &c->tcp.fwd_in; + else if (f->pif[INISIDE] == PIF_HOST && proto == IPPROTO_UDP) + fwd = &c->udp.fwd_in; + else if (f->pif[INISIDE] == PIF_SPLICE && proto == IPPROTO_TCP) + fwd = &c->tcp.fwd_out; + else if (f->pif[INISIDE] == PIF_SPLICE && proto == IPPROTO_UDP) + fwd = &c->udp.fwd_out; + else + goto nofwd; + + if (!fwd_rule_search(fwd, ini)) { + /* This shouldn't happen, because if there's no rule for + * it we should have no listening socket that would let + * us get here + */ + flow_dbg(flow, "Unexpected missing forward rule"); + goto nofwd; + } + } + switch (f->pif[INISIDE]) { case PIF_TAP: memcpy(f->tap_omac, MAC_UNDEF, ETH_ALEN); @@ -514,22 +538,23 @@ struct flowside *flow_target(const struct ctx *c, union flow *flow, tgtpif = fwd_nat_from_host(c, proto, ini, tgt); fwd_neigh_mac_get(c, &tgt->oaddr, f->tap_omac); break; - default: - flow_err(flow, "No rules to forward %s [%s]:%hu -> [%s]:%hu", - pif_name(f->pif[INISIDE]), - inany_ntop(&ini->eaddr, estr, sizeof(estr)), - ini->eport, - inany_ntop(&ini->oaddr, fstr, sizeof(fstr)), - ini->oport); + goto nofwd; }
if (tgtpif == PIF_NONE) - return NULL; + goto nofwd;
f->pif[TGTSIDE] = tgtpif; flow_set_state(f, FLOW_STATE_TGT); return tgt; + +nofwd: + flow_err(flow, "No rules to forward %s %s [%s]:%hu -> [%s]:%hu", + pif_name(f->pif[INISIDE]), ipproto_name(proto), + inany_ntop(&ini->eaddr, estr, sizeof(estr)), ini->eport, + inany_ntop(&ini->oaddr, ostr, sizeof(ostr)), ini->oport);
This assumes we're still using this function for inbound forwards only (eaddr / eport -> oaddr / oport), perhaps we'll want a macro once it starts being used for the other way around as well (if at all).
No, that's correct for outbound as well. Both the addresses are from the initiating side (there is no target side, because we can't forward). It's always src -> dst for the exchange initiating the flow.
By the way, for rules, earlier in this series, you used "=>" to separate source and target of the forward, here it's still "->", I guess we should settle on one version (it just occurred to me while testing stuff: it might be useful to grep in logs).
As above, this is actually consistent. => separates sides, both addresses are on the initiating side here.
+ return NULL; }
/** diff --git a/fwd.c b/fwd.c index 3f088fd2..633ba5db 100644 --- a/fwd.c +++ b/fwd.c @@ -409,6 +409,39 @@ void fwd_rule_add(struct fwd_ports *fwd, uint8_t flags, } }
+/** + * fwd_rule_match() - Does a prospective flow match a given forwarding rule
Does [...]?
Fixed.
+ * @rule: Forwarding rule + * @ini: Initiating side flow information + * + * Returns: true if the rule applies to the flow, false otherwise + */ +static bool fwd_rule_match(const struct fwd_rule *rule, + const struct flowside *ini) +{ + return inany_matches(&ini->oaddr, fwd_rule_addr(rule)) && + ini->oport >= rule->first && ini->oport <= rule->last;
The usual alignment is:
return inany_matches(&ini->oaddr, fwd_rule_addr(rule)) && ini->oport >= rule->first && ini->oport <= rule->last;
Huh. My emacs config almost always matches the passt style, but not in this case (I think it treats bare expressions like this differently from function parameters).
+} + +/** + * fwd_rule_search() - Find a rule which matches a prospective flow + * @fwd: Forwarding table + * @ini: Initiating side flow information + * + * Returns: first matching rule, or NULL if there is none
I guess that this will eventually need to become a function matching the most specific rule first, tie breakers could be:
Maybe, but I'm hoping not. The current model I have in mind is always first rule matches - if you want most specific first, then you need to order them like that. Potentially re-ordering by specificity is something the update client could do. In any case, I think that's something we want to avoid doing at the point we actually look up the table.
1. specific address given vs. wildcard 2. specific interface given vs. no interface 3. the day we support/need it: specific port/range vs. no port 4. smallest port range 5. the day we support/need something like this: longest prefix length
I'm not sure specificity rules will always give a total order. Having an explicit order lets is disambiguate such cases.
and after this we should actually have an error on insertion (already guaranteed I think).
That's a good idea. Even while it's first-rule-wins, we could give an error/warning if a rule is added that's unreachable because an earlier one will always win. I think that's slightly more complicated than the conflicting rules checking I already implemented, but not dramatically so.
+ */ +const struct fwd_rule *fwd_rule_search(const struct fwd_ports *fwd, + const struct flowside *ini) +{ + unsigned i; + + for (i = 0; i < fwd->count; i++) { + if (fwd_rule_match(&fwd->rules[i], ini)) + return &fwd->rules[i]; + }
Extra newline here to clearly separate the two outcomes.
Done.
+ return NULL; +} + /** * fwd_rules_print() - Print forwarding rules for debugging * @fwd: Table to print diff --git a/fwd.h b/fwd.h index f84e7c01..a10bdbb4 100644 --- a/fwd.h +++ b/fwd.h @@ -103,6 +103,8 @@ struct fwd_ports { void fwd_rule_add(struct fwd_ports *fwd, uint8_t flags, const union inany_addr *addr, const char *ifname, in_port_t first, in_port_t last, in_port_t to); +const struct fwd_rule *fwd_rule_search(const struct fwd_ports *fwd, + const struct flowside *ini); void fwd_rules_print(const struct fwd_ports *fwd);
void fwd_scan_ports_init(struct ctx *c);
-- Stefano
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Tue, Jan 13, 2026 at 11:13:36PM +0100, Stefano Brivio wrote:
On Tue, 13 Jan 2026 16:38:58 +1100 David Gibson
wrote: On Tue, Jan 13, 2026 at 12:26:36AM +0100, Stefano Brivio wrote:
On Thu, 8 Jan 2026 13:29:41 +1100 David Gibson
wrote: Previously we created inbound listening sockets as we parsed the forwarding options (-t, -u) whereas outbound listening sockets were created during {tcp,udp}_init(). Now that we have a data structure recording the full details of the listening options we can move all listening socket creation to {tcp,udp}_init(). This means that errors for either direction are detected and reported the same way.
Introduce fwd_listen_sync() which synchronizes the state of listening sockets to the forwarding rules table, both for fixed and automatic forwards.
This does cause a change in semantics for "exclude only" port specifications. Previously an option like -t ~6000 wouldn't cause a fatal error, as long as we could bind at least one port. Now, it requires at least one port for each generated rule; that is for each of the contiguous blocks of ports the specification resolves to. With typical ephemeral ports settings that's one port each in 1..5999, 6001..32767 and 61000..65535.
Preserving the exact behaviour for this case would require a considerably more complex data structure, so I'm hoping this is a sufficiently niche case for the change to be acceptable.
I guess so too, I wouldn't really worry.
Well, I'm not sure if it works, but one relatively simple idea could be to have a "with_prev" bit in the rule struct representing the fact that the current rule was derived from the same port specification as the previous rule, which implies they would need to be deleted all together (but we can happily enforce that).
Then, in the fwd_listen_sync_() loop, before reporting failure, you would check the next entry: if the "with_prev" bit is set, report failure only if we fail (keeping a local boolean flag) for all the entries up to the first one with "with_prev" unset.
I'll keep that approach in mind if it seems like we need it.
I would be inclined to say it's worth it if it's that simple, but I haven't tried, so I might be very well missing something.
I also considered making WEAK mean we'd always continue on listen failures, even if all of them fail. Maybe that's a bit unexpected? But it would allow an option to "forward single port X, if you can" which seems like it might be useful.
I think that's a different feature, perhaps it would need its own flag. But I don't think we should have it on by default with the current command-line interface.
Perhaps the client should have a different command/specifier to add a set of ranges which can all fail entirely. But I guess it risks being a bit too much for the command line.
That's fair. Hmm... for the client protocol, we could potentially return the number of sockets we managed to bind(), and the client could decide if it's good enough. Setting up a listening rule already can't really be atomic, so I think it's probably ok for the client to have to explicitly delete/rollback. Not sure if that's a good idea, but it's a thought.
By the way, that should make more sense once we add the possibility of specifying ports or ranges for automatic port forwarding, or for https://bugs.passt.top/show_bug.cgi?id=131, because at that point it becomes "when you can"... and we would already have that implemented, actually.
Right. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Tue, Jan 13, 2026 at 11:13:26PM +0100, Stefano Brivio wrote:
On Tue, 13 Jan 2026 16:12:02 +1100 David Gibson
wrote: On Tue, Jan 13, 2026 at 12:26:10AM +0100, Stefano Brivio wrote:
On Thu, 8 Jan 2026 13:29:36 +1100 David Gibson
wrote: +/** + * fwd_rules_print() - Print forwarding rules for debugging + * @fwd: Table to print + */ +void fwd_rules_print(const struct fwd_ports *fwd) +{ + unsigned i; + + for (i = 0; i < fwd->count; i++) { + const struct fwd_rule *rule = &fwd->rules[i]; + const char *weak = rule->flags & FWD_WEAK ? " WEAK" : "";
Should we print " might fail" or " can fail" instead of " WEAK"? This is for users.
Good point. THough I'm not sure "might fail" or "can fail" is terribly clear in context either. I've gone with " (best effort)" for now.
Or maybe " (if available)"?
I feel like that's also kind of vague. Available where, exactly?
I always find "best effort" a bit ambiguous because, well, it's a pretty good effort, being it's the best one, but it actually means we'll just give it a quick try,
I'm thinking of it as "best effort" for the rule as a whole: we attempt it for every component port.
once.
That's no longer true after 7/14: we'll call fwd_listen_sync() every second from fwd_scan_ports_timer(). That's intended for the FWD_SCAN rules, but it also means we'll re-attempt listens for any FWD_WEAK ports we previously failed on.
No strong preference though, "best effort" is rather idiomatic anyway.
I'll keep it as-is for now, but I'm open to different wordings if a clearly better one occurs to one of us. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Tue, Jan 13, 2026 at 11:12:26PM +0100, Stefano Brivio wrote:
On Thu, 8 Jan 2026 13:29:46 +1100 David Gibson
wrote: Currently we remap port numbers based on the legacy delta[] array, which is indexed only by original port number, not the listening address. Now that we look up a forwarding rule entry in flow_target(), we can use this entry to directly determine the correct remapped port. Implement this, and remove the old delta[] array.
Link: https://bugs.passt.top/show_bug.cgi?id=187
Signed-off-by: David Gibson
--- flow.c | 7 ++++--- fwd.c | 21 +++++++-------------- fwd.h | 7 +++---- 3 files changed, 14 insertions(+), 21 deletions(-) diff --git a/flow.c b/flow.c index 045e9712..38f88732 100644 --- a/flow.c +++ b/flow.c @@ -493,6 +493,7 @@ struct flowside *flow_target(const struct ctx *c, union flow *flow, struct flow_common *f = &flow->f; const struct flowside *ini = &f->side[INISIDE]; struct flowside *tgt = &f->side[TGTSIDE]; + const struct fwd_rule *rule = NULL;
Coverity Scan complains about two cases where this NULL rule would be passed to NAT functions dereferencing it. They should be both false positives because the rule is always set on pif_is_socket(...), but I wonder how fragile it is.
Ah, yeah. I'm not terribly surprised. I'll see what I can do.
Regardless, it would be nice to avoid adding further warnings, if it's cheap (ASSERT() or check on rule in fwd_nat_from*() functions?). Anyway, here you go:
/home/sbrivio/passt/flow.c:535:3: Type: Explicit null dereferenced (FORWARD_NULL)
/home/sbrivio/passt/flow.c:496:2: 1. assign_zero: Assigning: "rule" = "NULL". /home/sbrivio/passt/flow.c:499:2: 2. path: Condition "flow_new_entry == flow", taking true branch. /home/sbrivio/passt/flow.c:499:2: 3. path: Condition "f->state == FLOW_STATE_INI", taking true branch. /home/sbrivio/passt/flow.c:500:2: 4. path: Condition "f->type == FLOW_TYPE_NONE", taking true branch. /home/sbrivio/passt/flow.c:501:2: 5. path: Condition "f->pif[0] != PIF_NONE", taking true branch. /home/sbrivio/passt/flow.c:501:2: 6. path: Condition "f->pif[1] == PIF_NONE", taking true branch. /home/sbrivio/passt/flow.c:502:2: 7. path: Condition "flow->f.state == FLOW_STATE_INI", taking true branch. /home/sbrivio/passt/flow.c:504:2: 8. path: Condition "pif_is_socket(f->pif[0])", taking false branch. /home/sbrivio/passt/flow.c:528:2: 9. path: Switch case value "PIF_SPLICE". /home/sbrivio/passt/flow.c:535:3: 10. var_deref_model: Passing null pointer "rule" to "fwd_nat_from_splice", which dereferences it. /home/sbrivio/passt/fwd.c:991:2: 10.1. path: Condition "!inany_is_loopback(&ini->eaddr)", taking false branch. /home/sbrivio/passt/fwd.c:991:2: 10.2. path: Condition "!inany_is_loopback(&ini->oaddr)", taking false branch. /home/sbrivio/passt/fwd.c:1008:2: 10.3. path: Condition "proto == IPPROTO_UDP", taking true branch. /home/sbrivio/passt/fwd.c:1012:2: 10.4. dereference: Dereferencing pointer "rule".
/home/sbrivio/passt/flow.c:539:3: Type: Explicit null dereferenced (FORWARD_NULL)
/home/sbrivio/passt/flow.c:496:2: 1. assign_zero: Assigning: "rule" = "NULL". /home/sbrivio/passt/flow.c:499:2: 2. path: Condition "flow_new_entry == flow", taking true branch. /home/sbrivio/passt/flow.c:499:2: 3. path: Condition "f->state == FLOW_STATE_INI", taking true branch. /home/sbrivio/passt/flow.c:500:2: 4. path: Condition "f->type == FLOW_TYPE_NONE", taking true branch. /home/sbrivio/passt/flow.c:501:2: 5. path: Condition "f->pif[0] != PIF_NONE", taking true branch. /home/sbrivio/passt/flow.c:501:2: 6. path: Condition "f->pif[1] == PIF_NONE", taking true branch. /home/sbrivio/passt/flow.c:502:2: 7. path: Condition "flow->f.state == FLOW_STATE_INI", taking true branch. /home/sbrivio/passt/flow.c:504:2: 8. path: Condition "pif_is_socket(f->pif[0])", taking false branch. /home/sbrivio/passt/flow.c:528:2: 9. path: Switch case value "PIF_HOST". /home/sbrivio/passt/flow.c:539:3: 10. var_deref_model: Passing null pointer "rule" to "fwd_nat_from_host", which dereferences it. /home/sbrivio/passt/fwd.c:1069:2: 10.1. dereference: Dereferencing pointer "rule".
uint8_t tgtpif = PIF_NONE;
ASSERT(flow_new_entry == flow && f->state == FLOW_STATE_INI); @@ -514,7 +515,7 @@ struct flowside *flow_target(const struct ctx *c, union flow *flow, else goto nofwd;
- if (!fwd_rule_search(fwd, ini)) { + if (!(rule = fwd_rule_search(fwd, ini))) { /* This shouldn't happen, because if there's no rule for * it we should have no listening socket that would let * us get here @@ -531,11 +532,11 @@ struct flowside *flow_target(const struct ctx *c, union flow *flow, break;
case PIF_SPLICE: - tgtpif = fwd_nat_from_splice(c, proto, ini, tgt); + tgtpif = fwd_nat_from_splice(rule, proto, ini, tgt); break;
case PIF_HOST: - tgtpif = fwd_nat_from_host(c, proto, ini, tgt); + tgtpif = fwd_nat_from_host(c, rule, proto, ini, tgt); fwd_neigh_mac_get(c, &tgt->oaddr, f->tap_omac); break; default: diff --git a/fwd.c b/fwd.c index 633ba5db..7c4575ff 100644 --- a/fwd.c +++ b/fwd.c @@ -405,7 +405,6 @@ void fwd_rule_add(struct fwd_ports *fwd, uint8_t flags, /* Fill in the legacy forwarding data structures to match the table */ if (!(new->flags & FWD_SCAN)) bitmap_set(fwd->map, port); - fwd->delta[port] = new->to - new->first; } }
@@ -978,7 +977,7 @@ uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto,
/** * fwd_nat_from_splice() - Determine to forward a flow from the splice interface - * @c: Execution context + * @rule: Forwarding rule to apply * @proto: Protocol (IP L4 protocol number) * @ini: Flow address information of the initiating side * @tgt: Flow address information on the target side (updated) @@ -986,7 +985,7 @@ uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto, * Return: pif of the target interface to forward the flow to, PIF_NONE if the * flow cannot or should not be forwarded at all. */ -uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto, +uint8_t fwd_nat_from_splice(const struct fwd_rule *rule, uint8_t proto, const struct flowside *ini, struct flowside *tgt) { if (!inany_is_loopback(&ini->eaddr) || @@ -1010,11 +1009,7 @@ uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto, /* But for UDP preserve the source port */ tgt->oport = ini->eport;
- tgt->eport = ini->oport; - if (proto == IPPROTO_TCP) - tgt->eport += c->tcp.fwd_out.delta[tgt->eport]; - else if (proto == IPPROTO_UDP) - tgt->eport += c->udp.fwd_out.delta[tgt->eport]; + tgt->eport = ini->oport - rule->first + rule->to;
This made me notice that the current documentation to struct fwd_rule doesn't really make it clear that n : 1 port mapping rules are not a thing,
Not yet, but that's a pretty obvious extension for later.
so, maybe, back to 2/14, instead of:
* @to: Port number to forward port @first to.
we could have perhaps:
* @to: Target port for @first, port n maps to @to + (n - @first)
?
Seems reasonable, done.
By the way, I find this notation a bit complicated to figure out, I think that:
rule->to + (ini->oport - rule->first)
Good idea, done.
(redundant parentheses included) is actually clearer. Or maybe even with a temporary 'delta' variable.
In both cases: the subtraction now happens in in_port_t, so I guess we should explicitly cast rule->first to int to be strictly correct.
No, it should be correct to do it within a u16. At this point we know that @first <= port <= @last, so the subtraction can't over or underflow. I guess the addition could, strictly speaking. But 16-bit unsigned overflow has the correct semantics for this. I guess that does allow remapping to port 0, which is problematic so maybe we should enforce that (@to + @last - @first) < NUM_PORTS when we add rules.
return PIF_HOST; } @@ -1058,6 +1053,7 @@ bool nat_inbound(const struct ctx *c, const union inany_addr *addr, /** * fwd_nat_from_host() - Determine to forward a flow from the host interface * @c: Execution context + * @rule: Forwarding rule to apply * @proto: Protocol (IP L4 protocol number) * @ini: Flow address information of the initiating side * @tgt: Flow address information on the target side (updated) @@ -1065,15 +1061,12 @@ bool nat_inbound(const struct ctx *c, const union inany_addr *addr, * Return: pif of the target interface to forward the flow to, PIF_NONE if the * flow cannot or should not be forwarded at all. */ -uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, +uint8_t fwd_nat_from_host(const struct ctx *c, + const struct fwd_rule *rule, uint8_t proto, const struct flowside *ini, struct flowside *tgt) { /* Common for spliced and non-spliced cases */ - tgt->eport = ini->oport; - if (proto == IPPROTO_TCP) - tgt->eport += c->tcp.fwd_in.delta[tgt->eport]; - else if (proto == IPPROTO_UDP) - tgt->eport += c->udp.fwd_in.delta[tgt->eport]; + tgt->eport = ini->oport - rule->first + rule->to;
Same as above.
Done.
if (!c->no_splice && inany_is_loopback(&ini->eaddr) && (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { diff --git a/fwd.h b/fwd.h index a10bdbb4..cfe9ed46 100644 --- a/fwd.h +++ b/fwd.h @@ -82,7 +82,6 @@ enum fwd_ports_mode { * @count: Number of forwarding rules * @rules: Array of forwarding rules * @map: Bitmap describing which ports are forwarded - * @delta: Offset between the original destination and mapped port number * @listen_sock_count: Number of entries used in @listen_socks * @listen_socks: Listening sockets for forwarding */ @@ -93,7 +92,6 @@ struct fwd_ports { unsigned count; struct fwd_rule rules[MAX_FWD_RULES]; uint8_t map[PORT_BITMAP_SIZE]; - in_port_t delta[NUM_PORTS]; unsigned listen_sock_count; int listen_socks[MAX_LISTEN_SOCKS]; }; @@ -117,9 +115,10 @@ bool nat_inbound(const struct ctx *c, const union inany_addr *addr, union inany_addr *translated); uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto, const struct flowside *ini, struct flowside *tgt); -uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto, +uint8_t fwd_nat_from_splice(const struct fwd_rule *rule, uint8_t proto, const struct flowside *ini, struct flowside *tgt); -uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, +uint8_t fwd_nat_from_host(const struct ctx *c, + const struct fwd_rule *rule, uint8_t proto, const struct flowside *ini, struct flowside *tgt); void fwd_neigh_table_update(const struct ctx *c, const union inany_addr *addr, const uint8_t *mac, bool permanent);
-- Stefano
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Tue, Jan 13, 2026 at 11:12:35PM +0100, Stefano Brivio wrote:
On Thu, 8 Jan 2026 13:29:47 +1100 David Gibson
wrote: Now that we have a table of all our forwarding rules, every listening socket can be associated with a specific rule. Add an index allowing us to locate that rule from the socket's epoll reference. We don't use it yet, but we'll use it to optimise rule lookup when forwarding new flows.
Signed-off-by: David Gibson
--- fwd.c | 15 ++++++++++----- fwd.h | 5 +++++ tcp.c | 4 +++- tcp.h | 5 ++--- udp.c | 4 +++- udp.h | 5 ++--- 6 files changed, 25 insertions(+), 13 deletions(-) diff --git a/fwd.c b/fwd.c index 7c4575ff..6727d26f 100644 --- a/fwd.c +++ b/fwd.c @@ -474,6 +474,7 @@ void fwd_rules_print(const struct fwd_ports *fwd)
/** fwd_sync_one() - Create or remove listening sockets for a forward entry * @c: Execution context + * @fwd: Forwarding table * @rule: Forwarding rule * @pif: Interface to create listening sockets for * @proto: Protocol to listen for @@ -481,19 +482,23 @@ void fwd_rules_print(const struct fwd_ports *fwd) * * Return: 0 on success, -1 on failure */ -static int fwd_sync_one(const struct ctx *c, const struct fwd_rule *rule, +static int fwd_sync_one(const struct ctx *c, + const struct fwd_ports *fwd, const struct fwd_rule *rule, uint8_t pif, uint8_t proto, const uint8_t *scanmap) { const union inany_addr *addr = fwd_rule_addr(rule); const char *ifname = rule->ifname; bool bound_one = false; - unsigned port; + unsigned port, idx;
ASSERT(pif_is_socket(pif));
if (!*ifname) ifname = NULL;
+ idx = rule - fwd->rules; + ASSERT(idx < MAX_FWD_RULES); + for (port = rule->first; port <= rule->last; port++) { int fd = rule->socks[port - rule->first];
@@ -514,9 +519,9 @@ static int fwd_sync_one(const struct ctx *c, const struct fwd_rule *rule, }
if (proto == IPPROTO_TCP) - fd = tcp_listen(c, pif, addr, ifname, port); + fd = tcp_listen(c, pif, idx, addr, ifname, port); else if (proto == IPPROTO_UDP) - fd = udp_listen(c, pif, addr, ifname, port); + fd = udp_listen(c, pif, idx, addr, ifname, port); else ASSERT(0);
@@ -588,7 +593,7 @@ static int fwd_listen_sync_(void *arg) ns_enter(a->c);
for (i = 0; i < a->fwd->count; i++) { - a->ret = fwd_sync_one(a->c, &a->fwd->rules[i], + a->ret = fwd_sync_one(a->c, a->fwd, &a->fwd->rules[i], a->pif, a->proto, a->fwd->map); if (a->ret < 0) break; diff --git a/fwd.h b/fwd.h index cfe9ed46..435f422a 100644 --- a/fwd.h +++ b/fwd.h @@ -48,14 +48,19 @@ struct fwd_rule { * union fwd_listen_ref - information about a single listening socket * @port: Bound port number of the socket * @pif: pif in which the socket is listening + * @rule: Index of forwarding rule */ union fwd_listen_ref { struct { in_port_t port; uint8_t pif; +#define FWD_RULE_BITS 8 + unsigned rule :FWD_RULE_BITS; }; uint32_t u32; }; +static_assert(sizeof(union fwd_listen_ref) == sizeof(uint32_t));
Why do we need this, specifically?
It goes into the data field of the epoll_ref so it has to be exactly 32-bits. With the bitfields, it's maybe not instantly obvious that the structure isn't larger than that. In particular, this relies on the compiler not inserting padding between @pif and @rule; since alignof(unsigned) == 4, typically, I was concerned it might. Even if that is guaranteed by the C standard, I think it's nicer not to require the reader to know that.
+static_assert(MAX_FWD_RULES <= (1U << FWD_RULE_BITS));
I start wondering if instead of having a 'rule' field supporting 256 rules, with 128 as maximum number of rules, we could just have 256 as maximum number of rules and use the usual MAX_FROM_BITS() macro to keep things simpler.
Good idea, done. Btw, as a later change, I'm considering merging the four forwarding tables into one. If that's done we don't need @pif in the epoll_ref any more (it will be in the rule), and we'll have 16-bits of space if we need to expand the rule table
After all, it's not really rules[] taking space:
Certainly.
struct fwd_ports { enum fwd_ports_mode mode; /* 0 4 */ int scan4; /* 4 4 */ int scan6; /* 8 4 */ unsigned int count; /* 12 4 */ struct fwd_rule rules[128]; /* 16 7168 */ /* --- cacheline 112 boundary (7168 bytes) was 16 bytes ago --- */ uint8_t map[8192]; /* 7184 8192 */ /* --- cacheline 240 boundary (15360 bytes) was 16 bytes ago --- */ unsigned int listen_sock_count; /* 15376 4 */ int listen_socks[196608]; /* 15380 786432 */
/* size: 801816, cachelines: 12529, members: 8 */ /* padding: 4 */ /* last cacheline: 24 bytes */ };
enum fwd_ports_mode { FWD_UNSET = 0, diff --git a/tcp.c b/tcp.c index e9b440da..fc03e38f 100644 --- a/tcp.c +++ b/tcp.c @@ -2672,18 +2672,20 @@ void tcp_sock_handler(const struct ctx *c, union epoll_ref ref, * tcp_listen() - Create listening socket * @c: Execution context * @pif: Interface to open the socket for (PIF_HOST or PIF_SPLICE) + * @rule: Index of relevant forwarding rule * @addr: Pointer to address for binding, NULL for any * @ifname: Name of interface to bind to, NULL for any * @port: Port, host order * * Return: Socket fd on success, negative error code on failure */ -int tcp_listen(const struct ctx *c, uint8_t pif, +int tcp_listen(const struct ctx *c, uint8_t pif, unsigned rule, const union inany_addr *addr, const char *ifname, in_port_t port) { union fwd_listen_ref ref = { .port = port, .pif = pif, + .rule = rule, }; int s;
diff --git a/tcp.h b/tcp.h index 45f97d93..24b90870 100644 --- a/tcp.h +++ b/tcp.h @@ -18,9 +18,8 @@ void tcp_sock_handler(const struct ctx *c, union epoll_ref ref, int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, const void *saddr, const void *daddr, uint32_t flow_lbl, const struct pool *p, int idx, const struct timespec *now); -int tcp_listen(const struct ctx *c, uint8_t pif, - const union inany_addr *addr, const char *ifname, - in_port_t port); +int tcp_listen(const struct ctx *c, uint8_t pif, unsigned rule, + const union inany_addr *addr, const char *ifname, in_port_t port); int tcp_init(struct ctx *c); void tcp_timer(const struct ctx *c, const struct timespec *now); void tcp_defer_handler(struct ctx *c); diff --git a/udp.c b/udp.c index 92a87198..761221f6 100644 --- a/udp.c +++ b/udp.c @@ -1115,18 +1115,20 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif, * udp_listen() - Initialise listening socket for a given port * @c: Execution context * @pif: Interface to open the socket for (PIF_HOST or PIF_SPLICE) + * @rule: Index of relevant forwarding rule * @addr: Pointer to address for binding, NULL if not configured * @ifname: Name of interface to bind to, NULL if not configured * @port: Port, host order * * Return: Socket fd on success, negative error code on failure */ -int udp_listen(const struct ctx *c, uint8_t pif, +int udp_listen(const struct ctx *c, uint8_t pif, unsigned rule, const union inany_addr *addr, const char *ifname, in_port_t port) { union fwd_listen_ref ref = { .pif = pif, .port = port, + .rule = rule, }; int s;
diff --git a/udp.h b/udp.h index 3c6f90a9..2b91d728 100644 --- a/udp.h +++ b/udp.h @@ -14,9 +14,8 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, const void *saddr, const void *daddr, uint8_t ttl, const struct pool *p, int idx, const struct timespec *now); -int udp_listen(const struct ctx *c, uint8_t pif, - const union inany_addr *addr, const char *ifname, - in_port_t port); +int udp_listen(const struct ctx *c, uint8_t pif, unsigned rule, + const union inany_addr *addr, const char *ifname, in_port_t port); int udp_init(struct ctx *c); void udp_update_l2_buf(const unsigned char *eth_d);
-- Stefano
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Tue, Jan 13, 2026 at 11:13:03PM +0100, Stefano Brivio wrote:
On Thu, 8 Jan 2026 13:29:48 +1100 David Gibson
wrote: Now that listening sockets include a reference to the forwarding rule which created them we can, in many cases, avoid a linear search of the forwarding table when we want to find the relevant rule. Instead we can take the rule index from the socket's epoll reference, and use that to immediately find the correct rule.
This is conceptually simple, but requires a moderate amount of plumbing to get the index from the reference through to the rule lookup. We still allow fall back to linear search if we don't have the index, and this may (rarely) be used in the udp_flush_flow() case, where we could get packets for one flow on a different flow's socket, rather than through a listening socket as usual.
Signed-off-by: David Gibson
--- flow.c | 13 +++++++++++-- flow_table.h | 2 +- fwd.c | 3 +-- fwd.h | 1 + icmp.c | 2 +- tcp.c | 4 ++-- udp.c | 14 +++++++++----- udp_flow.c | 14 ++++++++------ udp_flow.h | 2 +- udp_internal.h | 4 ++-- 10 files changed, 37 insertions(+), 22 deletions(-) diff --git a/flow.c b/flow.c index 38f88732..85759970 100644 --- a/flow.c +++ b/flow.c @@ -482,12 +482,13 @@ struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif, * flow_target() - Determine where flow should forward to, and move to TGT * @c: Execution context * @flow: Flow to forward + * @rule_hint: Index of relevant forwarding rule, or -1 if unknown * @proto: Protocol * * Return: pointer to the target flowside information */ struct flowside *flow_target(const struct ctx *c, union flow *flow, - uint8_t proto) + int rule_hint, uint8_t proto) { char estr[INANY_ADDRSTRLEN], ostr[INANY_ADDRSTRLEN]; struct flow_common *f = &flow->f; @@ -515,7 +516,15 @@ struct flowside *flow_target(const struct ctx *c, union flow *flow, else goto nofwd;
- if (!(rule = fwd_rule_search(fwd, ini))) { + if (rule_hint >= 0) { + ASSERT((unsigned)rule_hint < fwd->count); + rule = &fwd->rules[rule_hint]; + if (!fwd_rule_match(rule, ini)) { + flow_dbg(flow, + "Unexpected mismatching forward rule"); + goto nofwd; + } + } else if (!(rule = fwd_rule_search(fwd, ini))) {
*If* we need this fwd_rule_search() / fwd_rule_match() / else if fwd_rule_search() pattern in more places, it might be convenient to hide the complexity by taking 'rule_hint' as argument for fwd_rule_search() perhaps, and then do something like this in fwd_rule_search():
for (i = hint == -1 ? 0 : hint; i < fwd->count; i++) { if (fwd_rule_match(&fwd->rules[i], ini)) return &fwd->rules[i]; if (hint) break; }
return NULL;
but I haven't really thought this through. It makes fwd_rule_search() a bit ugly.
I considered that approach, but decided against it. However, looking again I think that might make dealing with the coverity issue in the earlier patch a bit cleaner, so I'm reconsidering.
/* This shouldn't happen, because if there's no rule for * it we should have no listening socket that would let * us get here diff --git a/flow_table.h b/flow_table.h index 5ee13acc..73de13ba 100644 --- a/flow_table.h +++ b/flow_table.h @@ -207,7 +207,7 @@ const struct flowside *flow_target_af(union flow *flow, uint8_t pif, const void *saddr, in_port_t sport, const void *daddr, in_port_t dport); struct flowside *flow_target(const struct ctx *c, union flow *flow, - uint8_t proto); + int rule_hint, uint8_t proto);
union flow *flow_set_type(union flow *flow, enum flow_type type); #define FLOW_SET_TYPE(flow_, t_, var_) (&flow_set_type((flow_), (t_))->var_) diff --git a/fwd.c b/fwd.c index 6727d26f..250b470c 100644 --- a/fwd.c +++ b/fwd.c @@ -415,8 +415,7 @@ void fwd_rule_add(struct fwd_ports *fwd, uint8_t flags, * * Returns: true if the rule applies to the flow, false otherwise */ -static bool fwd_rule_match(const struct fwd_rule *rule, - const struct flowside *ini) +bool fwd_rule_match(const struct fwd_rule *rule, const struct flowside *ini) { return inany_matches(&ini->oaddr, fwd_rule_addr(rule)) && ini->oport >= rule->first && ini->oport <= rule->last; diff --git a/fwd.h b/fwd.h index 435f422a..2af7791d 100644 --- a/fwd.h +++ b/fwd.h @@ -106,6 +106,7 @@ struct fwd_ports { void fwd_rule_add(struct fwd_ports *fwd, uint8_t flags, const union inany_addr *addr, const char *ifname, in_port_t first, in_port_t last, in_port_t to); +bool fwd_rule_match(const struct fwd_rule *rule, const struct flowside *ini); const struct fwd_rule *fwd_rule_search(const struct fwd_ports *fwd, const struct flowside *ini); void fwd_rules_print(const struct fwd_ports *fwd); diff --git a/icmp.c b/icmp.c index 9564c496..0f4d48bb 100644 --- a/icmp.c +++ b/icmp.c @@ -183,7 +183,7 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, return NULL;
flow_initiate_af(flow, PIF_TAP, af, saddr, id, daddr, id); - if (!(tgt = flow_target(c, flow, proto))) + if (!(tgt = flow_target(c, flow, -1, proto))) goto cancel;
if (flow->f.pif[TGTSIDE] != PIF_HOST) { diff --git a/tcp.c b/tcp.c index fc03e38f..89b22a51 100644 --- a/tcp.c +++ b/tcp.c @@ -1657,7 +1657,7 @@ static void tcp_conn_from_tap(const struct ctx *c, sa_family_t af, ini = flow_initiate_af(flow, PIF_TAP, af, saddr, srcport, daddr, dstport);
- if (!(tgt = flow_target(c, flow, IPPROTO_TCP))) + if (!(tgt = flow_target(c, flow, -1, IPPROTO_TCP))) goto cancel;
if (flow->f.pif[TGTSIDE] != PIF_HOST) { @@ -2496,7 +2496,7 @@ void tcp_listen_handler(const struct ctx *c, union epoll_ref ref, goto cancel; }
- if (!flow_target(c, flow, IPPROTO_TCP)) + if (!flow_target(c, flow, ref.listen.rule, IPPROTO_TCP)) goto cancel;
switch (flow->f.pif[TGTSIDE]) { diff --git a/udp.c b/udp.c index 761221f6..d7e47e52 100644 --- a/udp.c +++ b/udp.c @@ -838,12 +838,13 @@ static void udp_buf_sock_to_tap(const struct ctx *c, int s, int n, * udp_sock_fwd() - Forward datagrams from a possibly unconnected socket * @c: Execution context * @s: Socket to forward from + * @rule_hint: Forwarding rule to use, or -1 if unknown * @frompif: Interface to which @s belongs * @port: Our (local) port number of @s * @now: Current timestamp */ -void udp_sock_fwd(const struct ctx *c, int s, uint8_t frompif, - in_port_t port, const struct timespec *now) +void udp_sock_fwd(const struct ctx *c, int s, int rule_hint, + uint8_t frompif, in_port_t port, const struct timespec *now) { union sockaddr_inany src; union inany_addr dst; @@ -868,7 +869,8 @@ void udp_sock_fwd(const struct ctx *c, int s, uint8_t frompif, continue; }
- tosidx = udp_flow_from_sock(c, frompif, &dst, port, &src, now); + tosidx = udp_flow_from_sock(c, frompif, &dst, port, &src, + rule_hint, now); topif = pif_at_sidx(tosidx);
if (pif_is_socket(topif)) { @@ -910,8 +912,10 @@ void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events, const struct timespec *now) { - if (events & (EPOLLERR | EPOLLIN)) - udp_sock_fwd(c, ref.fd, ref.listen.pif, ref.listen.port, now); + if (events & (EPOLLERR | EPOLLIN)) { + udp_sock_fwd(c, ref.fd, ref.listen.rule, + ref.listen.pif, ref.listen.port, now); + } }
/** diff --git a/udp_flow.c b/udp_flow.c index 8907f2f7..b82c6c06 100644 --- a/udp_flow.c +++ b/udp_flow.c @@ -139,6 +139,7 @@ static int udp_flow_sock(const struct ctx *c, * udp_flow_new() - Common setup for a new UDP flow * @c: Execution context * @flow: Initiated flow + * @rule_hint: Index of forwarding rule, or -1 if unknown * @now: Timestamp * * Return: sidx for the target side of the new UDP flow, or FLOW_SIDX_NONE @@ -147,13 +148,13 @@ static int udp_flow_sock(const struct ctx *c, * #syscalls getsockname */ static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow, - const struct timespec *now) + int rule_hint, const struct timespec *now) { struct udp_flow *uflow = NULL; const struct flowside *tgt; unsigned sidei;
- if (!(tgt = flow_target(c, flow, IPPROTO_UDP))) + if (!(tgt = flow_target(c, flow, rule_hint, IPPROTO_UDP))) goto cancel;
uflow = FLOW_SET_TYPE(flow, FLOW_UDP, udp); @@ -216,6 +217,7 @@ cancel: * @dst: Our (local) address to which the datagram is arriving * @port: Our (local) port number to which the datagram is arriving * @s_in: Source socket address, filled in by recvmmsg() + * @rule_hint: Index of forwarding rule, or -1 if unknown * @now: Timestamp * * #syscalls fcntl arm:fcntl64 ppc64:fcntl64|fcntl i686:fcntl64 @@ -226,7 +228,7 @@ cancel: flow_sidx_t udp_flow_from_sock(const struct ctx *c, uint8_t pif, const union inany_addr *dst, in_port_t port, const union sockaddr_inany *s_in, - const struct timespec *now) + int rule_hint, const struct timespec *now) { const struct flowside *ini; struct udp_flow *uflow; @@ -260,7 +262,7 @@ flow_sidx_t udp_flow_from_sock(const struct ctx *c, uint8_t pif, return FLOW_SIDX_NONE; }
- return udp_flow_new(c, flow, now); + return udp_flow_new(c, flow, rule_hint, now); }
/** @@ -316,7 +318,7 @@ flow_sidx_t udp_flow_from_tap(const struct ctx *c, return FLOW_SIDX_NONE; }
- return udp_flow_new(c, flow, now); + return udp_flow_new(c, flow, -1, now); }
/** @@ -332,7 +334,7 @@ static void udp_flush_flow(const struct ctx *c, { /* We don't know exactly where the datagrams will come from, but we know * they'll have an interface and oport matching this flow */ - udp_sock_fwd(c, uflow->s[sidei], uflow->f.pif[sidei], + udp_sock_fwd(c, -1, uflow->s[sidei], uflow->f.pif[sidei],
I guess this is fixed on your local branch because this patch (and only this one) applied with some fuzz, but in case it's not... this passes -1 as socket ('s' argument), not as 'rule_hint'.
Oops, no. Not sure where the fuzz came from, but this was wrong in my tree..
If I swap it with uflow->s[sidei] it all works as expected.
.. but not any more.
uflow->f.side[sidei].oport, now); }
diff --git a/udp_flow.h b/udp_flow.h index 4c528e95..14e0f920 100644 --- a/udp_flow.h +++ b/udp_flow.h @@ -35,7 +35,7 @@ struct udp_flow *udp_at_sidx(flow_sidx_t sidx); flow_sidx_t udp_flow_from_sock(const struct ctx *c, uint8_t pif, const union inany_addr *dst, in_port_t port, const union sockaddr_inany *s_in, - const struct timespec *now); + int rule_hint, const struct timespec *now); flow_sidx_t udp_flow_from_tap(const struct ctx *c, uint8_t pif, sa_family_t af, const void *saddr, const void *daddr, diff --git a/udp_internal.h b/udp_internal.h index 96d11cff..f0ce8f14 100644 --- a/udp_internal.h +++ b/udp_internal.h @@ -28,7 +28,7 @@ size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp, size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp, const struct flowside *toside, size_t dlen, bool no_udp_csum); -void udp_sock_fwd(const struct ctx *c, int s, uint8_t frompif, - in_port_t port, const struct timespec *now); +void udp_sock_fwd(const struct ctx *c, int s, int rule_hint, + uint8_t frompif, in_port_t port, const struct timespec *now);
#endif /* UDP_INTERNAL_H */
So, uh, having reached this point, I was asking myself "is this really everything"? :)
Well, it's not exactly everything as you mentioned in the cover letter, perhaps the biggest missing part (other than client / interface) is outbound forwarding, but still, reaching this point with just 14 relatively small patches is quite impressive.
I guess it's also merit of how generic / well structured the flow table and especially the flow "stages" are.
Thank you? As also noted in the cover letter, I think this is now complete enough to be interesting. There is a lot still to do though: * Allow configuration of the target address as well as matching address * Allow addresses to be configured for outbound / splice mappings * Allow n to 1 port remapping * Remove port bitmap (or rather make them local variables only) * This is a bit complicated because if we want to implement port scanning for passt, we might need it to be persistent again * Allow "auto" mappings to be configured * Allow "auto" mappings to be for specific bind addresses * Outbound forwarding table * Configuring specific NATs * ? Merging separate tcp/udp in/out tables into one * Remove global forwarding mode (other than as a local during conf) * Probably a bunch more I'm not remembering And, of course, dynamic updates. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
participants (3)
-
David Gibson
-
Laurent Vivier
-
Stefano Brivio