As discussed on IRC through the last day, here's a more polished version of possible fixes for bug 94 (EPOLLRDHUP storm) and bug 95 (EPOLLERR storm). Both of those were sub-problems arising while investigating podman bug 23686. We're pretty confident about the EPOLLRDHUP fix (bug 94, patch 1/6), based on Stefano's testing. I ended up rewriting Stefano's draft patch for the EPOLLERR case (bug 95, remaining patches), because I thought of a possibility we hadn't discussed yet: we weren't getting an error from the socket error queue, but we might be able to get one with the SO_ERROR getsockopt(). My examination of the kernel code suggests that's plausible, and that if that's the case using SO_ERROR should also clear that error condition. Link: https://bugs.passt.top/show_bug.cgi?id=94 Link: https://bugs.passt.top/show_bug.cgi?id=95 Link: https://github.com/containers/podman/issues/23686 David Gibson (6): flow: Fix incorrect hash probe in flowside_lookup() udp: Allow UDP flows to be prematurely closed flow: Helpers to log details of a flow udp: Split socket error handling out from udp_sock_recv() udp: Treat errors getting errors as unrecoverable udp: Handle more error conditions in udp_sock_errs() flow.c | 53 ++++++++++++++++++++------------ flow.h | 7 +++++ udp.c | 88 +++++++++++++++++++++++++++++++++++++++++++++--------- udp_flow.c | 18 ++++++++++- udp_flow.h | 4 +++ 5 files changed, 136 insertions(+), 34 deletions(-) -- 2.46.0
Our flow hash table uses linear probing in which we step backwards through clusters of adjacent hash entries when we have near collisions. Usually that's implemented by flow_hash_probe(). However, due to some details we need a second implementation in flowside_lookup(). An embarrassing oversight in rebasing from earlier versions has mean that version is incorrect, trying to step forward through clusters rather than backward. In situations with the right sorts of has near-collisions this can lead to us not associating an ACK from the tap device with the right flow, leaving it in a not-quite-established state. If the remote peer does a shutdown() at the right time, this can lead to a storm of EPOLLRDHUP events causing high CPU load. Fixes: acca4235c ("flow, tcp: Generalise TCP hash table to general...") Link: https://bugs.passt.top/show_bug.cgi?id=94 Suggested-by: Stefano Brivio <sbrivio(a)redhat.com> Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/flow.c b/flow.c index 02631eb2..a00e01d1 100644 --- a/flow.c +++ b/flow.c @@ -697,7 +697,7 @@ static flow_sidx_t flowside_lookup(const struct ctx *c, uint8_t proto, !(FLOW_PROTO(&flow->f) == proto && flow->f.pif[sidx.sidei] == pif && flowside_eq(&flow->f.side[sidx.sidei], side))) - b = (b + 1) % FLOW_HASH_SIZE; + b = mod_sub(b, 1, FLOW_HASH_SIZE); return flow_hashtab[b]; } -- 2.46.0
Unlike TCP, UDP has no in-band signalling for the end of a flow. So the only way we remove flows is on a timer if they have no activity for 180s. However, we've started to investigate some error conditions in which we want to prematurely abort / abandon a UDP flow. We can call udp_flow_close(), which will make the flow inert (sockets closed, no epoll events, can't be looked up in hash). However it will still wait 3 minutes to clear away the stale entry. Clean this up by adding an explicit 'closed' flag which will cause a flow to be more promptly cleaned up. We also publish udp_flow_close() so it can be called from other places to abort UDP flows(). Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 3 ++- udp_flow.c | 18 +++++++++++++++++- udp_flow.h | 4 ++++ 3 files changed, 23 insertions(+), 2 deletions(-) diff --git a/flow.c b/flow.c index a00e01d1..f2de0414 100644 --- a/flow.c +++ b/flow.c @@ -832,7 +832,8 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now) closed = icmp_ping_timer(c, &flow->ping, now); break; case FLOW_UDP: - if (timer) + closed = udp_flow_defer(&flow->udp); + if (!closed && timer) closed = udp_flow_timer(c, &flow->udp, now); break; default: diff --git a/udp_flow.c b/udp_flow.c index 1ff59a9b..b81be2cf 100644 --- a/udp_flow.c +++ b/udp_flow.c @@ -39,8 +39,11 @@ struct udp_flow *udp_at_sidx(flow_sidx_t sidx) * @c: Execution context * @uflow: UDP flow */ -static void udp_flow_close(const struct ctx *c, struct udp_flow *uflow) +void udp_flow_close(const struct ctx *c, struct udp_flow *uflow) { + if (uflow->closed) + return; /* Nothing to do */ + if (uflow->s[INISIDE] >= 0) { /* The listening socket needs to stay in epoll */ close(uflow->s[INISIDE]); @@ -56,6 +59,8 @@ static void udp_flow_close(const struct ctx *c, struct udp_flow *uflow) flow_hash_remove(c, FLOW_SIDX(uflow, INISIDE)); if (!pif_is_socket(uflow->f.pif[TGTSIDE])) flow_hash_remove(c, FLOW_SIDX(uflow, TGTSIDE)); + + uflow->closed = true; } /** @@ -256,6 +261,17 @@ flow_sidx_t udp_flow_from_tap(const struct ctx *c, return udp_flow_new(c, flow, -1, now); } +/** + * udp_flow_defer() - Deferred per-flow handling (clean up aborted flows) + * @uflow: Flow to handle + * + * Return: true if the connection is ready to free, false otherwise + */ +bool udp_flow_defer(const struct udp_flow *uflow) +{ + return uflow->closed; +} + /** * udp_flow_timer() - Handler for timed events related to a given flow * @c: Execution context diff --git a/udp_flow.h b/udp_flow.h index 12ddf039..9a1b059c 100644 --- a/udp_flow.h +++ b/udp_flow.h @@ -10,6 +10,7 @@ /** * struct udp - Descriptor for a flow of UDP packets * @f: Generic flow information + * @closed: Flow is already closed * @ts: Activity timestamp * @s: Socket fd (or -1) for each side of the flow */ @@ -17,6 +18,7 @@ struct udp_flow { /* Must be first element */ struct flow_common f; + bool closed :1; time_t ts; int s[SIDES]; }; @@ -30,6 +32,8 @@ flow_sidx_t udp_flow_from_tap(const struct ctx *c, const void *saddr, const void *daddr, in_port_t srcport, in_port_t dstport, const struct timespec *now); +void udp_flow_close(const struct ctx *c, struct udp_flow *uflow); +bool udp_flow_defer(const struct udp_flow *uflow); bool udp_flow_timer(const struct ctx *c, struct udp_flow *uflow, const struct timespec *now); -- 2.46.0
The details of a flow - endpoints, interfaces etc. - can be pretty important for debugging. We log this on flow state transitions, but it can also be useful to log this when we report specific conditions. Add some helper functions and macros to make it easy to do that. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 48 +++++++++++++++++++++++++++++++----------------- flow.h | 7 +++++++ 2 files changed, 38 insertions(+), 17 deletions(-) diff --git a/flow.c b/flow.c index f2de0414..1ea112b4 100644 --- a/flow.c +++ b/flow.c @@ -283,28 +283,23 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) "Flow %u (%s): %s", flow_idx(f), type_or_state, msg); } -/** - * flow_set_state() - Change flow's state - * @f: Flow changing state - * @state: New state +/** flow_log_details_() - Log the details of a flow + * @f: flow to log + * @pri: Log priority + * @state: State to log details according to + * + * Logs the details of the flow: endpoints, interfaces, type etc. */ -static void flow_set_state(struct flow_common *f, enum flow_state state) +void flow_log_details_(const struct flow_common *f, int pri, + enum flow_state state) { char estr0[INANY_ADDRSTRLEN], fstr0[INANY_ADDRSTRLEN]; char estr1[INANY_ADDRSTRLEN], fstr1[INANY_ADDRSTRLEN]; const struct flowside *ini = &f->side[INISIDE]; const struct flowside *tgt = &f->side[TGTSIDE]; - uint8_t oldstate = f->state; - ASSERT(state < FLOW_NUM_STATES); - ASSERT(oldstate < FLOW_NUM_STATES); - - f->state = state; - flow_log_(f, LOG_DEBUG, "%s -> %s", flow_state_str[oldstate], - FLOW_STATE(f)); - - if (MAX(state, oldstate) >= FLOW_STATE_TGT) - flow_log_(f, LOG_DEBUG, + if (state >= FLOW_STATE_TGT) + flow_log_(f, pri, "%s [%s]:%hu -> [%s]:%hu => %s [%s]:%hu -> [%s]:%hu", pif_name(f->pif[INISIDE]), inany_ntop(&ini->eaddr, estr0, sizeof(estr0)), @@ -316,8 +311,8 @@ static void flow_set_state(struct flow_common *f, enum flow_state state) tgt->oport, inany_ntop(&tgt->eaddr, estr1, sizeof(estr1)), tgt->eport); - else if (MAX(state, oldstate) >= FLOW_STATE_INI) - flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => ?", + else if (state >= FLOW_STATE_INI) + flow_log_(f, pri, "%s [%s]:%hu -> [%s]:%hu => ?", pif_name(f->pif[INISIDE]), inany_ntop(&ini->eaddr, estr0, sizeof(estr0)), ini->eport, @@ -325,6 +320,25 @@ static void flow_set_state(struct flow_common *f, enum flow_state state) ini->oport); } +/** + * flow_set_state() - Change flow's state + * @f: Flow changing state + * @state: New state + */ +static void flow_set_state(struct flow_common *f, enum flow_state state) +{ + uint8_t oldstate = f->state; + + ASSERT(state < FLOW_NUM_STATES); + ASSERT(oldstate < FLOW_NUM_STATES); + + f->state = state; + flow_log_(f, LOG_DEBUG, "%s -> %s", flow_state_str[oldstate], + FLOW_STATE(f)); + + flow_log_details_(f, LOG_DEBUG, MAX(state, oldstate)); +} + /** * flow_initiate_() - Move flow to INI, setting pif[INISIDE] * @flow: Flow to change state diff --git a/flow.h b/flow.h index d167b654..24ba3ef0 100644 --- a/flow.h +++ b/flow.h @@ -264,4 +264,11 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) flow_dbg((f), __VA_ARGS__); \ } while (0) +void flow_log_details_(const struct flow_common *f, int pri, + enum flow_state state); +#define flow_log_details(f_, pri) \ + flow_log_details_(&((f_)->f), (pri), (f_)->f.state) +#define flow_dbg_details(f_) flow_log_details((f_), LOG_DEBUG) +#define flow_err_details(f_) flow_log_details((f_), LOG_ERR) + #endif /* FLOW_H */ -- 2.46.0
Currently udp_sock_recv() both attempts to clear socket errors and read a batch of datagrams for forwarding. That made sense initially, since both listening and reply sockets need to do this. However, we have certain error cases which will add additional complexity to the error processing. Furthermore, if we ever wanted to more thoroughly handle errors received here - e.g. by synthesising ICMP messages on the tap device - it will likely require different handling for the listening and reply socket cases. So, split handling of error events into its own udp_sock_errs() function. While we're there, allow it to report "unrecoverable errors". We don't have any of these so far, but some cases we're working on might require it. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- udp.c | 46 ++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 40 insertions(+), 6 deletions(-) diff --git a/udp.c b/udp.c index aae6d142..fd91b539 100644 --- a/udp.c +++ b/udp.c @@ -436,6 +436,30 @@ static bool udp_sock_recverr(int s) return true; } +/** + * udp_sock_errs() - Process errors on a socket + * @c: Execution context + * @s: Socket to receive from + * @events: epoll events bitmap + * + * Return: Number of errors handled, or < 0 if we have an unrecoverable error + */ +static int udp_sock_errs(const struct ctx *c, int s, uint32_t events) +{ + unsigned n_err = 0; + + ASSERT(!c->no_udp); + + if (!(events & EPOLLERR)) + return 0; /* Nothing to do */ + + /* Empty the error queue */ + while (udp_sock_recverr(s)) + n_err++; + + return n_err; +} + /** * udp_sock_recv() - Receive datagrams from a socket * @c: Execution context @@ -443,6 +467,8 @@ static bool udp_sock_recverr(int s) * @events: epoll events bitmap * @mmh mmsghdr array to receive into * + * Return: Number of datagrams received + * * #syscalls recvmmsg arm:recvmmsg_time64 i686:recvmmsg_time64 */ static int udp_sock_recv(const struct ctx *c, int s, uint32_t events, @@ -459,12 +485,6 @@ static int udp_sock_recv(const struct ctx *c, int s, uint32_t events, ASSERT(!c->no_udp); - /* Clear any errors first */ - if (events & EPOLLERR) { - while (udp_sock_recverr(s)) - ; - } - if (!(events & EPOLLIN)) return 0; @@ -492,6 +512,13 @@ void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref, const socklen_t sasize = sizeof(udp_meta[0].s_in); int n, i; + if (udp_sock_errs(c, ref.fd, events) < 0) { + err("UDP: Unrecoverable error on listening socket:" + " (%s port %hu)", pif_name(ref.udp.pif), ref.udp.port); + /* FIXME: what now? close/re-open socket? */ + return; + } + if ((n = udp_sock_recv(c, ref.fd, events, udp_mh_recv)) <= 0) return; @@ -566,6 +593,13 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref, ASSERT(!c->no_udp && uflow); + if (udp_sock_errs(c, from_s, events) < 0) { + flow_err(uflow, "Unrecoverable error on reply socket"); + flow_err_details(uflow); + udp_flow_close(c, uflow); + return; + } + if ((n = udp_sock_recv(c, from_s, events, udp_mh_recv)) <= 0) return; -- 2.46.0
We can get network errors, usually transient, reported via the socket error queue. However, at least theoretically, we could get errors trying to read the queue itself. Since we have no idea how to clear an error condition in that case, treat it as unrecoverable. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- udp.c | 27 +++++++++++++++++---------- 1 file changed, 17 insertions(+), 10 deletions(-) diff --git a/udp.c b/udp.c index fd91b539..fe5df995 100644 --- a/udp.c +++ b/udp.c @@ -387,11 +387,12 @@ static void udp_tap_prepare(const struct mmsghdr *mmh, unsigned idx, * udp_sock_recverr() - Receive and clear an error from a socket * @s: Socket to receive from * - * Return: true if errors received and processed, false if no more errors + * Return: 1 if error received and processed, 0 if no more errors in queue, < 0 + * if there was an error reading the queue * * #syscalls recvmsg */ -static bool udp_sock_recverr(int s) +static int udp_sock_recverr(int s) { const struct sock_extended_err *ee; const struct cmsghdr *hdr; @@ -408,14 +409,16 @@ static bool udp_sock_recverr(int s) rc = recvmsg(s, &mh, MSG_ERRQUEUE); if (rc < 0) { - if (errno != EAGAIN && errno != EWOULDBLOCK) - err_perror("Failed to read error queue"); - return false; + if (errno == EAGAIN || errno == EWOULDBLOCK) + return 0; + + err_perror("UDP: Failed to read error queue"); + return -1; } if (!(mh.msg_flags & MSG_ERRQUEUE)) { err("Missing MSG_ERRQUEUE flag reading error queue"); - return false; + return -1; } hdr = CMSG_FIRSTHDR(&mh); @@ -424,7 +427,7 @@ static bool udp_sock_recverr(int s) (hdr->cmsg_level == IPPROTO_IPV6 && hdr->cmsg_type == IPV6_RECVERR))) { err("Unexpected cmsg reading error queue"); - return false; + return -1; } ee = (const struct sock_extended_err *)CMSG_DATA(hdr); @@ -433,7 +436,7 @@ static bool udp_sock_recverr(int s) debug("%s error on UDP socket %i: %s", str_ee_origin(ee), s, strerror(ee->ee_errno)); - return true; + return 1; } /** @@ -447,6 +450,7 @@ static bool udp_sock_recverr(int s) static int udp_sock_errs(const struct ctx *c, int s, uint32_t events) { unsigned n_err = 0; + int rc; ASSERT(!c->no_udp); @@ -454,8 +458,11 @@ static int udp_sock_errs(const struct ctx *c, int s, uint32_t events) return 0; /* Nothing to do */ /* Empty the error queue */ - while (udp_sock_recverr(s)) - n_err++; + while ((rc = udp_sock_recverr(s)) > 0) + n_err += rc; + + if (rc < 0) + return -1; /* error reading error, unrecoverable */ return n_err; } -- 2.46.0
udp_sock_errs() reads out everything in the socket error queue. However we've seen some cases[0] where an EPOLLERR event is active, but there isn't anything in the queue. One possibility is that the error is reported instead by the SO_ERROR sockopt. Check for that case and report it as best we can. If we still get an EPOLLERR without visible error, we have no way to clear the error state, so treat it as an unrecoverable error. [0] https://github.com/containers/podman/issues/23686#issuecomment-2324945010 Link: https://bugs.passt.top/show_bug.cgi?id=95 Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- udp.c | 21 ++++++++++++++++++++- 1 file changed, 20 insertions(+), 1 deletion(-) diff --git a/udp.c b/udp.c index fe5df995..2ba00c9c 100644 --- a/udp.c +++ b/udp.c @@ -450,7 +450,8 @@ static int udp_sock_recverr(int s) static int udp_sock_errs(const struct ctx *c, int s, uint32_t events) { unsigned n_err = 0; - int rc; + socklen_t errlen; + int rc, err; ASSERT(!c->no_udp); @@ -464,6 +465,24 @@ static int udp_sock_errs(const struct ctx *c, int s, uint32_t events) if (rc < 0) return -1; /* error reading error, unrecoverable */ + errlen = sizeof(err); + if (getsockopt(s, SOL_SOCKET, SO_ERROR, &err, &errlen) < 0 || + errlen != sizeof(err)) { + err_perror("Error reading SO_ERROR"); + return -1; /* error reading error, unrecoverable */ + } + + if (err) { + debug("Unqueued error on UDP socket %i: %s", s, strerror(err)); + n_err++; + } + + if (!n_err) { + /* EPOLLERR, but no errors to clear !? */ + err("EPOLLERR event without reported errors on socket %i", s); + return -1; /* no way to clear, unrecoverable */ + } + return n_err; } -- 2.46.0
On Fri, 6 Sep 2024 15:17:04 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:As discussed on IRC through the last day, here's a more polished version of possible fixes for bug 94 (EPOLLRDHUP storm) and bug 95 (EPOLLERR storm). Both of those were sub-problems arising while investigating podman bug 23686. We're pretty confident about the EPOLLRDHUP fix (bug 94, patch 1/6), based on Stefano's testing. I ended up rewriting Stefano's draft patch for the EPOLLERR case (bug 95, remaining patches), because I thought of a possibility we hadn't discussed yet: we weren't getting an error from the socket error queue, but we might be able to get one with the SO_ERROR getsockopt(). My examination of the kernel code suggests that's plausible, and that if that's the case using SO_ERROR should also clear that error condition. Link: https://bugs.passt.top/show_bug.cgi?id=94 Link: https://bugs.passt.top/show_bug.cgi?id=95 Link: https://github.com/containers/podman/issues/23686Applied. -- Stefano