Re: [PATCH v2 16/22] tcp_splice: Improve error reporting on connect path

19 Feb 2024

On Sun, Feb 18, 2024 at 10:01:24PM +0100, Stefano Brivio wrote:
...
On Tue,  6 Feb 2024 12:17:28 +1100
David Gibson  wrote:
...
This makes a number of changes to improve error reporting while connecting
a new spliced socket:
 * We use flow_err() and similar functions so all messages include info
   on which specific flow was affected
 * We use strerror() to interpret raw error values
 * We now report errors on connection (at "trace" level, since this would
   allow spamming the logs)
 * We also look up and report some details on EPOLLERR events, which can
   include connection errors, since we use a non-blocking connect().  Again
   we use "trace" level since this can spam the logs.
Signed-off-by: David Gibson 
---
 tcp_splice.c | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/tcp_splice.c b/tcp_splice.c
index 5ba9c8ea..49075e5c 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -349,8 +349,9 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn,
      ASSERT(0);
if (conn->s[1] < 0) {
-		warn("Couldn't open connectable socket for splice (%d)",
-		     conn->s[1]);
+		flow_err(conn,
+			 "Couldn't open connectable socket for splice: %s",
+			 strerror(-conn->s[1]));
It took me a bit to convince myself that we actually store negative
error codes in pools, which sounds like a neat idea, except for two
things:
Yeah, it took me a while to discover that too :)
...
- in tcp_sock_refill_pool(), once we get an error from
  tcp_conn_new_sock(), should we really continue to call it and try to
  get more sockets?
Presumably, that will have something to do with some kind of resource
  exhaustion, so perhaps we shouldn't risk making it worse and just
  stop refilling the pool. If we do, we might have up to one negative
  error code in the pool, I guess.
Some sort of exhaustion is certainly the most likely cause, yes.
...
- if we have an error code in the pool, that error might have occurred
  a while ago, but here, we're logging it at the moment we need that
  socket.
Fair point.
...
Maybe it would be simpler and saner to just have -1 values in the pool
(error or no socket available) and report errors in tcp_conn_new_sock()
instead?
Yeah, that makes sense.  I'll rework to that goal.
...
I'm still reviewing patches 17/22 to 22/22, sorry for the delay.
-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson