Re: [PATCH] udp: create and send ICMPv4 to local peer when applicable

13 Feb 2025

On 2025-02-12 16:39, Jon Maloy wrote:
...
On 2025-02-10 19:55, David Gibson wrote:
...
On Sun, Feb 09, 2025 at 12:00:56PM -0500, Jon Maloy wrote:
...
When a local peer sends a UDP message to a non-existing port on an
existing remote host, that host will return an ICMP message containing
the error code ICMP_PORT_UNREACH, plus the header and the first eight
bytes of the original message. If the sender socket has been connected,
it uses this message to issue a "Connection Refused" event to the user.
[...]
...
...
+void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,
+            struct in_addr dst, size_t l4len, uint8_t proto)
As discussed on our call, it probably makes sense to alter this to set
the Don't Fragment bit always.
I have been studying the Linux code, but this is handled in so many 
places that it it hard to get a full grip on it. Wireshark logs seem
to indicate that it is always set, and googling a bit on the subject
indicates the same. I think it is safe to do it, so if you agree I
can post a prepending patch setting this.
...
...
  {
      uint16_t l3len = l4len + sizeof(*ip4h);

diff --git a/tap.h b/tap.h
index dfbd8b9..3ba00c1 100644
--- a/tap.h
+++ b/tap.h
@@ -44,6 +44,8 @@ static inline void tap_hdr_update(struct tap_hdr 
*thdr, size_t l2len)
          thdr->vnet_len = htonl(l2len);
  }
[...]
...
...
+ * udp_send_conn_fail_icmp4() - Construct and send ICMP to local peer
+ * @c:              Execution context
+ * @ee:      Extended error descriptor
+ * @ref:      epoll reference
+ * @iov:      First bytes (max 8) of original UDP message body
I don't love passing this as an iov, because that tends to imply there
might be multiple buffers, which is not the case here (both the caller
and callee require a single buffer).
The real reason is that csum_udp4() requires an iov, so if we don't
pass it here we will have to recreate it inside the new function.
...
...
+ */
+static void udp_send_conn_fail_icmp4(const struct ctx *c,
+                     const struct sock_extended_err *ee,
+                     union epoll_ref ref,
+                     struct iovec *iov)
...
+{
+    flow_sidx_t tosidx = flow_sidx_opposite(ref.flowside);
This is subtly, but badly wrong.  ref.flowside is only valid if the
ref is of a type which uses the flowside field.  That's true for UDP
reply sockets, but *not* for UDP listening sockets, and
udp_sock_errs() is called on both.  I think this function needs to
take an explicit flowside.
Yes, I was a little uncertain about this. Can you give me small code
snippet of how this should be done?
...
...
+    const struct flowside *toside = flowside_at_sidx(tosidx);
+    struct in_addr oaddr = toside->oaddr.v4mapped.a4;
+    struct in_addr eaddr = toside->eaddr.v4mapped.a4;
+    struct iov_tail data = IOV_TAIL(iov, 1, 0);
+    struct {
+        struct icmphdr icmp4h;
+        struct iphdr ip4h;
+        struct udphdr uh;
+        char data[8];
+    } msg;
Needs a ((packed)) attribute to ensure we don't get compiler padding.
I get
udp.c:437:23: warning: taking address of packed member of ‘struct 
<anonymous>’ may result in an unaligned pointer value 
[-Waddress-of-packed-member]
   437 |         tap_push_ip4h(&msg.ip4h, eaddr, oaddr, udplen, 
IPPROTO_UDP);

Also, the sizes of these members are 8+20+8+8, so I cannot see how
this struct can possibly be packed or how there can be an unaligned
pointer.
In short: ((packed)) seems unnecessary, and only causes problems.

/jon
...
...
ok.
...
It took me a minute to work out what was going on here.  Specifically
that ip4h and uh here aren't the headers of the packet we're sending
now, but the reconstructed headers of the packet that prompted the
ICMP error.
Yes, I try to re-create the original message as closely as possible.
...
...
+    size_t udplen = sizeof(msg.uh) + iov->iov_len;
+    size_t msglen = sizeof(msg.icmp4h) + sizeof(msg.ip4h) + udplen;
+
+    memset(&msg, 0, sizeof(msg));
+    msg.icmp4h.type = ee->ee_type;
+    msg.icmp4h.code = ee->ee_code;
+    tap_push_ip4h(&msg.ip4h, eaddr, oaddr, udplen, IPPROTO_UDP);
This isn't quite a natural fit.  It does work, but the tap_push_()
functions are kind of designed to work with a byte buffer where we
only work out the positions of headers as we construct them.  Probably
a clean up for some other day.
...
+    msg.uh.source = htons(toside->eport);
+    msg.uh.dest = htons(toside->oport);
+    msg.uh.len = htons(udplen);
+    csum_udp4(&msg.uh, oaddr, eaddr, &data);
It might make sense to split a tap_push_uh4() function out from
tap_udp4_send() which you could re-use here.
Good point.
...
...
+    memcpy(&msg.data, iov->iov_base, iov->iov_len);
+    tap_icmp4_send(c, oaddr, eaddr, &msg, msglen);
+}
+
  /**
   * udp_sock_recverr() - Receive and clear an error from a socket
- * @s:        Socket to receive from
+ * @c:              Execution context
+ * @ref:      epoll reference
   *
   * Return: 1 if error received and processed, 0 if no more errors 
in queue, < 0
   *         if there was an error reading the queue
   *
   * #syscalls recvmsg
   */
-static int udp_sock_recverr(int s)
+static int udp_sock_recverr(const struct ctx *c, union epoll_ref ref)
Similarly udp_sock_recverr() is used both on "listening" and reply
sockets, so it's not really safe to use the ref here.  You can
explicitly pass the fd easily enough.  The flow is trickier...
I can pass *both* the fd and ref, alternativedly the flowside, but
it looks redundant and adds extra code.
If I add a flowside struct instead of ref I will need to declare
that at the all the multiple locations where udp_sock_errs().
Do you have a good suggestion?
...
...
  {
      const struct sock_extended_err *ee;
      const struct cmsghdr *hdr;
      char buf[CMSG_SPACE(sizeof(*ee))];
+    char udp_data[8];
+    int s = ref.fd;
+    struct iovec iov = {
+        .iov_base = udp_data,
+        .iov_len = sizeof(udp_data)
[...]
...
...
      if (rc < 0)
@@ -558,7 +607,7 @@ static void udp_buf_listen_sock_handler(const 
struct ctx *c,
      const socklen_t sasize = sizeof(udp_meta[0].s_in);
      int n, i;
-    if (udp_sock_errs(c, ref.fd, events) < 0) {
+    if (udp_sock_errs(c, ref, events) < 0) {
... because in the listening socket case we don't know the flow until
we've looked at the packet.  I'd be tempted as a first cut to only do
the ICMP thing for errors on reply sockets.
I can try that, but it would feel a little reduced.
...
Listening sockets will be trickier.  I think we'll have to actually
check msg_name in the received error in order to work out which flow
it belongs to.
Actually, I did that first, until I realized I could extract the source
IP address from the flow object. But how does that help?
///jon