On Fri, 15 Mar 2024 15:52:50 +0100 Stefano Brivio <sbrivio(a)redhat.com> wrote:On Fri, 15 Mar 2024 14:11:40 +0100 Paul Holzinger <pholzing(a)redhat.com> wrote:[{nla_len=8, nla_type=RTA_PRIORITY}, 100], [{nla_len=8, nla_type=RTA_PREFSRC}, inet_addr("88.198.0.164")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("eth0")]]], [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=13, nlmsg_pid=28341}, 0]], 65536, 0, NULL, NULL) = 156On 15/03/2024 12:24, Stefano Brivio wrote:I can reproduce this, it looks rather weird at the moment. Let's say I have two IPv4 routes on the host: $ ip ro sh default via 88.198.0.161 dev eth0 proto dhcp src 88.198.0.164 metric 100 88.198.0.160/27 dev eth0 proto kernel scope link src 88.198.0.164 metric 100 if I stop pasta after it re-creates the default route (and not the subnet one, yet), say, at this point (the message with padding confuses strace so we don't get the full decoding, but it's valid): recvfrom(5, [[{nlmsg_len=68, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=13, nlmsg_pid=28341}, {rtm_family=AF_INET, rtm_dst_len=0, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_DHCP, rtm_scope=RT_SCOPE_UNIVERSE, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=8, nla_type=RTA_PRIORITY}, 100], [{nla_len=8, nla_type=RTA_PREFSRC}, inet_addr("88.198.0.164")], [{nla_len=8, nla_type=RTA_GATEWAY}, inet_addr("88.198.0.161")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("eth0")]]], [{nlmsg_len=68, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=13, nlmsg_pid=28341}, {rtm_family=AF_INET, rtm_dst_len=27, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_KERNEL, rtm_scope=RT_SCOPE_LINK, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=8, nla_type=RTA_DST}, inet_addr("88.198.0.160")],Martin reports that, with Fedora Linux kernel version kernel-core-6.9.0-0.rc0.20240313gitb0546776ad3f.4.fc41.x86_64, including commit 87d381973e49 ("genetlink: fit NLMSG_DONE into same read() as families"), pasta doesn't exit once the network namespace is gone. Actually, pasta is completely non-functional, at least with default options, because nl_route_dup(), which duplicates routes from the parent namespace into the target namespace at start-up, is stuck on a second receive operation for RTM_GETROUTE. However, with that commit, the kernel is now able to fit the whole response, including the NLMSG_DONE message, into a single datagram, so no further messages will be received. It turns out that commit 4d6e9d0816e2 ("netlink: Always process all responses to a netlink request") accidentally relied on the fact that we would always get at least two datagrams as a response to RTM_GETROUTE. That is, the test to check if we expect another datagram, is based on the 'status' variable, which is 0 if we just parsed NLMSG_DONE, but we'll also expect another datagram if NLMSG_OK on the last message is false. But NLMSG_OK with a zero length is always false. The problem is that we don't distinguish if status is zero because we got a NLMSG_DONE message, or because we processed all the available datagram bytes. Introduce an explicit check on NLMSG_DONE. We should probably refactor this slightly, for example by introducing a special return code from nl_status(), but this is probably the least invasive fix for the issue at hand. Reported-by: Martin Pitt <mpitt(a)redhat.com> Link: https://github.com/containers/podman/issues/22052 Fixes: 4d6e9d0816e2 ("netlink: Always process all responses to a netlink request") Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> --- netlink.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/netlink.c b/netlink.c index 9e7cccb..20de9b3 100644 --- a/netlink.c +++ b/netlink.c @@ -525,7 +525,8 @@ int nl_route_dup(int s_src, unsigned int ifi_src, } } - if (!NLMSG_OK(nh, status) || status > 0) { + if (nh->nlmsg_type != NLMSG_DONE && + (!NLMSG_OK(nh, status) || status > 0)) { /* Process any remaining datagrams in a different * buffer so we don't overwrite the first one. */I was about to add my tested-by when I noticed a weird thing, but that happens only on the new kernel as well: On the host $ ip route default via 192.168.122.1 dev enp1s0 proto dhcp src 192.168.122.92 metric 100 192.168.122.0/24 dev enp1s0 proto kernel scope link src 192.168.122.92 metric 100 ./pasta --config-net ip route default via 192.168.122.1 dev enp1s0 proto dhcp metric 100 192.168.122.0/24 dev enp1s0 proto kernel scope link src 192.168.122.92 192.168.122.0/24 dev enp1s0 proto kernel scope link metric 100 It seems we now have the same local route duplicated for some reason?sendto(7, [{nlmsg_len=68, nlmsg_type=0x18 /* NLMSG_??? */, nlmsg_flags=NLM_F_REQUEST|NLM_F_MULTI|NLM_F_ACK|0x400, nlmsg_seq=14, nlmsg_pid=0}, "\x02\x00\x00\x00\xfe\x10\x00\x01\x00\x00\x00\x00\x08\x00\x0f\x00\xfe\x00\x00\x00\x08\x00\x06\x00\x64\x00\x00\x00\x08\x00\x00\x00"...], 68, 0, NULL, 0) = 68 I see that the kernel already created both in the target namespace: # ip ro sh default via 88.198.0.161 dev eth0 proto dhcp metric 100 88.198.0.160/27 dev eth0 proto kernel scope link src 88.198.0.164Ah, no, the issue is pre-existing, you can actually see the same on an older kernel, without this patch. The subnet route is added autonomously by the kernel after we add the address via RTM_NEWADDR with a ifa_prefixlen < 32 attribute. Then we add it "again" via RTM_NEWROUTE. We could probably avoid the creation of any route on RTM_NEWADDR giving 32 or 128 as prefix length, because anyway we'll explicitly copy the subnet route later, but that needs a bit more pondering -- I'd say it's beyond the scope of this change. -- Stefano