Martin reports that, with Fedora Linux kernel version kernel-core-6.9.0-0.rc0.20240313gitb0546776ad3f.4.fc41.x86_64, including commit 87d381973e49 ("genetlink: fit NLMSG_DONE into same read() as families"), pasta doesn't exit once the network namespace is gone. Actually, pasta is completely non-functional, at least with default options, because nl_route_dup(), which duplicates routes from the parent namespace into the target namespace at start-up, is stuck on a second receive operation for RTM_GETROUTE. However, with that commit, the kernel is now able to fit the whole response, including the NLMSG_DONE message, into a single datagram, so no further messages will be received. It turns out that commit 4d6e9d0816e2 ("netlink: Always process all responses to a netlink request") accidentally relied on the fact that we would always get at least two datagrams as a response to RTM_GETROUTE. That is, the test to check if we expect another datagram, is based on the 'status' variable, which is 0 if we just parsed NLMSG_DONE, but we'll also expect another datagram if NLMSG_OK on the last message is false. But NLMSG_OK with a zero length is always false. The problem is that we don't distinguish if status is zero because we got a NLMSG_DONE message, or because we processed all the available datagram bytes. Introduce an explicit check on NLMSG_DONE. We should probably refactor this slightly, for example by introducing a special return code from nl_status(), but this is probably the least invasive fix for the issue at hand. Reported-by: Martin Pitt <mpitt(a)redhat.com> Link: https://github.com/containers/podman/issues/22052 Fixes: 4d6e9d0816e2 ("netlink: Always process all responses to a netlink request") Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> --- netlink.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/netlink.c b/netlink.c index 9e7cccb..20de9b3 100644 --- a/netlink.c +++ b/netlink.c @@ -525,7 +525,8 @@ int nl_route_dup(int s_src, unsigned int ifi_src, } } - if (!NLMSG_OK(nh, status) || status > 0) { + if (nh->nlmsg_type != NLMSG_DONE && + (!NLMSG_OK(nh, status) || status > 0)) { /* Process any remaining datagrams in a different * buffer so we don't overwrite the first one. */ -- 2.39.2
On 15/03/2024 12:24, Stefano Brivio wrote:Martin reports that, with Fedora Linux kernel version kernel-core-6.9.0-0.rc0.20240313gitb0546776ad3f.4.fc41.x86_64, including commit 87d381973e49 ("genetlink: fit NLMSG_DONE into same read() as families"), pasta doesn't exit once the network namespace is gone. Actually, pasta is completely non-functional, at least with default options, because nl_route_dup(), which duplicates routes from the parent namespace into the target namespace at start-up, is stuck on a second receive operation for RTM_GETROUTE. However, with that commit, the kernel is now able to fit the whole response, including the NLMSG_DONE message, into a single datagram, so no further messages will be received. It turns out that commit 4d6e9d0816e2 ("netlink: Always process all responses to a netlink request") accidentally relied on the fact that we would always get at least two datagrams as a response to RTM_GETROUTE. That is, the test to check if we expect another datagram, is based on the 'status' variable, which is 0 if we just parsed NLMSG_DONE, but we'll also expect another datagram if NLMSG_OK on the last message is false. But NLMSG_OK with a zero length is always false. The problem is that we don't distinguish if status is zero because we got a NLMSG_DONE message, or because we processed all the available datagram bytes. Introduce an explicit check on NLMSG_DONE. We should probably refactor this slightly, for example by introducing a special return code from nl_status(), but this is probably the least invasive fix for the issue at hand. Reported-by: Martin Pitt <mpitt(a)redhat.com> Link: https://github.com/containers/podman/issues/22052 Fixes: 4d6e9d0816e2 ("netlink: Always process all responses to a netlink request") Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> --- netlink.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/netlink.c b/netlink.c index 9e7cccb..20de9b3 100644 --- a/netlink.c +++ b/netlink.c @@ -525,7 +525,8 @@ int nl_route_dup(int s_src, unsigned int ifi_src, } } - if (!NLMSG_OK(nh, status) || status > 0) { + if (nh->nlmsg_type != NLMSG_DONE && + (!NLMSG_OK(nh, status) || status > 0)) { /* Process any remaining datagrams in a different * buffer so we don't overwrite the first one. */I was about to add my tested-by when I noticed a weird thing, but that happens only on the new kernel as well: On the host $ ip route default via 192.168.122.1 dev enp1s0 proto dhcp src 192.168.122.92 metric 100 192.168.122.0/24 dev enp1s0 proto kernel scope link src 192.168.122.92 metric 100 ./pasta --config-net ip route default via 192.168.122.1 dev enp1s0 proto dhcp metric 100 192.168.122.0/24 dev enp1s0 proto kernel scope link src 192.168.122.92 192.168.122.0/24 dev enp1s0 proto kernel scope link metric 100 It seems we now have the same local route duplicated for some reason? I am not sure if it is caused by this patch as I cannot test versions without this patch on a newer kernel. I can however confirm that this patch works and it no longer hangs.
On Fri, 15 Mar 2024 14:11:40 +0100 Paul Holzinger <pholzing(a)redhat.com> wrote:On 15/03/2024 12:24, Stefano Brivio wrote:I can reproduce this, it looks rather weird at the moment. Let's say I have two IPv4 routes on the host: $ ip ro sh default via 88.198.0.161 dev eth0 proto dhcp src 88.198.0.164 metric 100 88.198.0.160/27 dev eth0 proto kernel scope link src 88.198.0.164 metric 100 if I stop pasta after it re-creates the default route (and not the subnet one, yet), say, at this point (the message with padding confuses strace so we don't get the full decoding, but it's valid): recvfrom(5, [[{nlmsg_len=68, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=13, nlmsg_pid=28341}, {rtm_family=AF_INET, rtm_dst_len=0, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_DHCP, rtm_scope=RT_SCOPE_UNIVERSE, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=8, nla_type=RTA_PRIORITY}, 100], [{nla_len=8, nla_type=RTA_PREFSRC}, inet_addr("88.198.0.164")], [{nla_len=8, nla_type=RTA_GATEWAY}, inet_addr("88.198.0.161")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("eth0")]]], [{nlmsg_len=68, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=13, nlmsg_pid=28341}, {rtm_family=AF_INET, rtm_dst_len=27, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_KERNEL, rtm_scope=RT_SCOPE_LINK, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=8, nla_type=RTA_DST}, inet_addr("88.198.0.160")], [{ nla_len=8, nla_type=RTA_PRIORITY}, 100], [{nla_len=8, nla_type=RTA_PREFSRC}, inet_addr("88.198.0.164")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("eth0")]]], [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=13, nlmsg_pid=28341}, 0]], 65536, 0, NULL, NULL) = 156 sendto(7, [{nlmsg_len=68, nlmsg_type=0x18 /* NLMSG_??? */, nlmsg_flags=NLM_F_REQUEST|NLM_F_MULTI|NLM_F_ACK|0x400, nlmsg_seq=14, nlmsg_pid=0}, "\x02\x00\x00\x00\xfe\x10\x00\x01\x00\x00\x00\x00\x08\x00\x0f\x00\xfe\x00\x00\x00\x08\x00\x06\x00\x64\x00\x00\x00\x08\x00\x00\x00"...], 68, 0, NULL, 0) = 68 I see that the kernel already created both in the target namespace: # ip ro sh default via 88.198.0.161 dev eth0 proto dhcp metric 100 88.198.0.160/27 dev eth0 proto kernel scope link src 88.198.0.164 ...and the second one gets the RTA_PREFSRC attribute we just stripped. I'm still trying to find out what might cause this, even if it looks harmless. -- StefanoMartin reports that, with Fedora Linux kernel version kernel-core-6.9.0-0.rc0.20240313gitb0546776ad3f.4.fc41.x86_64, including commit 87d381973e49 ("genetlink: fit NLMSG_DONE into same read() as families"), pasta doesn't exit once the network namespace is gone. Actually, pasta is completely non-functional, at least with default options, because nl_route_dup(), which duplicates routes from the parent namespace into the target namespace at start-up, is stuck on a second receive operation for RTM_GETROUTE. However, with that commit, the kernel is now able to fit the whole response, including the NLMSG_DONE message, into a single datagram, so no further messages will be received. It turns out that commit 4d6e9d0816e2 ("netlink: Always process all responses to a netlink request") accidentally relied on the fact that we would always get at least two datagrams as a response to RTM_GETROUTE. That is, the test to check if we expect another datagram, is based on the 'status' variable, which is 0 if we just parsed NLMSG_DONE, but we'll also expect another datagram if NLMSG_OK on the last message is false. But NLMSG_OK with a zero length is always false. The problem is that we don't distinguish if status is zero because we got a NLMSG_DONE message, or because we processed all the available datagram bytes. Introduce an explicit check on NLMSG_DONE. We should probably refactor this slightly, for example by introducing a special return code from nl_status(), but this is probably the least invasive fix for the issue at hand. Reported-by: Martin Pitt <mpitt(a)redhat.com> Link: https://github.com/containers/podman/issues/22052 Fixes: 4d6e9d0816e2 ("netlink: Always process all responses to a netlink request") Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> --- netlink.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/netlink.c b/netlink.c index 9e7cccb..20de9b3 100644 --- a/netlink.c +++ b/netlink.c @@ -525,7 +525,8 @@ int nl_route_dup(int s_src, unsigned int ifi_src, } } - if (!NLMSG_OK(nh, status) || status > 0) { + if (nh->nlmsg_type != NLMSG_DONE && + (!NLMSG_OK(nh, status) || status > 0)) { /* Process any remaining datagrams in a different * buffer so we don't overwrite the first one. */I was about to add my tested-by when I noticed a weird thing, but that happens only on the new kernel as well: On the host $ ip route default via 192.168.122.1 dev enp1s0 proto dhcp src 192.168.122.92 metric 100 192.168.122.0/24 dev enp1s0 proto kernel scope link src 192.168.122.92 metric 100 ./pasta --config-net ip route default via 192.168.122.1 dev enp1s0 proto dhcp metric 100 192.168.122.0/24 dev enp1s0 proto kernel scope link src 192.168.122.92 192.168.122.0/24 dev enp1s0 proto kernel scope link metric 100 It seems we now have the same local route duplicated for some reason?
On Fri, 15 Mar 2024 15:52:50 +0100 Stefano Brivio <sbrivio(a)redhat.com> wrote:On Fri, 15 Mar 2024 14:11:40 +0100 Paul Holzinger <pholzing(a)redhat.com> wrote:[{nla_len=8, nla_type=RTA_PRIORITY}, 100], [{nla_len=8, nla_type=RTA_PREFSRC}, inet_addr("88.198.0.164")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("eth0")]]], [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=13, nlmsg_pid=28341}, 0]], 65536, 0, NULL, NULL) = 156On 15/03/2024 12:24, Stefano Brivio wrote:I can reproduce this, it looks rather weird at the moment. Let's say I have two IPv4 routes on the host: $ ip ro sh default via 88.198.0.161 dev eth0 proto dhcp src 88.198.0.164 metric 100 88.198.0.160/27 dev eth0 proto kernel scope link src 88.198.0.164 metric 100 if I stop pasta after it re-creates the default route (and not the subnet one, yet), say, at this point (the message with padding confuses strace so we don't get the full decoding, but it's valid): recvfrom(5, [[{nlmsg_len=68, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=13, nlmsg_pid=28341}, {rtm_family=AF_INET, rtm_dst_len=0, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_DHCP, rtm_scope=RT_SCOPE_UNIVERSE, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=8, nla_type=RTA_PRIORITY}, 100], [{nla_len=8, nla_type=RTA_PREFSRC}, inet_addr("88.198.0.164")], [{nla_len=8, nla_type=RTA_GATEWAY}, inet_addr("88.198.0.161")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("eth0")]]], [{nlmsg_len=68, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=13, nlmsg_pid=28341}, {rtm_family=AF_INET, rtm_dst_len=27, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_KERNEL, rtm_scope=RT_SCOPE_LINK, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=8, nla_type=RTA_DST}, inet_addr("88.198.0.160")],Martin reports that, with Fedora Linux kernel version kernel-core-6.9.0-0.rc0.20240313gitb0546776ad3f.4.fc41.x86_64, including commit 87d381973e49 ("genetlink: fit NLMSG_DONE into same read() as families"), pasta doesn't exit once the network namespace is gone. Actually, pasta is completely non-functional, at least with default options, because nl_route_dup(), which duplicates routes from the parent namespace into the target namespace at start-up, is stuck on a second receive operation for RTM_GETROUTE. However, with that commit, the kernel is now able to fit the whole response, including the NLMSG_DONE message, into a single datagram, so no further messages will be received. It turns out that commit 4d6e9d0816e2 ("netlink: Always process all responses to a netlink request") accidentally relied on the fact that we would always get at least two datagrams as a response to RTM_GETROUTE. That is, the test to check if we expect another datagram, is based on the 'status' variable, which is 0 if we just parsed NLMSG_DONE, but we'll also expect another datagram if NLMSG_OK on the last message is false. But NLMSG_OK with a zero length is always false. The problem is that we don't distinguish if status is zero because we got a NLMSG_DONE message, or because we processed all the available datagram bytes. Introduce an explicit check on NLMSG_DONE. We should probably refactor this slightly, for example by introducing a special return code from nl_status(), but this is probably the least invasive fix for the issue at hand. Reported-by: Martin Pitt <mpitt(a)redhat.com> Link: https://github.com/containers/podman/issues/22052 Fixes: 4d6e9d0816e2 ("netlink: Always process all responses to a netlink request") Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> --- netlink.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/netlink.c b/netlink.c index 9e7cccb..20de9b3 100644 --- a/netlink.c +++ b/netlink.c @@ -525,7 +525,8 @@ int nl_route_dup(int s_src, unsigned int ifi_src, } } - if (!NLMSG_OK(nh, status) || status > 0) { + if (nh->nlmsg_type != NLMSG_DONE && + (!NLMSG_OK(nh, status) || status > 0)) { /* Process any remaining datagrams in a different * buffer so we don't overwrite the first one. */I was about to add my tested-by when I noticed a weird thing, but that happens only on the new kernel as well: On the host $ ip route default via 192.168.122.1 dev enp1s0 proto dhcp src 192.168.122.92 metric 100 192.168.122.0/24 dev enp1s0 proto kernel scope link src 192.168.122.92 metric 100 ./pasta --config-net ip route default via 192.168.122.1 dev enp1s0 proto dhcp metric 100 192.168.122.0/24 dev enp1s0 proto kernel scope link src 192.168.122.92 192.168.122.0/24 dev enp1s0 proto kernel scope link metric 100 It seems we now have the same local route duplicated for some reason?sendto(7, [{nlmsg_len=68, nlmsg_type=0x18 /* NLMSG_??? */, nlmsg_flags=NLM_F_REQUEST|NLM_F_MULTI|NLM_F_ACK|0x400, nlmsg_seq=14, nlmsg_pid=0}, "\x02\x00\x00\x00\xfe\x10\x00\x01\x00\x00\x00\x00\x08\x00\x0f\x00\xfe\x00\x00\x00\x08\x00\x06\x00\x64\x00\x00\x00\x08\x00\x00\x00"...], 68, 0, NULL, 0) = 68 I see that the kernel already created both in the target namespace: # ip ro sh default via 88.198.0.161 dev eth0 proto dhcp metric 100 88.198.0.160/27 dev eth0 proto kernel scope link src 88.198.0.164Ah, no, the issue is pre-existing, you can actually see the same on an older kernel, without this patch. The subnet route is added autonomously by the kernel after we add the address via RTM_NEWADDR with a ifa_prefixlen < 32 attribute. Then we add it "again" via RTM_NEWROUTE. We could probably avoid the creation of any route on RTM_NEWADDR giving 32 or 128 as prefix length, because anyway we'll explicitly copy the subnet route later, but that needs a bit more pondering -- I'd say it's beyond the scope of this change. -- Stefano
On 15/03/2024 16:17, Stefano Brivio wrote:On Fri, 15 Mar 2024 15:52:50 +0100 Stefano Brivio <sbrivio(a)redhat.com> wrote: > On Fri, 15 Mar 2024 14:11:40 +0100 > Paul Holzinger <pholzing(a)redhat.com> wrote: > >> On 15/03/2024 12:24, Stefano Brivio wrote: >>> Martin reports that, with Fedora Linux kernel version >>> kernel-core-6.9.0-0.rc0.20240313gitb0546776ad3f.4.fc41.x86_64, >>> including commit 87d381973e49 ("genetlink: fit NLMSG_DONE into same >>> read() as families"), pasta doesn't exit once the network namespace >>> is gone. >>> >>> Actually, pasta is completely non-functional, at least with default >>> options, because nl_route_dup(), which duplicates routes from the >>> parent namespace into the target namespace at start-up, is stuck on >>> a second receive operation for RTM_GETROUTE. >>> >>> However, with that commit, the kernel is now able to fit the whole >>> response, including the NLMSG_DONE message, into a single datagram, >>> so no further messages will be received. >>> >>> It turns out that commit 4d6e9d0816e2 ("netlink: Always process all >>> responses to a netlink request") accidentally relied on the fact that >>> we would always get at least two datagrams as a response to >>> RTM_GETROUTE. >>> >>> That is, the test to check if we expect another datagram, is based >>> on the 'status' variable, which is 0 if we just parsed NLMSG_DONE, >>> but we'll also expect another datagram if NLMSG_OK on the last >>> message is false. But NLMSG_OK with a zero length is always false. >>> >>> The problem is that we don't distinguish if status is zero because >>> we got a NLMSG_DONE message, or because we processed all the >>> available datagram bytes. >>> >>> Introduce an explicit check on NLMSG_DONE. We should probably >>> refactor this slightly, for example by introducing a special return >>> code from nl_status(), but this is probably the least invasive fix >>> for the issue at hand. >>> >>> Reported-by: Martin Pitt <mpitt(a)redhat.com> >>> Link: https://github.com/containers/podman/issues/22052 >>> Fixes: 4d6e9d0816e2 ("netlink: Always process all responses to a netlink request") >>> Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>Tested-by: Paul Holzinger <pholzing(a)redhat.com>Yes, agreed. This fix looks correct and fixes the hang which is the most important thing.[{nla_len=8, nla_type=RTA_PRIORITY}, 100], [{nla_len=8, nla_type=RTA_PREFSRC}, inet_addr("88.198.0.164")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("eth0")]]], [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=13, nlmsg_pid=28341}, 0]], 65536, 0, NULL, NULL) = 156I can reproduce this, it looks rather weird at the moment. Let's say I have two IPv4 routes on the host: $ ip ro sh default via 88.198.0.161 dev eth0 proto dhcp src 88.198.0.164 metric 100 88.198.0.160/27 dev eth0 proto kernel scope link src 88.198.0.164 metric 100 if I stop pasta after it re-creates the default route (and not the subnet one, yet), say, at this point (the message with padding confuses strace so we don't get the full decoding, but it's valid): recvfrom(5, [[{nlmsg_len=68, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=13, nlmsg_pid=28341}, {rtm_family=AF_INET, rtm_dst_len=0, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_DHCP, rtm_scope=RT_SCOPE_UNIVERSE, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=8, nla_type=RTA_PRIORITY}, 100], [{nla_len=8, nla_type=RTA_PREFSRC}, inet_addr("88.198.0.164")], [{nla_len=8, nla_type=RTA_GATEWAY}, inet_addr("88.198.0.161")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("eth0")]]], [{nlmsg_len=68, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=13, nlmsg_pid=28341}, {rtm_family=AF_INET, rtm_dst_len=27, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_KERNEL, rtm_scope=RT_SCOPE_LINK, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=8, nla_type=RTA_DST}, inet_addr("88.198.0.160")],--- netlink.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/netlink.c b/netlink.c index 9e7cccb..20de9b3 100644 --- a/netlink.c +++ b/netlink.c @@ -525,7 +525,8 @@ int nl_route_dup(int s_src, unsigned int ifi_src, } } - if (!NLMSG_OK(nh, status) || status > 0) { + if (nh->nlmsg_type != NLMSG_DONE && + (!NLMSG_OK(nh, status) || status > 0)) { /* Process any remaining datagrams in a different * buffer so we don't overwrite the first one. */I was about to add my tested-by when I noticed a weird thing, but that happens only on the new kernel as well: On the host $ ip route default via 192.168.122.1 dev enp1s0 proto dhcp src 192.168.122.92 metric 100 192.168.122.0/24 dev enp1s0 proto kernel scope link src 192.168.122.92 metric 100 ./pasta --config-net ip route default via 192.168.122.1 dev enp1s0 proto dhcp metric 100 192.168.122.0/24 dev enp1s0 proto kernel scope link src 192.168.122.92 192.168.122.0/24 dev enp1s0 proto kernel scope link metric 100 It seems we now have the same local route duplicated for some reason?sendto(7, [{nlmsg_len=68, nlmsg_type=0x18 /* NLMSG_??? */, nlmsg_flags=NLM_F_REQUEST|NLM_F_MULTI|NLM_F_ACK|0x400, nlmsg_seq=14, nlmsg_pid=0}, "\x02\x00\x00\x00\xfe\x10\x00\x01\x00\x00\x00\x00\x08\x00\x0f\x00\xfe\x00\x00\x00\x08\x00\x06\x00\x64\x00\x00\x00\x08\x00\x00\x00"...], 68, 0, NULL, 0) = 68 I see that the kernel already created both in the target namespace: # ip ro sh default via 88.198.0.161 dev eth0 proto dhcp metric 100 88.198.0.160/27 dev eth0 proto kernel scope link src 88.198.0.164Ah, no, the issue is pre-existing, you can actually see the same on an older kernel, without this patch. The subnet route is added autonomously by the kernel after we add the address via RTM_NEWADDR with a ifa_prefixlen < 32 attribute. Then we add it "again" via RTM_NEWROUTE. We could probably avoid the creation of any route on RTM_NEWADDR giving 32 or 128 as prefix length, because anyway we'll explicitly copy the subnet route later, but that needs a bit more pondering -- I'd say it's beyond the scope of this change.
On Fri, Mar 15, 2024 at 12:24:32PM +0100, Stefano Brivio wrote:Martin reports that, with Fedora Linux kernel version kernel-core-6.9.0-0.rc0.20240313gitb0546776ad3f.4.fc41.x86_64, including commit 87d381973e49 ("genetlink: fit NLMSG_DONE into same read() as families"), pasta doesn't exit once the network namespace is gone. Actually, pasta is completely non-functional, at least with default options, because nl_route_dup(), which duplicates routes from the parent namespace into the target namespace at start-up, is stuck on a second receive operation for RTM_GETROUTE. However, with that commit, the kernel is now able to fit the whole response, including the NLMSG_DONE message, into a single datagram, so no further messages will be received. It turns out that commit 4d6e9d0816e2 ("netlink: Always process all responses to a netlink request") accidentally relied on the fact that we would always get at least two datagrams as a response to RTM_GETROUTE.Huh, oops. I feel like the expectation that NLMSG_DONE wold appear in its own datagram was pre-existing, althogh the conseqences of it not being so might have been different and less serious. I recall noticing that the code expected that, and assuming that was part of the ABI.That is, the test to check if we expect another datagram, is based on the 'status' variable, which is 0 if we just parsed NLMSG_DONE, but we'll also expect another datagram if NLMSG_OK on the last message is false. But NLMSG_OK with a zero length is always false. The problem is that we don't distinguish if status is zero because we got a NLMSG_DONE message, or because we processed all the available datagram bytes. Introduce an explicit check on NLMSG_DONE. We should probably refactor this slightly, for example by introducing a special return code from nl_status(), but this is probably the least invasive fix for the issue at hand.Reviewed-by: David Gibson <david(a)gibson.dropbear.id.au> On the argument above. I do think we need to check over the netlink code to handle this more carefully, though.Reported-by: Martin Pitt <mpitt(a)redhat.com> Link: https://github.com/containers/podman/issues/22052 Fixes: 4d6e9d0816e2 ("netlink: Always process all responses to a netlink request") Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> --- netlink.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/netlink.c b/netlink.c index 9e7cccb..20de9b3 100644 --- a/netlink.c +++ b/netlink.c @@ -525,7 +525,8 @@ int nl_route_dup(int s_src, unsigned int ifi_src, } } - if (!NLMSG_OK(nh, status) || status > 0) { + if (nh->nlmsg_type != NLMSG_DONE && + (!NLMSG_OK(nh, status) || status > 0)) { /* Process any remaining datagrams in a different * buffer so we don't overwrite the first one. */-- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson