...instead of one fourth.

On the main() -> conf() -> nl_sock_init() call path, LTO from gcc 12 on
(at least) x86_64 decides to inline... everything: nl_sock_init() is
effectively part of main(), after commit 3e2eb4337bc0 ("conf: Bind
inbound ports with CAP_NET_BIND_SERVICE before isolate_user()").

This means we exceed the maximum stack size, and we get SIGSEGV, under
any condition, at start time, as reported by Andrea on a recent build
for CentOS Stream 9.

The calculation of NS_FN_STACK_SIZE, which is the stack size we reserve
for clones, was previously obtained by dividing the maximum stack size
by two, to avoid an explicit check on architecture (on PA-RISC, also
known as hppa, the stack grows up, so we point the clone to the middle
of this area), and then further dividing by two to allow for any
additional usage in the caller.

Well, if there are essentially no function calls anymore, this is not
enough. Divide it by eight instead, which is anyway much more than any
clone()d callee could possibly need.

I think this is robust, so it's a fix in some sense. Strictly speaking,
though, we have no formal guarantee that this isn't either too little
or too much.

What we should do, eventually: check the clone()d callees, of which
there are just thirteen at the moment. Note down their stack usage
(they are mostly small helpers), with bonus points for an automated way
at build time, quadruple that or so to allow for extreme clumsiness,
and use that as NS_FN_STACK_SIZE. Perhaps introduce a specific
condition for hppa.

Reported-by: Andrea Bolognani <abologna(a)redhat.com>
Fixes: 3e2eb4337bc0 ("conf: Bind inbound ports with CAP_NET_BIND_SERVICE before isolate_user()")
Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>
---
 util.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/util.h b/util.h
index 27829b1..c498a80 100644
--- a/util.h
+++ b/util.h
@@ -72,7 +72,7 @@
 #define IPV4_IS_LOOPBACK(addr)					\
 	((addr) >> IN_CLASSA_NSHIFT == IN_LOOPBACKNET)
 
-#define NS_FN_STACK_SIZE	(RLIMIT_STACK_VAL * 1024 / 4)
+#define NS_FN_STACK_SIZE	(RLIMIT_STACK_VAL * 1024 / 8)
 #define NS_CALL(fn, arg)					\
 	do {							\
 		char ns_fn_stack[NS_FN_STACK_SIZE];		\
-- 
2.35.1
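For context, here is a minimal C sketch of the pattern NS_CALL() builds
on (an illustration under assumptions, not the actual passt code): the
child's stack is an ordinary buffer on the caller's stack, and clone()
is pointed at its middle, so the same code works whether the stack
grows down (x86_64 and most others) or up (hppa). The size, flags and
names below are illustrative only.

/* Sketch of a clone() running on a buffer carved out of the caller's
 * stack; not the passt implementation, sizes and flags are assumed.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stddef.h>
#include <sys/wait.h>

static int ns_fn(void *arg)
{
	(void)arg;
	return 0;	/* whatever needs to run in the new context */
}

static void ns_call_sketch(void)
{
	char ns_fn_stack[256 * 1024];	/* assumed size, cf. NS_FN_STACK_SIZE */
	pid_t pid;

	/* The buffer lives in *this* frame, so it counts against the
	 * caller's stack limit: the caller's own usage plus this buffer
	 * must fit under RLIMIT_STACK. Pointing the child at the middle
	 * of the buffer makes the growth direction irrelevant, at the
	 * cost of half the usable space.
	 */
	pid = clone(ns_fn, ns_fn_stack + sizeof(ns_fn_stack) / 2,
		    CLONE_VM | SIGCHLD, NULL);
	if (pid > 0)
		waitpid(pid, NULL, 0);
}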
On Sat, 22 Oct 2022 08:45:03 +0200
Stefano Brivio <sbrivio(a)redhat.com> wrote:

> ...instead of one fourth.
>
> [...]
>
> Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>
> ---

I posted this in any case for (later) review, but I'm actually applying
it right away, given that some builds are completely unusable
otherwise.

I guess it would make sense to extend build ("distro") tests with some
compiler flags, common and less common.

-- 
Stefano
On Sat, Oct 22, 2022 at 10:15:35AM +0200, Stefano Brivio wrote:
> On Sat, 22 Oct 2022 08:45:03 +0200
> Stefano Brivio <sbrivio(a)redhat.com> wrote:
>
> > ...instead of one fourth.
> >
> > [...]
>
> I posted this in any case for (later) review, but I'm actually
> applying it right away, given that some builds are completely
> unusable otherwise.
>
> I guess it would make sense to extend build ("distro") tests with
> some compiler flags, common and less common.

That's sensible, however, this patch confuses me. I don't really
understand how reducing the stack size is avoiding a SEGV, regardless
of what LTO does.

The fact that we're basing the runtime stack size on a limit that's
from build time also doesn't really make sense to me.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
On Mon, Oct 24, 2022 at 10:36:19AM +1100, David Gibson wrote:
> On Sat, Oct 22, 2022 at 10:15:35AM +0200, Stefano Brivio wrote:
> > [...]
>
> That's sensible, however, this patch confuses me. I don't really
> understand how reducing the stack size is avoiding a SEGV, regardless
> of what LTO does.

Shortly after I wrote this, I realized what the issue was. IIUC, the
problem is basically because we're allocating the stack of the
sub-thread as a buffer on the stack of the main thread, so the main
thread stack has to have room for both.

> The fact that we're basing the runtime stack size on a limit that's
> from build time also doesn't really make sense to me.

This aspect still seems pretty bogus to me.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
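To put assumed numbers on that (the thread itself doesn't state any):
with the common 8 MiB RLIMIT_STACK default, the old NS_FN_STACK_SIZE
came out to 8 MiB / 4 = 2 MiB, and that buffer is carved out of the
caller's own frame. Once LTO folds conf() and nl_sock_init() into
main(), their locals and the on-stack clone buffer all end up in that
single frame, which has to fit under the same 8 MiB limit together
with everything else main() uses; shrinking the per-buffer share to
8 MiB / 8 = 1 MiB is what restores the headroom. The 8 MiB figure is an
assumption for illustration only.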
On Mon, 24 Oct 2022 10:52:55 +1100
David Gibson <david(a)gibson.dropbear.id.au> wrote:

> On Mon, Oct 24, 2022 at 10:36:19AM +1100, David Gibson wrote:
> > [...]
> >
> > That's sensible, however, this patch confuses me. I don't really
> > understand how reducing the stack size is avoiding a SEGV,
> > regardless of what LTO does.
>
> Shortly after I wrote this, I realized what the issue was. IIUC, the
> problem is basically because we're allocating the stack of the
> sub-thread as a buffer on the stack of the main thread, so the main
> thread stack has to have room for both.

Right, I could have phrased this better.

> > The fact that we're basing the runtime stack size on a limit that's
> > from build time also doesn't really make sense to me.
>
> This aspect still seems pretty bogus to me.

It is. It's still a useful approximation, because those limits are
rarely set to non-default per-architecture values (all the
distributions and versions we test happen to have, on a given
architecture, the same value), and at the same time they vary wildly
depending on the architecture. And with that, we avoid bug-prone VLAs
or, worse, alloca(). And if users set limits to substantially lower
values, typically other programs won't be able to run either.

But... we don't really need that much. With 16 KiB, you usually won't
even be able to run ls. With 128 KiB, gimp crashes for me. We probably
need something in between, but that implies we shouldn't just give
NS_CALL()ed helpers "a lot", but rather (some multiple of) what they
actually need.

They're not many and they're quite short, so we could note down some
manual calculations, or we could automate that with nm, or pahole, or
gcc -save-temps... or even some objdump script. I'm not sure that's
worth adding a further compilation phase, though.
We have well-known architecture-specific type sizes and alignments on
Linux, and we already do something similar for AVX2 alignments, so I
would try to do this manually, take some reasonable margin on top, and
maybe add an explicit stack guard (I think we shouldn't rely on
-fstack-protector alone, or at all).

-- 
Stefano
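On the build-time versus run-time limit point above, the run-time
alternative would look roughly like the sketch below (an assumption,
not something the thread settles on): query RLIMIT_STACK at startup
instead of baking RLIMIT_STACK_VAL in at build time, at the cost of the
buffer no longer being a fixed-size array, which is exactly the
VLA/alloca territory Stefano mentions. The function name and fallback
value are hypothetical.

#include <stddef.h>
#include <sys/resource.h>

/* Hypothetical run-time counterpart to NS_FN_STACK_SIZE: ask the
 * kernel for the current stack limit instead of relying on a value
 * recorded at build time.
 */
static size_t ns_fn_stack_size_runtime(void)
{
	struct rlimit r;

	if (getrlimit(RLIMIT_STACK, &r) || r.rlim_cur == RLIM_INFINITY)
		return 1024 * 1024;		/* arbitrary 1 MiB fallback */

	return (size_t)(r.rlim_cur / 8);	/* same 1/8 share as the macro */
}

And one possible shape for the explicit stack guard idea, again a
hedged sketch rather than an agreed design: poison a few bytes at the
end of the clone buffer that the child must not reach, and check them
after waitpid(). Names, sizes and pattern are made up for illustration.

#include <stddef.h>
#include <string.h>

#define GUARD_BYTES	64
#define GUARD_PATTERN	0x5a

/* With the child's entry point at buf + size / 2 and a downward-growing
 * stack, the bytes at the start of the buffer are the last ones the
 * child may legally touch, so put the guard there; on hppa (stack
 * growing up) it would go at the opposite end instead.
 */
static void stack_guard_arm(char *buf)
{
	memset(buf, GUARD_PATTERN, GUARD_BYTES);
}

/* Call after waitpid(): returns 0 if the child overwrote the guard,
 * i.e. it used up (nearly) all of its half of the buffer.
 */
static int stack_guard_intact(const char *buf)
{
	size_t i;

	for (i = 0; i < GUARD_BYTES; i++) {
		if ((unsigned char)buf[i] != GUARD_PATTERN)
			return 0;
	}
	return 1;
}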