On Thu, Oct 13, 2022 at 04:17:30AM +0200, Stefano Brivio wrote:On Tue, 11 Oct 2022 16:40:16 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Yeah, I couldn't think of something that was both brief and specific, so I went with brief.We drop our own capabilities, but it's possible that processes we exec() could gain extra privilege via file capabilities. It shouldn't be possible for us to exec() anyway due to seccomp() and our filesystem isolation. But just in case, zero the bounding and inheritable capability sets to prevent any such child from gainin privilege. Note that we do this *after* spawning the pasta shell/command (if any), because we do want the user to be able to give that privilege if they want. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- isolation.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 56 insertions(+) diff --git a/isolation.c b/isolation.c index 2468f84..e1a024d 100644 --- a/isolation.c +++ b/isolation.c @@ -120,6 +120,61 @@ static void drop_caps_ep_except(uint64_t keep) } } +/** + * clamp_caps() - Prevent any children from gaining caps"clamp" doesn't sound very specific or clear. caps_drop_inherit_bound() would actually tell me what the function does, but it's a bit of a mouthful in comparison. I'm fine with both, really, but if you have a better idea...Ok.+ * + * This drops all capabilities from both the inheritable and the + * bounding set. This means that any exec()ed processes can't gain + * capabilities, even if they have file capabilities which would grant + * them. We shouldn't ever exec() in any case, but this provides an + * additional layer of protection. Executing this requires + * CAP_SETPCAP, which we will have within our userns. + * + * Note that dropping capabilites from the bounding set limits + * exec()ed processes, but does not remove them from the effective or + * permitted sets, so it doesn't reduce our own capabilities. + */ +static void clamp_caps(void) +{ + struct __user_cap_header_struct hdr = { + .version = CAP_VERSION, + .pid = 0, + }; + struct __user_cap_data_struct data[CAP_WORDS];For consistency, I'd move this before hdr.Why what? We're not trying to alter the permitted or effective sets here, so we're doing a capget() first, zeroing the inheritable field, then setting it back again.+ int i; + + for (i = 0; i < 64; i++) { + /* Some errors can be ignored: + * - EINVAL, we'll get this for all values in 0..63 + * that are not actually allocated caps + * - EPERM, we'll get this if we don't have + * CAP_SETPCAP, which can happen if using + * --netns-only. We don't need CAP_SETPCAP for + * normal operation, so carry on without it. + */ + if (prctl(PR_CAPBSET_DROP, i, 0, 0, 0) && + errno != EINVAL && errno != EPERM) { + err("Couldn't drop cap %i from bounding set: %s", + i, strerror(errno)); + exit(EXIT_FAILURE); + } + } + + if (syscall(SYS_capget, &hdr, data)) { + err("Couldn't get current capabilities: %s", strerror(errno)); + exit(EXIT_FAILURE); + } + + for (i = 0; i < CAP_WORDS; i++) + data[i].inheritable = 0;Any specific reason why? Initialisers can have variable sizes to some extent, but if there's a reason why it can't be done, perhaps that would warrant a comment here.-- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson+ + if (syscall(SYS_capset, &hdr, data)) { + err("Couldn't drop inheritable capabilities: %s", + strerror(errno)); + exit(EXIT_FAILURE); + } +} + /** * isolate_initial() - Early, config independent self isolation * @@ -287,6 +342,7 @@ int isolate_prefork(struct ctx *c) ns_caps |= 1UL << CAP_NET_BIND_SERVICE; } + clamp_caps(); drop_caps_ep_except(ns_caps); return 0;