capability_get() is a wrapper of capget() syscall and converts its
result to CapabilityQuintet.
This also introduce have_inheritable_cap(), which is similar to
have_effective_cap(). It is currently unused, but will be used later.
capability_quintet_mangle() can be called with capability sets
containing unknown capabilities. Let's not crash when this is the
case but instead ignore the unknown capabilities.
Fixes d5e12dc75e
capability_ambient_set_apply() can be called with capability sets
containing unknown capabilities. Let's not crash when this is the
case but instead ignore the unknown capabilities.
This fixes a crash when running the following command:
"systemd-run -p "AmbientCapabilities=~" --wait --pipe id"
Fixes d5e12dc75e
This generalizes and modernizes the code to acquire set of local caps,
based on the code for this in the condition logic. Uses PidRef, and
acquires the full quintuplet of caps.
This can be considered preparation to one day maybe build without
libcap.
Let's bump the kernel baseline a bit to 4.3 and thus require ambient
caps.
This allows us to remove support for a variety of special casing, most
importantly the ExecStart=!! hack.
While stracing PID1's forking off of children I noticed that every
single forked off child reads cap_last_cap from procfs. That value is a
kernel constant, hence we can save a lot of work if we'd cache it.
Thing is, we actually do cache it, in a thread_local cache field. This
means that the forked off processes (which are considered new threads)
will have to re-query it, even though we already know the result.
Hence, let's get rid of the thread_local stuff (given that the value is
going to be the same for all threads anyway, and we pretty much have a
single thread only anyway). Use an C11 atomic_int instead, which ensures
the value is either initialized or not initialized, but we don't need to
be concerned of partial initialization.
This makes the cap_last_cap reading go away in the children, as strace
shows (since cap_last_cap() is already called by PID 1 before
fork()ing, anyway).
fuzzers randomly fail with the following:
```
==172==WARNING: MemorySanitizer: use-of-uninitialized-value
#0 0x7f41169cb39b in update_argv /work/build/../../src/systemd/src/basic/argv-util.c:96:13
#1 0x7f41169cb39b in rename_process /work/build/../../src/systemd/src/basic/argv-util.c:210:16
#2 0x7f4116b6824e in safe_fork_full /work/build/../../src/systemd/src/basic/process-util.c:1516:21
#3 0x7f4116bffa36 in safe_fork /work/build/../../src/systemd/src/basic/process-util.h:191:16
#4 0x7f4116bffa36 in parse_timestamp /work/build/../../src/systemd/src/basic/time-util.c:1047:13
#5 0x4a61e6 in LLVMFuzzerTestOneInput /work/build/../../src/systemd/src/fuzz/fuzz-time-util.c:16:16
#6 0x4c4a13 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:611:15
#7 0x4c41fa in fuzzer::Fuzzer::RunOne(unsigned char const*, unsigned long, bool, fuzzer::InputInfo*, bool, bool*) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:514:3
#8 0x4c58c9 in fuzzer::Fuzzer::MutateAndTestOne() /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:757:19
#9 0x4c6595 in fuzzer::Fuzzer::Loop(std::__Fuzzer::vector<fuzzer::SizedFile, std::__Fuzzer::allocator<fuzzer::SizedFile> >&) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:895:5
#10 0x4b58ff in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:912:6
#11 0x4def52 in main /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerMain.cpp:20:10
#12 0x7f4115ea3082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082) (BuildId: e678fe54a5d2c2092f8e47eb0b33105e380f7340)
#13 0x41f5ad in _start (build-out/fuzz-time-util+0x41f5ad)
DEDUP_TOKEN: update_argv--rename_process--safe_fork_full
Uninitialized value was created by an allocation of 'fv' in the stack frame of function 'have_effective_cap'
#0 0x7f41169d3540 in have_effective_cap /work/build/../../src/systemd/src/basic/capability-util.c:21
```
Until now, using any form of seccomp while being unprivileged (User=)
resulted in systemd enabling no_new_privs.
There's no need for doing this because:
* We trust the filters we apply
* If User= is set and a process wants to apply a new seccomp filter, it
will need to set no_new_privs itself
An example of application that might want seccomp + !no_new_privs is a
program that wants to run as an unprivileged user but uses file
capabilities to start a web server on a privileged port while
benefitting from a restrictive seccomp profile.
We now keep the privileges needed to do seccomp before calling
enforce_user() and drop them after the seccomp filters are applied.
If the syscall filter doesn't allow the needed syscalls to drop the
privileges, we keep the previous behavior by enabling no_new_privs.
IN C23, thread_local is a reserved keyword and we shall therefore
do nothing to redefine it. glibc has it defined for older standard
version with the right conditions.
v2 by Yu Watanabe:
Move the definition to missing_threads.h like the way we define e.g.
missing syscalls or missing definitions, and include it by the users.
Co-authored-by: Yu Watanabe <watanabe.yu+github@gmail.com>
We should be more careful with distinguishing the cases "all bits set in
caps mask" from "cap mask invalid". We so far mostly used UINT64_MAX for
both, which is not correct though (as it would mean
AmbientCapabilities=~0 followed by AmbientCapabilities=0) would result
in capability 63 to be set (which we don't really allow, since that
means unset).
We refuse it otherwise currently, simply because we cannot store it in a
uint64_t caps mask value anymore while retaining the ability to use
UINT64_MAX as "unset" marker.
The check actually was in place already, just one off.
Up to now the capability CAP_SETPCAP was raised implicitly in the
function capability_bounding_set_drop.
This functionality is moved into a new function
(capability_gain_cap_setpcap).
The new function optionally provides the capability set as it was
before raisining CAP_SETPCAP.
We never return anything higher than 63, so using "long unsigned"
as the type only confused the reader. (We can still use "long unsigned"
and safe_atolu() to parse the kernel file.)
https://github.com/systemd/systemd/pull/14133 made
capability_ambient_set_apply() acquire capabilities that were explicitly
asked for and drop all others. This change means the function is called
even with an empty capability set, opening up a code path for users
without ambient capabilities to call this function. This function will
error with EINVAL out on kernels < 4.3 because PR_CAP_AMBIENT is not
understood. This turns capability_ambient_set_apply() into a noop for
kernels < 4.3
Fixes https://github.com/systemd/systemd/issues/15225
The OCI changes in #9762 broke a use case in which we use nspawn from
inside a container that has dropped capabilities from the bounding set
that nspawn expected to retain. In an attempt to keep OCI compliance
and support our use case, I made hard failing on setting capabilities
not in the bounding set optional (hard fail if using OCI and log only
if using nspawn cmdline).
Fixes#12539
cap_last_cap() returns the last valid cap (instead of the number of
valid caps). to iterate through all known caps we hence need to use a <=
check, and not a < check like for all other cases. We got this right
usually, but in three cases we did not.