Let's switch things around, and move the internals of safe_fork_full() into
pidref_safe_fork_full() and make safe_fork_full() a trivial wrapper on top
of pidref_safe_fork_full().
Note that this drops a lot of "const" qualifiers on PidRef arguments.
That's because pidref_is_self() suddenly might end changing the PidRef
because it acquires the pidfd ID.
We had this previously already with pidfd_equal(), but this amplifies
the problem.
I guess we C's "const" doesn't really work for stuff that contains
caches, that is just conceptually constant, but not actually.
Often we want to fork off a process that just hangs until we kill it,
let's add a simple flag to create one of this type, and use it at
various places.
(This is useful for the test case added in the next commit, where it's
kinda nice being able to use pidref_safe_fork_full() and acquiring a
pidref of the child in the child in one go. There's no other value in
this than a bit of synctactic sugar for that test. But otoh thre's no
good reason to prohibit FORK_WAIT use like this, hence either way, this
commit should be a good thing.)
This makes sure when we are blocking signals in preparation for fork()
we'll not temporarily unblock any signals previously set, by mistake.
It's safe for us to block more, but not to unblock signals already
blocked. Fix that.
Fixes: #35470
Previously, if pid == 0 and we're PID 1, get_process_ppid()
would set ret to getppid(), i.e. 0, which is inconsistent
when pid is explicitly set to 1. Ensure we always handle
such case by returning -EADDRNOTAVAIL.
This new setting allows unsharing the pid namespace in a unit. Because
you have to fork to get a process into a pid namespace, we fork in
systemd-executor to get into the new pid namespace. The parent then
sends the pid of the child process back to the manager and exits while
the child process continues on with the rest of exec_invoke() and then
executes the actual payload.
Communicating the child pid is done via a new pidref socket pair that is
set up on manager startup.
We unshare the PID namespace right before the mount namespace so we
mount procfs correctly. Note PrivatePIDs=yes always implies MountAPIVFS=yes
to mount procfs.
When running unprivileged in a user session, user namespace is set up first
to allow for PID namespace to be unshared. However, when running in
privileged mode, we unshare the user namespace last to ensure the user
namespace does not own the PID namespace and cannot break out of the sandbox.
Note we disallow Type=forking services from using PrivatePIDs=yes since the
init proess inside the PID namespace must not exit for other processes in
the namespace to exist.
Note Daan De Meyer did the original work for this commit with Ryan Wilson
addressing follow-ups.
Co-authored-by: Daan De Meyer <daan.j.demeyer@gmail.com>
Setting this flag is a noop without a corresponding call to
posix_spawnattr_setsigdefault.
If we call posix_spawnattr_setsigdefault with a full signal set,
it causes glibc's posix_spawn implementation to call sigaction 63 times,
once for each signal. That seems wasteful.
This feature is really only useful for signals which have their
disposition set to SIG_IGN. Otherwise the dispostion gets set to
SIG_DFL automatically, either by clone(CLONE_CLEAR_SIGHAND) or the
subsequent execve.
As far as I can tell, systemd does not have any signals set to SIG_IGN
under normal operating conditions.
Follow-up for 7ac58157ca
With the mentioned commit, iff E2BIG we'd retry pidfd_spawn()
with POSIX_SPAWN_SETCGROUP disabled. However, the same strategy
should actually apply to EOPNOTSUPP/ENOSYS/EPERM too -
they can mean two things here: no clone3() or no CLONE_PIDFD.
Therefore, let's first try clone() + CLONE_PIDFD, and fall further back
to plain clone() (posix_spawn()) only as last resort. Plus, record
the fact so that we don't unnecessarily retry every single time
if CLONE_PIDFD is the one that's unavailable.
In some kernels (specifically, 5.4) even though the clone3 syscall is
supported, setting CLONE_INTO_CGROUP is not. The error message returned
in this case is E2BIG.
If posix_spawn_wrapper encounters this error, it does not retry, and
cannot spawn any programs in said kernels.
This commit adds a check for the E2BIG error and retries pidfd_spawn()
without the POSIX_SPAWN_SETCGROUP flag.
If we encounter an E2BIG error, and the pidfd_spawn() succeeds after
removing the POSIX_SPAWN_SETCGROUP flag, then we cache the result so
that we do not retry every time.
Originally, this issue was reported in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1077204.
Signed-off-by: Kornilios Kourtis <kornilios@gmail.com>
Before this commit, the "Cannot raise nice level" branch
is rather confusing, as we're actually lowering the nice.
Also, it's better to log about the final nice value
for both cases, no matter whether we need to set to limit
or not.
Prompted by #32259
We already have this check in exec_invoke(), i.e. child.
But if CLONE_INTO_CGROUP is used, the failure would
occur on parent's side, so do the check there too.
This is useful for situations where an array of FDs is to be passed into
a child process (i.e. by passing it through safe_fork). This function
can be called in the child (before calling exec) to pack the FDs to all
be next to each-other starting from SD_LISTEN_FDS_START (i.e. 3)
The personality() syscall returns a 32-bit value where the top three
bytes are reserved for flags that emulate historical or architectural
quirks, and only the least significant byte reflects the actual
personality we're interested in (in opinionated_personality()).
Use the newly defined mask in the corresponding test as well, otherwise
the test fails on some more "exotic" architectures that set some of the
"quirk" flags:
~# uname -m
armv7l
~# build/test-seccomp
...
/* test_lock_personality */
current personality=0x0
safe_personality(PERSONALITY_INVALID)=0x800000
Assertion '(unsigned long) safe_personality(current) == current' failed at src/test/test-seccomp.c:970, function test_lock_personality(). Aborting.
lockpersonalityseccomp terminated by signal ABRT.
Assertion 'wait_for_terminate_and_check("lockpersonalityseccomp", pid, WAIT_LOG) == EXIT_SUCCESS' failed at src/test/test-seccomp.c:996, function test_lock_personality(). Aborting.
Aborted (core dumped)
See: personality(2) and comments in sys/personality.h
posix_spawnattr_setflags() doesn't OR the input to the current set of flags,
it overwrites them, so we are currently losing POSIX_SPAWN_SETSIGDEF.
Follow-up for: 6ecdfe7d10