Let's not leak details from src/shared and src/libsystemd into
src/basic, even though you can't actually do anything useful with
just forward declarations from src/shared.
The sd-forward.h header is put in src/libsystemd/sd-common as we
don't have a directory for shared internal headers for libsystemd
yet.
Let's also rename forward.h to basic-forward.h to keep things
self-explanatory.
Let's switch things around, and move the internals of safe_fork_full() into
pidref_safe_fork_full() and make safe_fork_full() a trivial wrapper on top
of pidref_safe_fork_full().
Note that this drops a lot of "const" qualifiers on PidRef arguments.
That's because pidref_is_self() suddenly might end changing the PidRef
because it acquires the pidfd ID.
We had this previously already with pidfd_equal(), but this amplifies
the problem.
I guess we C's "const" doesn't really work for stuff that contains
caches, that is just conceptually constant, but not actually.
Often we want to fork off a process that just hangs until we kill it,
let's add a simple flag to create one of this type, and use it at
various places.
This new setting allows unsharing the pid namespace in a unit. Because
you have to fork to get a process into a pid namespace, we fork in
systemd-executor to get into the new pid namespace. The parent then
sends the pid of the child process back to the manager and exits while
the child process continues on with the rest of exec_invoke() and then
executes the actual payload.
Communicating the child pid is done via a new pidref socket pair that is
set up on manager startup.
We unshare the PID namespace right before the mount namespace so we
mount procfs correctly. Note PrivatePIDs=yes always implies MountAPIVFS=yes
to mount procfs.
When running unprivileged in a user session, user namespace is set up first
to allow for PID namespace to be unshared. However, when running in
privileged mode, we unshare the user namespace last to ensure the user
namespace does not own the PID namespace and cannot break out of the sandbox.
Note we disallow Type=forking services from using PrivatePIDs=yes since the
init proess inside the PID namespace must not exit for other processes in
the namespace to exist.
Note Daan De Meyer did the original work for this commit with Ryan Wilson
addressing follow-ups.
Co-authored-by: Daan De Meyer <daan.j.demeyer@gmail.com>
The calls are now unused, and we generally prefer if people send a PID
triplet rather than a single PID, hence stop supporting a high-level
dispacher for pid_t.
The PID_AUTOMATIC value is now properly recognized by the PidRef logic
too. This needed some massaging of header includes, to ensure pidref.h
can access process-util.h's definitions and vice versa.
This is useful for situations where an array of FDs is to be passed into
a child process (i.e. by passing it through safe_fork). This function
can be called in the child (before calling exec) to pack the FDs to all
be next to each-other starting from SD_LISTEN_FDS_START (i.e. 3)
The personality() syscall returns a 32-bit value where the top three
bytes are reserved for flags that emulate historical or architectural
quirks, and only the least significant byte reflects the actual
personality we're interested in (in opinionated_personality()).
Use the newly defined mask in the corresponding test as well, otherwise
the test fails on some more "exotic" architectures that set some of the
"quirk" flags:
~# uname -m
armv7l
~# build/test-seccomp
...
/* test_lock_personality */
current personality=0x0
safe_personality(PERSONALITY_INVALID)=0x800000
Assertion '(unsigned long) safe_personality(current) == current' failed at src/test/test-seccomp.c:970, function test_lock_personality(). Aborting.
lockpersonalityseccomp terminated by signal ABRT.
Assertion 'wait_for_terminate_and_check("lockpersonalityseccomp", pid, WAIT_LOG) == EXIT_SUCCESS' failed at src/test/test-seccomp.c:996, function test_lock_personality(). Aborting.
Aborted (core dumped)
See: personality(2) and comments in sys/personality.h
This combines safe_fork() with pidref_set_pid().
Eventually we really should switch this to use CLONE_PIDFD, but as that
is not wrapped by glibc yet, it's hard. But this is not crucial anyway,
as a child we just forked off can always safely be referenced also by
PID, given the reaping is under our own control.
A simple test case is added in a follow-up commit.
Sometimes it makes sense to hard kill a client if we die. Let's hence
add a third FORK_DEATHSIG flag for this purpose: FORK_DEATHSIG_SIGKILL.
To make things less confusing this also renames FORK_DEATHSIG to
FORK_DEATHSIG_SIGTERM to make clear it sends SIGTERM. We already had
FORK_DEATHSIG_SIGINT, hence this makes things nicely symmetric.
A bunch of users are switched over for FORK_DEATHSIG_SIGKILL where we
know it's safe to abort things abruptly. This should make some kernel
cases more robust, since we cannot get confused by signal masks or such.
While we are at it, also fix a bunch of bugs where we didn't take
FORK_DEATHSIG_SIGINT into account in safe_fork()
For a given PID and namespace type, this helper function gives the PID
of the leader of the namespace containing the given PID. Use this in
systemd-coredump instead of using the existing get_mount_namespace_leader.
This helper will be used again in a later commit.
This wraps glibc's clone() but deals with the 'stack' parameter in a
sensible way. Only supports invocations without CLONE_VM, i.e. when
child is a CoW copy of parent.