This allows a service to reuse the user namespace created for an
existing service, similarly to NetworkNamespacePath=. The configuration
of the initial user namespace (e.g. ID mapping) is preserved.
We already have LOG_CONTEXT_PUSH_EXEC(), which with two additions
does exactly the same as the custom logging macros, so let's get rid
of the custom logging macros and use LOG_CONTEXT_PUSH_EXEC() instead.
Currently there are various circular dependencies between headers
in core/. Let's get rid of these by making judicious use of forward
declarations and moving includes into implementation files instead of
having them in header files.
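A minimal single-file sketch of the forward-declaration pattern described
above. The struct layouts and the unit_manager_name() helper are hypothetical
and only illustrate the idea; they are not the real definitions from core/.

```c
#include <assert.h>
#include <stddef.h>

/* --- "unit.h" (hypothetical excerpt) ---
 * The header only stores a pointer to Manager, so a forward declaration
 * is sufficient and manager.h does not need to be included here. */
typedef struct Manager Manager;

typedef struct Unit {
        const char *id;
        Manager *manager; /* pointer to an incomplete type */
} Unit;

const char *unit_manager_name(const Unit *u);

/* --- "manager.h" (hypothetical excerpt) --- */
struct Manager {
        const char *name;
};

/* --- "unit.c" ---
 * Only the implementation file needs the full definition of Manager,
 * because only it dereferences the pointer. */
const char *unit_manager_name(const Unit *u) {
        assert(u);
        return u->manager ? u->manager->name : NULL;
}
```

Because the header never needs sizeof(Manager) or any of its fields, the
include moves into the implementation file and the cycle disappears.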
Getting rid of circular header includes simplifies the code and makes
various clang-based tooling such as iwyu work much better on our code.
The most important change is getting rid of the manager.h include in
unit.h which is possible thanks to the previous commits. We also move
the OOMPolicy and StatusType enums to unit.h to remove the need for
other unit headers to include manager.h to get access to these enums.
This introduces the LOG_ITEM() macro, which checks arbitrary formats in
log_struct().
Then, drop the _printf_ attribute from log_struct_internal(), as it does
not help much: the compiler checked only the first format string.
Hopefully, this silences false-positive warnings by Coverity.
Since commit 963b6b906e ("core: drop ambient capabilities in
user manager"), systemd running as the session manager has dropped its
ambient capabilities while retaining the other sets, which allows user
services to be started with elevated capabilities. This worked fine
until the introduction of
sd-executor. For a non-root process to be started with elevated
capabilities by a non-root parent it either needs file capabilities or
ambient capabilities in the parent process. Thus, systemd needs to allow
sd-executor to inherit its ambient capabilities and sd-executor should
drop them as systemd did before.
The ambient set is managed for both system and session managers, but
with the default set for PID#1 being empty, this code does not affect
operation of PID#1.
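For illustration, dropping the ambient set in the child can be sketched with
prctl(2) as below. The helper names are hypothetical and this is not the code
from the patch; it only shows the kernel interface involved.

```c
#include <assert.h>
#include <errno.h>
#include <sys/prctl.h>

/* Hypothetical helper: clear the whole ambient capability set of the
 * calling process. Returns 0 on success, negative errno on failure. */
static int drop_ambient_capabilities(void) {
        if (prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_CLEAR_ALL, 0, 0, 0) < 0)
                return -errno;
        return 0;
}

/* Hypothetical helper: returns 1 if the capability is in the ambient set,
 * 0 if it is not, and negative errno on failure. */
static int ambient_capability_is_set(unsigned long cap) {
        int r = prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_IS_SET, cap, 0, 0);
        return r < 0 ? -errno : r;
}
```

The ambient set is the only one that survives execve() of a binary without
file capabilities, which is why it has to be kept until the executor runs
and can only be dropped there.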
Fixes: bb5232b6a3 ("core: add systemd-executor binary")
Follow-up for 4d8b0f0f7a
After the mentioned commit, when the ExecCommand executable is missing
and the failure will be ignored by the manager, we exit with EXIT_SUCCESS
on the executor side too. This behavior, however, contradicts
systemd.service(5), which states:
> If the executable path is prefixed with "-", an exit code of the command
> normally considered a failure (i.e. non-zero exit status or abnormal exit
> due to signal) is _recorded_, but has no further effect and is considered
> equivalent to success.
and thus makes debugging unexpected failures harder. Therefore, let's still
exit with EXIT_EXEC, but skip the LOG_ERR level log message.
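A rough sketch of the resulting behavior. The struct and function names are
hypothetical, and the use of LOG_INFO for the ignored case is an assumption
made for illustration (the message only says the LOG_ERR level log is
skipped). 203 is the value documented for EXIT_EXEC in systemd.exec(5).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <syslog.h>

/* Value documented for EXIT_EXEC; named differently to mark it as a sketch. */
enum { SKETCH_EXIT_EXEC = 203 };

static_assert(SKETCH_EXIT_EXEC != EXIT_SUCCESS, "failure must stay visible");

typedef struct ExecFailureReport {
        int exit_status; /* what the executor exits with */
        int log_level;   /* level used for the accompanying log message */
} ExecFailureReport;

/* Hypothetical helper: the exit status no longer depends on whether the
 * manager will ignore the failure, only the log level does. */
static ExecFailureReport report_missing_executable(bool ignore_failure) {
        return (ExecFailureReport) {
                .exit_status = SKETCH_EXIT_EXEC,
                .log_level = ignore_failure ? LOG_INFO : LOG_ERR, /* assumption */
        };
}
```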
I was wondering why I couldn't trigger the assertion in safe_fclose()
when submitting #30251. It turned out that the static destructor was
not run at all :/
Replace main() with a minimized version of main-func.h. This also
prevents emitting negative exit codes.
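The idea can be sketched as follows. This is not the contents of
main-func.h; finish_main() and run_static_destructors() are hypothetical
stand-ins that show the two properties mentioned above: cleanup always runs,
and a negative errno-style result is never used as the process exit code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the cleanup static destructors would perform. */
static int cleanup_runs = 0;

static void run_static_destructors(void) {
        cleanup_runs++;
}

/* Hypothetical epilogue for main(): run cleanup unconditionally, then map
 * negative errno-style results to EXIT_FAILURE and pass positive exit
 * statuses through unchanged. */
static int finish_main(int r) {
        run_static_destructors();
        return r < 0 ? EXIT_FAILURE : r;
}
```

Returning a negative value from main() would be truncated to 8 bits by the
kernel, so for example -1 would show up as exit status 255.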
Before this commit, between fdopen() (in parse_argv()) and fdset_remove(),
the serialization fd is owned by both the arg_serialization FILE stream and
the fdset. Therefore, if something goes wrong between the two calls, or if
--deserialize= is specified more than once, we end up closing the
serialization fd twice. Normally this doesn't matter much, but I still
think it's better to fix this.
Hence, let's call fdset_new_fill() after parsing the serialization fd.
We set the fd to CLOEXEC in parse_argv(), so it will be filtered
when the fdset is created.
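The filtering relied on here can be illustrated with plain fcntl(2). The two
helpers below are hypothetical and do not reproduce fdset_new_fill(); they
only show how an fd marked FD_CLOEXEC can be told apart from the fds that
were passed in for the service.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: mark an fd close-on-exec. */
static int fd_set_cloexec(int fd) {
        int flags = fcntl(fd, F_GETFD);
        if (flags < 0)
                return -errno;
        if (fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0)
                return -errno;
        return 0;
}

/* Hypothetical helper: returns 1 if the fd should be collected into the
 * set (no FD_CLOEXEC), 0 if it is filtered out, negative errno on error. */
static int fd_should_collect(int fd) {
        int flags = fcntl(fd, F_GETFD);
        if (flags < 0)
                return -errno;
        return !(flags & FD_CLOEXEC);
}
```

With this ordering the serialization fd has exactly one owner, the FILE
stream, so it is closed exactly once.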
While at it, also move fdset_new_fill() under the second log_open(), so
that we always log to the log target specified in arguments.
log_setup() will open the console in systemd-executor because it's
not pid 1 and it's not connected to the journal. So if the log target
is later changed to kmsg, we have to reopen the log.
But since log_open() won't open the same log twice, let's just call it
unconditionally since it will be a noop if we try to reopen the same log.
This makes sure that systemd-executor will log to the log target passed
via --log-target= after parsing arguments.
Loading the SELinux DB on every invocation is slow, taking
2ms-10ms, so do not initialize it unconditionally, but
wait for the first use. On a mkosi Fedora rawhide image, this
cuts the number of loads in half.
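The lazy-initialization pattern can be sketched as below. All names are
hypothetical and label_db_load() merely stands in for the expensive SELinux
database load; the real code paths differ.

```c
#include <assert.h>
#include <stdbool.h>

static bool label_db_loaded = false;
static unsigned label_db_load_count = 0;

/* Hypothetical stand-in for the expensive database load. */
static int label_db_load(void) {
        label_db_load_count++;
        return 0;
}

/* Load the database on first use only; later calls are cheap no-ops. */
static int label_db_ensure_loaded(void) {
        int r;

        if (label_db_loaded)
                return 0;

        r = label_db_load();
        if (r < 0)
                return r;

        label_db_loaded = true;
        return 0;
}

/* Hypothetical consumer: every function that needs the database calls
 * label_db_ensure_loaded() first instead of relying on early setup. */
static int label_lookup(const char *path) {
        int r;

        assert(path);

        r = label_db_ensure_loaded();
        if (r < 0)
                return r;

        /* A real implementation would consult the database here. */
        return 0;
}
```

Invocations that never reach a consumer such as label_lookup() never pay
for the load at all.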
We use it for more than just pipe() arrays. For example also for
socketpair(). Hence let's give it a generic name.
Also add EBADF_TRIPLET to mirror this for things like
stdin/stdout/stderr arrays, which we use a bunch of times.
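A short usage sketch. The macro definitions below are written from memory of
the systemd tree and EBADF_PAIR is assumed to be the new generic name; the
helper functions are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <unistd.h>

/* Initializers that mark every slot as "no fd" (assumed definitions). */
#define EBADF_PAIR { -EBADF, -EBADF }
#define EBADF_TRIPLET { -EBADF, -EBADF, -EBADF }

/* Hypothetical helper: close every valid fd in the array and reset it. */
static void close_many(int *fds, size_t n) {
        for (size_t i = 0; i < n; i++)
                if (fds[i] >= 0) {
                        close(fds[i]);
                        fds[i] = -EBADF;
                }
}

/* Hypothetical helper: the pair initializer works for socketpair() just
 * as well as for pipe(), hence the generic name. */
static int make_socket_pair(int ret[2]) {
        int fds[2] = EBADF_PAIR;

        if (socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0, fds) < 0)
                return -errno;

        ret[0] = fds[0];
        ret[1] = fds[1];
        return 0;
}
```

Initializing to -EBADF rather than 0 means cleanup code can never close
stdin by accident when allocation failed half-way.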
No functional changes, only moving code that is only needed in
exec_invoke, and adding new dependencies for seccomp/selinux/apparmor/pam
in meson for the sd-executor binary.
Currently we spawn services by forking a child process, doing a bunch
of work, and then exec'ing the service executable.
There are some advantages to this approach:
- quick: we immediately have access to the enormous amount of
state simply by virtue of sharing the memory with the parent
- easy to refactor and add features
- part of the same binary, will never be out of sync
There are however significant drawbacks:
- doing work after fork and before exec is outside what glibc supports
for several APIs we call
- copy-on-write trap: anytime any memory is touched in either parent
or child, a copy of that page will be triggered
- memory footprint of the child process will be memory footprint of
PID1, but using the cgroup memory limits of the unit
The last issue is especially problematic on resource constrained
systems where hard memory caps are enforced and swap is not allowed.
As soon as PID1 is under load, with no page out due to no swap, and a
service with a low MemoryMax= tries to start, hilarity ensues.
Add a new systemd-executor binary, that is able to receive all the
required state via memfd, deserialize it, prepare the appropriate
data structures and call exec_child.
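Passing state through a memfd can be sketched as follows. The key=value
line format and both helper names are assumptions made for illustration;
the real serialization format is not shown here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: write one "key=value" line into a fresh memfd and
 * rewind it. Returns the fd, or negative errno on failure. */
static int serialize_state(const char *key, const char *value) {
        int fd = memfd_create("executor-state", MFD_CLOEXEC);
        if (fd < 0)
                return -errno;

        if (dprintf(fd, "%s=%s\n", key, value) < 0 ||
            lseek(fd, 0, SEEK_SET) < 0) {
                int e = errno;
                close(fd);
                return -e;
        }

        return fd;
}

/* Hypothetical helper: read lines from the fd and copy the value stored
 * under key into ret. Returns 0 if found, -ENOENT otherwise. */
static int deserialize_state(int fd, const char *key, char *ret, size_t size) {
        char line[256];
        size_t klen = strlen(key);
        int r = -ENOENT;

        /* Work on a duplicate so that fclose() does not close the caller's fd. */
        int copy = dup(fd);
        if (copy < 0)
                return -errno;

        FILE *f = fdopen(copy, "r");
        if (!f) {
                int e = errno;
                close(copy);
                return -e;
        }

        while (fgets(line, sizeof line, f)) {
                line[strcspn(line, "\n")] = '\0';
                if (strncmp(line, key, klen) == 0 && line[klen] == '=') {
                        snprintf(ret, size, "%s", line + klen + 1);
                        r = 0;
                        break;
                }
        }

        fclose(f);
        return r;
}
```

In the real setup the fd would be inherited by the executor process, so
MFD_CLOEXEC would have to be cleared or the fd duplicated before the exec.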
Use posix_spawn(), which uses CLONE_VM + CLONE_VFORK, to ensure there is
no copy-on-write (the same address space will be used, and the parent
process will be frozen until exec).
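A minimal sketch of spawning a helper binary with posix_spawn() and
collecting its exit status. spawn_and_wait() is a hypothetical helper and
leaves out the file actions and attributes a real caller would set up.

```c
#include <assert.h>
#include <errno.h>
#include <spawn.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/* Hypothetical helper: spawn path with argv, wait for it, and return its
 * exit status (>= 0) or a negative errno-style value. */
static int spawn_and_wait(const char *path, char *const argv[]) {
        pid_t pid;
        int status, r;

        /* posix_spawn() returns an errno value directly instead of setting errno. */
        r = posix_spawn(&pid, path, NULL, NULL, argv, environ);
        if (r != 0)
                return -r;

        do
                r = waitpid(pid, &status, 0);
        while (r < 0 && errno == EINTR);
        if (r < 0)
                return -errno;

        if (!WIFEXITED(status))
                return -EPROTO; /* killed by a signal, no exit status available */

        return WEXITSTATUS(status);
}
```

Unlike fork(), nothing runs in the child between process creation and exec
except what the library does internally, so no page of the parent is
touched and copied.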
The sd-executor binary is pinned by FD on startup, so that we can
guarantee there will be no incompatibilities during upgrades.