This will taint systemd if invoked in containers that do not have the
full 16bit range of UIDs defined.
we pretty much need uid root…nobody to be defined for a variety of
purposes, hence let's add this taint flag. Of course taints are
graceful, but it at least communicates the mess in some way...
As suggested in
8b3ad3983f (r837345892)
The define is generalized and moved to path-lookup.h, where it seems to fit
better. This allows a recursive include to be removed and in general makes
things simpler.
Some calls to lookup_path_init() were not followed by any log emission.
E.g.:
$ SYSTEMD_LOG_LEVEL=debug systemctl --root=/missing enable unit; echo $?
1
Let's add a helper function and use it in various places.
$ SYSTEMD_LOG_LEVEL=debug build/systemctl --root=/missing enable unit; echo $?
Failed to initialize unit search paths for root directory /missing: No such file or directory
1
$ SYSTEMCTL_SKIP_SYSV=1 build/systemctl --root=/missing enable unit; echo $?
Failed to initialize unit search paths for root directory /missing: No such file or directory
Failed to enable: No such file or directory.
1
The repeated error in the second case is not very nice, but this is a niche
case and I don't think it's worth the trouble to trying to avoid it.
We would resolve those specifiers to the calling user/group. This is mostly OK
when done in the manager, because the manager generally operates as root
in system mode, and a non-root in user mode. It would still be wrong if
called with --test though. But in systemctl, this would be generally wrong,
since we can call 'systemctl --system' as a normal user, either for testing
or even for actual operation with '--root=…'.
When operating in --global mode, %u/%U/%g/%G should return an error.
The information whether we're operating in system mode, user mode, or global
mode is passed as the data pointer to specifier_group_name(), specifier_user_name(),
specifier_group_id(), specifier_user_id(). We can't use userdata, because
it's already used for other things.
Let's raise our supported baseline a bit: CLOCK_BOOTTIME started to work
with timerfd in kernel 3.15 (i.e. back in 2014), let's require support
for it now.
This will raise our baseline only modestly from 3.13 → 3.15.
Same idea as 03677889f0.
No functional change intended. The type of the iterator is generally changed to
be 'const char*' instead of 'char*'. Despite the type commonly used, modifying
the string was not allowed.
I adjusted the naming of some short variables for clarity and reduced the scope
of some variable declarations in code that was being touched anyway.
To notify user of kill events from systemd-oomd we now use
`SERVICE_FAILURE_OOM_KILL` as the failure result.
`unit_check_oomd_kill` now calls `notify_cgroup_oom` to
update the service result to `oom-kill`.
We add a new xattr `user.oomd_ooms` to keep track of the OOM kills
initiated by systemd-oomd, this helps us resolve a race between sending
SIGKILL to processes and checking for OOM kill status from the xattr.
Related to: #20649
Check if process(es) of a cgroup were killed by Kernel OOM killer
or systemd-oomd before we send the cgroup empty notification.
This allows us to show the right exit state(ServiceResult)
We basically had the same code in three places. Let's unify it in a
common helper function.
event_add_time_change() might be something we should add to the official
sd-event API sooner or later, given its general usefulness.
So far we set the "trusted.delegate" xattr on cgroups where delegation
is on. This duplicates this behaviour with the "user.delegate" xattr.
This has two benefits:
1. unprivileged clients can *read* the xattr. "systemd-cgls" can thus
show delegated cgroups as such properly, even when invoked without
privs
2. unprivileged systemd instances can set the xattr, i.e. when systemd
--user delegates a cgroup to further payloads.
This weakens security a tiny bit, given that code that got a cgroup
delegated can manipulate the xattr, but I think that's OK, given they
have a higher trust level regarding cgroups anyway, if they got a
subtree delegated, and access controls on the cgroup itself are still
enforced. Moreover PID 1 as the cgroup manager only sets these xattrs,
never reads them — the xattr is primarily a way to tell payloads about
the delegation, and it's strictly this one way.
Unprivileged overlayfs is supported since Linux 5.11. The only
change needed to get ExtensionDirectories to work is to avoid
hard-coding the staging directory to the system manager runtime
directory, everything else just works (TM).
This mirrors a similar check in Linux kernel 5.16
(9dcc38e2813e0cd3b195940c98b181ce6ede8f20) that raised the
RLIMIT_MEMLOCK to 8M.
This change does two things: raise the default limit for nspawn
containers (where we try to mimic closely what the kernel does), and
bump it when running on old kernels which still have the lower setting.
Fixes: #16300
See: https://lwn.net/Articles/876288/
The first ExecStartPre or the first ExecStart commands would get the metadata,
but not the subsequent ones. Also check that we do not pass it in
ExecStartPost.
This fixes the following case:
OnFailure= would be spawned correctly, but OnSuccess= would be
spawned without the MONITOR_* metadata, because we'd "collect" the unit
that started successfully. So let's block cleanup while we have a job
running for the handler. The job cannot last infinitely, so at some point
we'll be able to collect both.
We already logged what we are spawning, but not so much why. Let's
add this, so it's easier to distinguish execstartpre/execstart/execstartpost
and such.