These operations might require slow I/O, and thus might block PID1's main
loop for an undeterminated amount of time. Instead of performing them
inline, fork a worker process and stash away the D-Bus message, and reply
once we get a SIGCHILD indicating they have completed. That way we don't
break compatibility and callers can continue to rely on the fact that when
they get the method reply the operation either succeeded or failed.
To keep backward compatibility, unlike reload control processes, these
are ran inside init.scope and not the target cgroup. Unlike ExecReload,
this is under our control and is not defined by the unit. This is necessary
because previously the operation also wasn't ran from the target cgroup,
so suddenly forking a copy-on-write copy of pid1 into the target cgroup
will make memory usage spike, and if there is a MemoryMax= or MemoryHigh=
set and the cgroup is already close to the limit, it will cause an OOM
kill, where previously it would have worked fine.
Follow-up for 7ac58157ca
With the mentioned commit, iff E2BIG we'd retry pidfd_spawn()
with POSIX_SPAWN_SETCGROUP disabled. However, the same strategy
should actually apply to EOPNOTSUPP/ENOSYS/EPERM too -
they can mean two things here: no clone3() or no CLONE_PIDFD.
Therefore, let's first try clone() + CLONE_PIDFD, and fall further back
to plain clone() (posix_spawn()) only as last resort. Plus, record
the fact so that we don't unnecessarily retry every single time
if CLONE_PIDFD is the one that's unavailable.
In some kernels (specifically, 5.4) even though the clone3 syscall is
supported, setting CLONE_INTO_CGROUP is not. The error message returned
in this case is E2BIG.
If posix_spawn_wrapper encounters this error, it does not retry, and
cannot spawn any programs in said kernels.
This commit adds a check for the E2BIG error and retries pidfd_spawn()
without the POSIX_SPAWN_SETCGROUP flag.
If we encounter an E2BIG error, and the pidfd_spawn() succeeds after
removing the POSIX_SPAWN_SETCGROUP flag, then we cache the result so
that we do not retry every time.
Originally, this issue was reported in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1077204.
Signed-off-by: Kornilios Kourtis <kornilios@gmail.com>
let's instead generate ENOTTY on our own. This is more correct with out
coding style (since we generally do not propagate errors via errno), and
also addresses #34039 as side effect. (#34039 really needs to be fixed
in musl though, too, this is just a work-around as side-effect).
Fixes: #34039
glibc returs EIO on ttys that are hung up. That's not really correct,
POSIX seems to disagree.
Work around this in our code, and turn this into a clean "1", since a
hung up tty doesn't stop being a tty just because it is hung up.
Background: https://github.com/systemd/systemd/pull/34039
The kernel parses FRA_SUPPRESS_PREFIXLEN as uint32_t, but internally
handled as signed integer and negative values as unset. Let's explicitly
specify the size of the variable.
No functional change, just refactoring.
Before this commit, the "Cannot raise nice level" branch
is rather confusing, as we're actually lowering the nice.
Also, it's better to log about the final nice value
for both cases, no matter whether we need to set to limit
or not.
Follow-up for 6d2984d21b
The current semantics of "filtered" in unit_is_filtered()
are actually the contrary of ListUnitsFiltered(). Let's
make things consistent, i.e. return true when the unit
shall be included.
When running unprivileged, checking /proc/1/root doesn't work because
it requires privileges. Instead, let's add an environment variable so
the process that chroot's can tell (systemd) subprocesses whether
they're running in a chroot or not.
When pid 1 crashes, the getty unit for the console will happily keep
running which means we end up with two shells competing for the same
tty. Let's call vhangup on /dev/console to kill every other process
attached to the console before we spawn the crash shell. The getty
units have Restart=always but lucky for us, pid 1 just crashed in fire
and flames so it isn't actually able to restart the getty unit.
gcc15 has -Wunterminated-string-initialization in -Wextra and
warns about string constants that are not null terminated even though
the functions do do out of bounds access.
Silence the warnings by simply not providing an explicit size.
The s390x platform provides confidential VMs using the "Secure Execution"
technology, which is also referred to as "Protected Virtualization" or
just "prot virt" in Linux / QEMU.
This can be detected through a simple sysfs attribute.
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
We have different impls of detect_confidential_virtualization per
architecture. The detection is cached in the x86_64 impl, and as we
add support for more targets, we want to use caching for all. It thus
makes sense to split caching out into an architecture independent
method.
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
cg_kill_kernel_sigkill() has a narrow use case, and currently
no code really reaches that branch. Let's detach it from
cg_kill_recursive() hence, and call it explicitly later
where appropriate.
When removing a cgroup, we always want to eliminate subcgroups
first, i.e. use cg_trim(). And cg_rmdir() (along with
CGROUP_REMOVE flag) is simply unused. Kill it.
All messages logged from exec_spawn() are attributed to the unit
and as such we should set the log level to the unit's max log level
for the duration of the function.
The original CVM detection logic for TDX assumes that the guest can see
the standard TDX CPUID leaf. This was true in Azure when this code was
originally written, however, current Azure now blocks that leaf in the
paravisor. Instead it is required to use the same Azure specific CPUID
leaf that is used for SEV-SNP detection, which reports the VM isolation
type.
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Similar to the implementation of cgroup.kill in the kernel, let's
skip kernel threads in cg_kill_items() as trying to kill kernel
threads as an unprivileged process will fail with EPERM and doesn't
do anything when running privileged.
Currently inhibitors are bypassed unless an explicit request is made to
check for them, or even in that case when the requestor is root or the
same uid as the holder of the lock.
But in many cases this makes it impractical to rely on inhibitor locks.
For example, in Debian there are several convoluted and archaic
workarounds that divert systemctl/reboot to some hacky custom scripts
to try and enforce blocking accidental reboots, when it's not expected
that the requestor will remember to specify the command line option
to enable checking for active inhibitor locks.
Also in many cases one wants to ensure that locks taken by a user are
respected by actions initiated by that same user.
Change logind so that inhibitors checks are not skipped in these
cases, and systemctl so that locks are checked in order to show a
friendly error message rather than "permission denied".
Add new block-weak and delay-weak modes that keep the previous
behaviour unchanged.
Currently, IS_SYNTHETIC_ERRNO() evaluates to true for all negative errnos,
because of the two's-complement negative value representation.
Subsequently, ERRNO= is not logged for most of our own code.
Let's fix this, by formatting all synthetic errnos as positive.
Then, treat all negative values as non-synthetic.
While at it, mark the evaluation order explicitly, and remove
unneeded comment.
Fixes#33800