Now that we have an explicit userns check we can drop the heuristic for
it, given that it's kinda wrong (because mapping the full host UID range
into a userns is actually a thing people do).
Hence, just delete the code and only keep the userns inode check in
place.
Now that we have a reliable pidns check I don't think we really should
look for cgroupns anymore, it's too weak a check. I mean, if I myself
would implement a desktop app sandbox (like flatpak) I'd always enable
cgroupns, simply to hide the host cgroup hierarchy.
Hence drop the check.
I suggested adding this 4 years ago here:
https://github.com/systemd/systemd/pull/17902#issuecomment-745548306
- Rename to script_get_shebang_interpreter and return
-EMEDIUMTYPE if the executable is not a script.
We nowadays utilize the scheme of making ret param
of getters optional, and use them directly as checkers.
- Don't unnecessarily read the whole line, but check
only the shebang first.
Previously, if pid == 0 and we're PID 1, get_process_ppid()
would set ret to getppid(), i.e. 0, which is inconsistent
when pid is explicitly set to 1. Ensure we always handle
such case by returning -EADDRNOTAVAIL.
[ 10.056930] (journald)[104]: Failed to fork off '(sd-mkuserns)': Invalid argument
[ 10.063727] systemd[1]: systemd-modules-load.service: About to execute: /usr/lib/systemd/systemd-modules-load
[ 10.071148] (journald)[104]: Failed to fork process (sd-mkuserns): Invalid argument
safe_fork_full() already logs at debug level, so the caller shouldn't.
As reported in https://github.com/systemd/systemd/issues/35400,
on riscv64, with Linux version 6.6.51-linux4microchip+fpga-2024.09, we get:
[ 10.063727] systemd[1]: systemd-modules-load.service: About to execute: /usr/lib/systemd/systemd-modules-load
[ 10.071148] (journald)[104]: Failed to fork process (sd-mkuserns): Invalid argument
Fixes https://github.com/systemd/systemd/issues/35400.
'r' is used to make the repeated checks shorter. Without that, the long variable
name is distracting.
This is a more comprehensive fix compared to #35273. Also adds a minimal
test only.
Based on Luca's #35273 but generalizes the code a bit.
In v258 we really should get rid of the old heuristics around userns and
cgroupns detection, but given we are late in the v257 cycle this keeps
them in.
The indoe number of root pid namespace is hardcoded in the kernel to
0xEFFFFFFC since 3.8, so check the inode number of our pid namespace
if all else fails. If it's not 0xEFFFFFFC then we are in a pid
namespace, hence a container environment.
Fixes https://github.com/systemd/systemd/issues/35249
[Reworked by Lennart, to make use of namespace_is_init()]
"nsenter -a" doesn't migrate the specified process into the target
cgroup (it really should). Thus the cgroup will remain in a cgroup
that is (due to cgroup ns) outside our visibility. The kernel will
report the cgroup path of such cgroups as starting with "/../". Detect
that and print a reasonably error message instead of trying to resolve
that.
The auditing subsystem is still not virtualized for containers, hence
the two values don't really make sense inside them, they will just leak
information from outside into the container. Hence don't make use of the
data if we detect we are run inside of a container.
This has visible effects: logind will no longer try to reuse the
auditing session ids as its own session ids when run inside a container.
While are at it, modernize the calls in more ways:
1. switch to pidref behaviour, all but one of our uses are using pidref
anyway already.
2. use read_virtual_file() + proc_mounted()
3. reasonably distinguish ENOENT errors when reading the process proc
files: distinguish the case where /proc is not mounted, from the case
where the process is already gone, from where auditing is not enabled in
the kernel build.
The auditing subsystem is still not virtualized for containers, hence the two
values don't really make sense inside them, they will just leak
information from outside into the container. Hence don't make use of the
data if we detect we are run inside of a container.
This has visible effects: logind will no longer try to reuse the
auditing session ids as its own session ids when run inside a container.
While are at it, modernize the calls in more ways:
1. switch to pidref behaviour, all but one of our uses are using pidref
anyway already.
2. use read_virtual_file() + proc_mounted()
3. reasonable distinguish ENOENT errors when reading the process proc
files: distinguish the case where /proc is not mounted, from the case
where the process is already gone, from where auditing is not enabled
in the kernel build.
A bit confusingly CONTAINER_UID_BASE_MAX is just the maximum *base* UID
for a container. Thus, with the usual 64K UID assignments, the last
actual container UID is CONTAINER_UID_BASE_MAX+0xFFFF.
To make this less confusing define CONTAINER_UID_MIN/MAX that add the
missing extra space.
Also adjust two uses where this was mishandled so far, due to this
confusion.
With this change the UID ranges we default to should properly match what
is documented on https://systemd.io/UIDS-GIDS/.