This is a more comprehensive fix compared to #35273. Also adds a minimal
test only.
Based on Luca's #35273 but generalizes the code a bit.
In v258 we really should get rid of the old heuristics around userns and
cgroupns detection, but given we are late in the v257 cycle this keeps
them in.
The indoe number of root pid namespace is hardcoded in the kernel to
0xEFFFFFFC since 3.8, so check the inode number of our pid namespace
if all else fails. If it's not 0xEFFFFFFC then we are in a pid
namespace, hence a container environment.
Fixes https://github.com/systemd/systemd/issues/35249
[Reworked by Lennart, to make use of namespace_is_init()]
"nsenter -a" doesn't migrate the specified process into the target
cgroup (it really should). Thus the cgroup will remain in a cgroup
that is (due to cgroup ns) outside our visibility. The kernel will
report the cgroup path of such cgroups as starting with "/../". Detect
that and print a reasonably error message instead of trying to resolve
that.
The auditing subsystem is still not virtualized for containers, hence
the two values don't really make sense inside them, they will just leak
information from outside into the container. Hence don't make use of the
data if we detect we are run inside of a container.
This has visible effects: logind will no longer try to reuse the
auditing session ids as its own session ids when run inside a container.
While are at it, modernize the calls in more ways:
1. switch to pidref behaviour, all but one of our uses are using pidref
anyway already.
2. use read_virtual_file() + proc_mounted()
3. reasonably distinguish ENOENT errors when reading the process proc
files: distinguish the case where /proc is not mounted, from the case
where the process is already gone, from where auditing is not enabled in
the kernel build.
The auditing subsystem is still not virtualized for containers, hence the two
values don't really make sense inside them, they will just leak
information from outside into the container. Hence don't make use of the
data if we detect we are run inside of a container.
This has visible effects: logind will no longer try to reuse the
auditing session ids as its own session ids when run inside a container.
While are at it, modernize the calls in more ways:
1. switch to pidref behaviour, all but one of our uses are using pidref
anyway already.
2. use read_virtual_file() + proc_mounted()
3. reasonable distinguish ENOENT errors when reading the process proc
files: distinguish the case where /proc is not mounted, from the case
where the process is already gone, from where auditing is not enabled
in the kernel build.
A bit confusingly CONTAINER_UID_BASE_MAX is just the maximum *base* UID
for a container. Thus, with the usual 64K UID assignments, the last
actual container UID is CONTAINER_UID_BASE_MAX+0xFFFF.
To make this less confusing define CONTAINER_UID_MIN/MAX that add the
missing extra space.
Also adjust two uses where this was mishandled so far, due to this
confusion.
With this change the UID ranges we default to should properly match what
is documented on https://systemd.io/UIDS-GIDS/.
This new setting allows unsharing the pid namespace in a unit. Because
you have to fork to get a process into a pid namespace, we fork in
systemd-executor to get into the new pid namespace. The parent then
sends the pid of the child process back to the manager and exits while
the child process continues on with the rest of exec_invoke() and then
executes the actual payload.
Communicating the child pid is done via a new pidref socket pair that is
set up on manager startup.
We unshare the PID namespace right before the mount namespace so we
mount procfs correctly. Note PrivatePIDs=yes always implies MountAPIVFS=yes
to mount procfs.
When running unprivileged in a user session, user namespace is set up first
to allow for PID namespace to be unshared. However, when running in
privileged mode, we unshare the user namespace last to ensure the user
namespace does not own the PID namespace and cannot break out of the sandbox.
Note we disallow Type=forking services from using PrivatePIDs=yes since the
init proess inside the PID namespace must not exit for other processes in
the namespace to exist.
Note Daan De Meyer did the original work for this commit with Ryan Wilson
addressing follow-ups.
Co-authored-by: Daan De Meyer <daan.j.demeyer@gmail.com>
The previous commit removed the UINT_MAX check for the fd array. Let's
now re-add one, but at a better place, and with a more useful limit. As
it turns out the kernel does not allow passing more than 253 fds at the
same time, hence use that as limit. And do so immediately before
calculating the control buffer size, so that we catch multiplication
overflows.
The names of these conflict with macros from efi.h that we'll move
to efi-fundamental.h in a later commit. Let's avoid the conflict by
getting rid of these helpers. Arguably this also improves readability
by clearly indicating we're passing arbitrary strings and not constants
to the macros when we invoke them.
This makes use of the new TIOCGPTPEER pty ioctl() for directly opening a
PTY peer, without going via path names. This is nice because it closes a
race around allocating and opening the peer. And also has the nice
benefit that if we acquired an fd originating from some other
namespace/container, we can directly derive the peer fd from it, without
having to reenter the namespace again.
instead of passing a boolean picking the destruction method just have
different functions. That's much nicer in context of _cleanup_, and how
we usually do things.
Setting this flag is a noop without a corresponding call to
posix_spawnattr_setsigdefault.
If we call posix_spawnattr_setsigdefault with a full signal set,
it causes glibc's posix_spawn implementation to call sigaction 63 times,
once for each signal. That seems wasteful.
This feature is really only useful for signals which have their
disposition set to SIG_IGN. Otherwise the dispostion gets set to
SIG_DFL automatically, either by clone(CLONE_CLEAR_SIGHAND) or the
subsequent execve.
As far as I can tell, systemd does not have any signals set to SIG_IGN
under normal operating conditions.
OSC sequences can be closed with one of three terminators:
1. ASCII code 7, aka BEL, aka ^G, aka \x07, aka \a
2. ASCII code 156, aka \x9c
2. Pair of ASCII code 27 followed by ASCII code 92, aka \x1b\x5c
Of these, in some corner case scenarios BEL makes problem (see #34604).
Hence switch away from that wherever we use it, and prefer the \x1b\x5c
instead. That's preferable over \x9c, since the latter is also a valid
UTF-8 codepoint. See discussion here for example:
https://gist.github.com/egmontkob/eb114294efbcd5adb1944c9f3cb5feda#the-escape-sequenceFixes: #34604
Various fixes:
1. Adds O_CLOEXEC for two socketpair()s where we forgot it.
2. Uses FORK_WAIT instead of manual wait_for_terminate_and_check()
invocations.
3. Prefix opaque NULL/0 arguments with comments what they are.
4. Add a banch of assert()s, and change flag validation in
open_terminal() to be assert() (since flags mistakes are programming
errors, not runtime errors).