Although this slightly more verbose it makes it much easier to reason
about. The code that produces the tests heavily benefits from this.
Test lists are also now sorted by test name.
When using --private-users, we have to create bind mount points as
the user that will become root in the user namespace, so let's take
that into account.
If we're in a user namespace but not unsharing the network namespace,
we won't be able to bind any privileged ports even with
CAP_NET_BIND_SERVICE, so let's drop it from the retained capabilities
so services can condition themselves on that.
Let's port one more over.
Note that this changes behaviour of file_in_same_dir() in some regards.
Specifically, a trailing slash of the input path will be treated
differently: previously we'd operate below that dir then, instead of the
parent. I think that makes little sense however, and I think the code
using this function doesn't expect that either.
Moroever, addresses some corner cases if the path is specified as "/" or
".", i.e. where e cannot extract a parent. These will now be treated as
error, which I think is much cleaner.
-1 was used everywhere, but -EBADF or -EBADFD started being used in various
places. Let's make things consistent in the new style.
Note that there are two candidates:
EBADF 9 Bad file descriptor
EBADFD 77 File descriptor in bad state
Since we're initializating the fd, we're just assigning a value that means
"no fd yet", so it's just a bad file descriptor, and the first errno fits
better. If instead we had a valid file descriptor that became invalid because
of some operation or state change, the other errno would fit better.
In some places, initialization is dropped if unnecessary.
This is an octal number. We used the 0 prefix in some places inconsistently.
The kernel always interprets in base-8, so this has no effect, but I think
it's nicer to use the 0 to remind the reader that this is not a decimal number.
Also stop stashing the kmsg fifo fd in the socket. Just retrieve it in
the parent and have the parent hold on to it.
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Before we supported pivot_root() nspawn used to make the rootfs shared
before setting up the mount tunnel. So it was safe for it to just turn
it into a dependent mount during setup.
However, we cannot do this anymore because of the requirements
pivot_root() has. After the pivot_root() we will make the rootfs shared
recursively. If we turned the mount tunnel into dependent mount before
mount_switch_root() this will have the consequence that it becomes a
shared mount within the same peer group as the rootfs. So no mounts will
propagate into the container from the host anymore.
To fix this we split setting up the mount tunnel and making it active
into two steps. Setting up the mount tunnel is performed before
mount_switch_root() and activating it afterwards. Note that this works
because turning a shared mount into a shared mount is a nop. IOW, no new
peer group will be allocated.
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
In order to mount procfs and sysfs in an unprivileged container the
kernel requires that a fully visible instance is already present in the
target mount namespace. Mount one here so the inner child can mount its
own instances. Later we umount the temporary instances created here
before we actually exec the payload. Since the rootfs is shared the
umount will propagate into the container. Note, the inner child wouldn't
be able to unmount the instances on its own since it doesn't own the
originating mount namespace. IOW, the outer child needs to do this.
So far nspawn didn't run into this issue because it used MS_MOVE which
meant that the shadow mount tree pinned a procfs and sysfs instance
which the kernel would find. The shadow mount tree is gone with proper
pivot_root() semantics.
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
In order to support pivot_root() we need to move mount propagation
changes after the pivot_root(). While MS_MOVE requires the source mount
to not be a shared mount pivot_root() also requires the target mount to
not be a shared mount. This guarantees that pivot_root() doesn't leak
any mounts.
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Curently, these two flags were implied by dissect_loop_device(), but
that's not right, because this means systemd-gpt-auto-generator will
dissect the root block device with these flags set and that's not
desirable: the generator should not cause the partition devices to be
created (we don't intend to use them right-away after all, but expect
udev to find/probe them first, and then mount them though .mount units).
And there's no point in opening the partition devices, since we do not
intend to mount them via fds either.
Hence, rework this: instead of implying the flags, specify them
explicitly.
While we are at it, let's also rename the flags to make them more
descriptive:
DISSECT_IMAGE_MANAGE_PARTITION_DEVICES becomes
DISSECT_IMAGE_ADD_PARTITION_DEVICES, since that's really all this does:
add the partition devices via BLKPG.
DISSECT_IMAGE_OPEN_PARTITION_DEVICES becomes
DISSECT_IMAGE_PIN_PARTITION_DEVICES, since we not only open the devices,
but keep the devices open continously (i.e. we "pin" them).
Also, drop the DISSECT_IMAGE_BLOCK_DEVICE combination flag, since it is
misleading, i.e. it suggests it was appropriate to specify on all
dissected blocking devices, but that's precisely not the case, see the
systemd-gpt-auto-generator case. My guess is that the confusion around
this was actually the cause for this bug we are addressing here.
Fixes: #25528
We only allow a selected subset of syscalls from nspawn containers
and don't list any time64 variants (needed for 32-bit arches when
built using TIME_BITS=64, which is relatively new).
We allow sched_rr_get_interval which cpython's test suite makes
use of, but we don't allow sched_rr_get_interval_time64.
The test failures when run in an arm32 nspawn container on an arm64 host
were as follows:
```
======================================================================
ERROR: test_sched_rr_get_interval (test.test_posix.PosixTester.test_sched_rr_get_interval)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/var/tmp/portage/dev-lang/python-3.11.0_p1/work/Python-3.11.0/Lib/test/test_posix.py", line 1180, in test_sched_rr_get_interval
interval = posix.sched_rr_get_interval(0)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
PermissionError: [Errno 1] Operation not permitted
```
Then strace showed:
```
sched_rr_get_interval_time64(0, 0xffbbd4a0) = -1 EPERM (Operation not permitted)
```
This appears to be the only time64 syscall that isn't already included one of
the sets listed in nspawn-seccomp.c that has a non-time64 variant. Checked
over each of the time64 syscalls known to systemd and verified that none
of the others had a non-time64-variant whitelisted in nspawn other than
sched_rr_get_interval.
Bug: https://bugs.gentoo.org/880131
The name "def.h" originates from before the rule of "no needless abbreviations"
was established. Let's rename the file to clarify that it contains a collection
of various semi-related constants.
I wanted to move saved_arg[cv] to process-util.c+h, but this causes problems:
process-util.h includes format-util.h which includes net/if.h, which conflicts
with linux/if.h. So we can't include process-util.h in some files.
But process-util.c is very long anyway, so it seems nice to create a new file.
rename_process(), invoked_as(), invoked_by_systemd(), and argv_looks_like_help()
which lived in process-util.c refer to saved_argc and saved_argv, so it seems
reasonable to move them to the new file too.
util.c is now empty, so it is removed. util.h remains.