This extends the functionality introduced in #21648 to allow using
addresses chosen from a delegated prefix as well as the existing
SLAAC/LL/DHCP functionality.
It turns out checking sysfs is not 100% reliable to figure out whether
the firmware had TPM2 support enabled or not. For example with EDK2 arm64, the
default upstream build config bundles TPM2 support with SecureBoot support,
so if the latter is disabled, TPM2 is also unavailable. But still, the ACPI
TPM2 table is created just as if it was enabled. So /sys/firmware/acpi/tables/TPM2
exists and looks correct, but there are no measurements, neither the firmware
nor the loader/stub can do them, and /sys/kernel/security/tpm0/binary_bios_measurements
does not exist.
The loader can use the apposite UEFI protocol to check, which is a more
definitive answer. Given userspace can also make use of this information, export
the bitmask with the list of active banks as-is. If it's not 0, then we can be
sure a working TPM2 was available in EFI mode.
Partially fixes https://github.com/systemd/systemd/issues/38071
Aside from the usual boilerplate of moving the shared logic to shared/,
we also rework the implementation of --bind-user= to be similar to what
we'll do in systemd-vmspawn. Instead of messing with the nspawn container
user namespace, we use idmapped mounts to map the user's home directory on
the host to the mapped uid in the container.
Ideally we'd also use the "userdb.transient" credentials to provision the
user records, but this would only work for booted containers, whereas the
current logic works for non-booted containers as well.
Aside from being similar to how we'll implement --bind-user= in vmspawn,
using idmapped mounts also allows supporting --bind-user= without having to
use --private-users=.
Systemd-stub supports loading addons, credentials, system and
configuration
extensions from ESP and while addons and credentials can be both global
and
per-UKI, sysext/confext are only per-UKI.
Add support for global sysext/confext to systemd-stub/systemd-sysext.
Fixes#37993
This adds missing glue to reasonably allow unpriv users VMs/containers
to register with the system machined.
This primarily adds two things:
1. machined can now properly track VMs/containers residing in subcgroups
of units, because that's effectively what happens for per-user
VMs/containers: they are placed below the system unit `user@….service`
in some user unit.
2. machines registered with machined now have an owning UID: users can
operate on their own machines withour re-authentication, but not on
others.
Note that this is only a first step regarding machined's hookup of
nspawn/vmspawn in the long run for unpriv operation.
I think eventually we should make it so that there's both a per-user and
a per-system machined instance (so far, and even with this PR there's
still one per-system instance), and per-user containers/VMs would
registering with *both*. Having two instances makes sense I think,
because it would mean we can make machined reasonably manage the
per-user image discovery, and also do the per-system network/hostname
handling.
This mimics the switch of the same name from nspawn: it controls whether
we expect a READY=1 message from the payload or not. Previously we'd
always expect that. This makes it configurable, just like it is in
nspawn.
There's one fundamental difference in behaviour though: in nspawn it
defaults to off, in vmspawn it defaults to on. (for historical reasons,
ideally we'd default to on in both cases, but changing is quite a compat
break both directly and indirectly: since timeouts might get triggered).
This beefs up the cgroup logic, adding --slice=, --property= to vmspawn
the same way it already exists in nspawn.
There are a bunch of differences though: we don't delegate the cgroup
access in the allocated unit (since qemu wouldn't need that), and we do
registration via varlink not dbus. Hence, while this follows a similar
logic now, it differs in a lot of details.
This makes in particular one change: when invoked on the command line
we'll only add the qemu instance to the allocated scope, not the vmspawn
process itself (this follows more closely how nspawn does this where
only the container payload has its scope, not nspawn itself). This is
quite tricky to implement: unlike in nspawn we have auxiliary services
to start, with depencies to the scope. This means we need to start the
scope early, so that we know the scope's name. But the command line to
invoke is only assembled from the data we learn about the auxiliary
services, hence much later. To addres we'll now fork off the child that
eventually will become early, then move it to a scope, prepare the
cmdline and then very late send the cmdline (and the fds we want to
pass) to the prepared child, which then execs it.
So far, machined strictly tracked the "leader" process of a machine,
i.e. the topmost process that is actually the payload of the machine.
Its runtime also defines the runtime of the machine, and we can directly
interact with it if we need to, for example for containers to join the
namespaces, or kill it.
Let's optionally also track the "supervisor" process of a machine, i.e.
the host process that manages the payload if there is one. This is
generally useful info, but in particular is useful because we might need
to communicate with it to shutdown a machine without cooperation of the
payload. Traditionally we did this by simply stopping the unit of the
machine, but this is not doable now that the host machined can be used
to track per-user machines.
In the long run we probably want a more bespoke protocol between
machined and supervisors (so that we can execute other commands too,
such as request cooperative reboots/shutdowns), but that's for later.
Some environments call the concept "monitor" rather than "supervisor" or
use some other term. I stuck to "supervisor" because nspawn uses this,
and ultimately one name is as good as another.
And of course, in other implementations of VM managers of containers
there might not be a single process tracking each VM/container. Because
of this, the concept of a supervisor is optional.
Now that unpriv clients can register machines, let's register their UID
too. This allows us to do two things:
1. make sure the scope delegation is assigned to the right UID (so that
the unpriv user can actually create cgroups below the delegated
scope)
2. permit certain types of access (i.e. killing, or pty access) to the
client without auth if it owns the machine.
Systemd-stub support loading addons, credentials, system and configuration
extensions from ESP and while addons and credentials can be both global and
per-UKI, sysext/confext are only per-UKI.
Add support for loading ESP/loader/credentials/*.{sysext,confext}.raw to
systemd-stub.
Note: for backwards compatibility reasons, per-UKI sysexts can also be
*.raw (not only *.sysext.raw) but as global extensions are new, there's
no need to bring this legacy there.
Continuation of #37960.
The same concern as expalined in #37960 exists also in
missing_syscall.h. If we use enough new glibc, a function we want to use
may be already provided by glibc, but our baseline glibc may not. And it
is hard to detect in our daily development.
This moves all prototypes of syscalls to relevant headers, and missing
syscall functions are defined in relevant .c files of libc wrapper. This
way, we can use usual header as is, e.g. when we want to write code with
`move_mount()`, we can simply use sys/mount.h without checking if it is
supported by our baseline glibc.
- pass our system include directories to make generators use our libc
wrappers and latest kernel headers,
- include relevant headers in generated gperf file,
- use files() rather than find_program(), as the result of
find_program() cannot be passed to 'input' of custom_target(),
- move generate-bpf-delegate-configs.py to src/core/, as it is only used
by libcore.
Add a small paragraph explaining how BPF token works, how it's being
created and its relationship between the BPF filesystem.
Move all the relevant documentation in the PrivateBPF= section and let
point all the BPFDelegate* options to that one.
To implement --bind-user in systemd-vmspawn, we need a transient
version of these credentials. These are useful when the home directory
of the user is mounted into the container/vm and every trace of the user
will be (mostly) gone again when the container/vm is shut down.
closes#37602, see there for extra motivation and considered
alternatives.
On typical systems, only few services need to create SUID/SGID files.
This often is limited to the user explicitly setting suid/sgid, the
`systemd-tmpfiles*` services, and the package manager. Allowing a
default to globally restrict creation of suid/sgid files makes it easier
to apply this restriction precisely.
## testing done
- built on aarch64-linux and x86_64-linux
- ran a VM test on x86_64-linux, checking for:
- VM system boots successfully
- defaults apply (both `yes`, `no`, and undefined)
- systemd tmpfiles can set suid/sgid on journal log path
- Other services explicitly defining `RestrictSUIDSGID=no` can create
suid files
Add a new option PrivateBPF= to mount a new instance of bpffs within a
namespace.
PrivateBPF= can be set to "no" to use the host bpffs in readonly mode
and "yes" to do a new mount.
The mount is done with the new fsopen()/fsmount() API because in future
we'll hook some commands between the two calls.
Based on https://github.com/systemd/systemd/issues/7820, this adds support for
quota enforcement to State, Cache, and Log exec directories.
* Add new directives, StateDirectoryQuota=, CacheDirectoryQuota=, and
LogDirectoryQuota=, to define quotas as percentages (hard limits for
blocks and inodes) or absolute values (hard limits for blocks only).
* Add new directives, StateDirectoryQuotaAccounting=,
CacheDirectoryQuotaAccounting= and LogDirectoryQuotaAccounting= to keep
track of storage quotas but not enforce them (effectively just assigning
a project ID to defined exec directories).
Example:
```
StateDirectory=quotadir
StateDirectoryQuota=1%
Jan 06 22:55:46 abeltran: Storage quotas set for /var/lib/private/quotadir. Block limit = 2639404, inode limit = 671088
root@abeltran:/var/lib/private# lsattr -pR
3153000189 --------------e----P-- ./quotadir
root@abeltran:/var/lib/private# repquota -P /datadrive
*** Report for project quotas on device /dev/sdc1
Block grace time: 7days; Inode grace time: 7days
Block limits File limits
Project used soft hard grace used soft hard grace
----------------------------------------------------------------------
#0 -- 213200 0 0 4086 0 0
#3153000189 -- 2639404 0 2639404 2 0 671088
```