We need access to /dev/net/tun, hence make sure we can actually see
/dev/. Also make sure the module is properly loaded before we operate,
given that we run with limit caps. But then again give the CAP_NET_ADMIN
cap, since we need to configure the network tap/tun devices.
Follow-up for: 1365034727
This also makes initrd-cleanup.service explicitly start
initrd-switch-root.service with replace-irreversibly mode, to avoid
systemd-udevd.service being triggered by kernel events and the start
job of initrd-switch-root.service being cancelled.
Follow-ups for 676fb42aae.
Addresses https://github.com/systemd/systemd/pull/37374#issuecomment-2875990471.
The Synchronize() function is just too useful for clients, so that we
can make "systemd-run -v --user" actually useful. Hence let's make the
socket accessible without privs. Deny most method calls however, except
for the Synchronize() call.
You can configure the sysfail boot entry using the bootctl command:
$ bootctl set-sysfail sysfail.conf
The value will be stored in the `LoaderEntrySysFail` EFI variable.
The `LoaderEntrySysFail` EFI variable would be unset automatically
during next boot by `systemd-boot-clear-sysfail.service` if no
system failure occured, otherwise it would be kept as it is and a system
failure reason will be saved to `LoaderSysFailReason` EFI variable.
Signed-off-by: Igor Opaniuk <igor.opaniuk@foundries.io>
Otherwise, initrd-cleanup.service requests isolation thus the socket
is stopped before switching root, and several early events after
switching root may be lost.
We usually don't care, but here the existence of socket
is public API to a certain degree and signals availability
of the service (userdbd in particular, oomd is checked in
core-varlink.c). Hence let's be more careful and remove them
if stopped.
The current arrangement of service and socket units is
sort of all over the place. Let's clean it up a little,
roughly following the principles below:
- socket units have implicit ordering deps (not to be confused
with default ones which are subject to DefaultDependencies=)
before associated service, so drop any explicit After=
- If socket can be enabled, remember to link to it in service
via Also= and Sockets= (the latter replaces Wants=).
If the service Requires= socket however, Sockets= is omitted.
- If socket is statically enabled, no need for service
to pull it in - machined
Add two new socket units, one for each of systemd-resolved's varlink
servers:
systemd-resolved-varlink.socket
systemd-resolved-monitor.socket
Add logic to grab socket fds via sd_varlink_server_listen_name(), but
fallback to the existing sd_varlink_server_listen_address() calls if no
fds were given.
This will be used to make systemd-networkd-wait-online --dns more robust
against systemd-resolved restarts etc.
Delegation is enabled for udev so that it can mess around with the
cgroup hierarchy to avoid killing control processes when it calls
cg_kill in on_post() when it goes idle. We don't actually care about
any specific cgroup controllers in udev, so set Delegate= to enable
delegation without delegating any controllers
Follow up for https://github.com/systemd/systemd/pull/22752
This new tool looks for a three xattr on the root inode of a file system
that encode mount constraints of the file system. The tool is supposed
to be hooke into the mount logic and is supposed to protect against
misappropriating trusted file systems in unintended ways.
Consider the following scenario: we boot up on first boot and create a
tpm-locked pair of /var/ and /srv/ partitions via systemd-repart. An
attacker then offline modifies the partition table, exchanging the
metadata of the /var/ and /srv/ partition. So far we'd happily accept
that, honour the modified metadata and boot up. This could be used to
revert changes to /var/ or similar. And all that even though both
partitions are encrypted and locked to TPM!
With this new mechanism we can encode in the protected contents of the
file systems the ways it can be used: the partition type uuid, the
partition label and the intended mount point can be stored in xattrs,
and we can check them automatically on mount, and take action on
mismatch. (action would typically be immediate reboot).
PCR extensions are supposed to be useful for "destroying" the ability to
access TPM bound secrets. Hence, if for some reason we fail to extend a
PCR, it's safer to just reboot, instead of going on without the
extension, leaving secrets potentially accessible which should not be
accessible.
Note that the services exit gracefully if no TPM is found, hence this
should not be triggered on TPM-less systems. However, this enforces that
if there is a TPM that is accessible to Linux and that works properly,
the PCR measurement must complete too.
Inspired by this thread:
https://lists.freedesktop.org/archives/systemd-devel/2025-March/051244.html
Let's always prefer quotactl_fd() when it's available and use quotactl()
only as as a fallback on old kernels.
This way we can operate on the fds we typically already have open, or if
needed we can open a new one, and use for multiple fs operation.
In the long run we should really focus on operating exclusively by fd
instead of by path, by device nor or otherwise. This gets us a step
closer to that.
Let's allow providing extra userdb users and groups via credentials.
Similarly to systemd-udev-load-credentials.service, we ship
systemd-userdb-load-credentials.service which transform the JSON
user/group records provided via the corresponding credentials to static
userdb dropins in /run/userdb.
Let's allow providing extra userdb users and groups via credentials.
Similarly to systemd-udev-load-credentials.service, we ship
systemd-userdb-load-credentials.service which transform the JSON
user/group records provided via the corresponding credentials to static
userdb dropins in /etc/userdb.
Replaces #33811
This is useful when the previous invocation is unexpectedly killed.
Otherwise, if systemd-nspawn is killed forcibly, then unix-export
directory is not cleared and unmounted, and the subsequent invocation
will fail. E.g.
===
[ 18.895515] TEST-13-NSPAWN.sh[645]: + machinectl start long-running
[ 18.945703] systemd-nspawn[1387]: Mount point '/run/systemd/nspawn/unix-export/long-running' exists already, refusing.
[ 18.949236] systemd[1]: systemd-nspawn@long-running.service: Failed with result 'exit-code'.
[ 18.949743] systemd[1]: Failed to start systemd-nspawn@long-running.service.
===
oomd only works well if we have swap, hence we should not start it
before swaps are up, in particular as we will print an annoying message
otherwise.
Fixes: #36704
Now our baseline of meson is 0.62, hence install_symlink() can be used.
Note, install_symlink() implies install_emptydir() for specified
install_dir. Hence, this also drops several unnecessary
install_emptydir() calls.
Note, the function currently does not support 'relative' and 'force' flags,
so several 'ln -frsT' inline calls cannot be replaced.
It is strictly mandatory that this is done during initial
transaction, and not later when the system is already running.
Hence let's refuse manual start for all of the involved units.
Additionally, refuse manual stop for systemd-factory-reset-complete.service,
as it flags the factory reset completion through
/run/systemd/factory-reset-complete, which never gets removed
for the whole boot.
While we typically just use /dev/tpmrm0 for accessing the TPM chip (i.e
via the kernel's own resource manager), some sysfs properties that
matter are on /dev/tpm0 only (i.e. the version without the kernel TPM
resource manager). Hence, wait for both to show up in tpm2.target, so
that we can be sure the full API is available.
This matters because we want to access /sys/class/tpm/tpm0/ppi/request
in the next commit.
This introduces a bunch of facilities:
1. The factory-reset.target unit that requests a factory reset is now
complemented by factory-reset-now.target that executes it at next
boot.
2. This latter is added to the initial transaction via the new trivial
systemd-factory-reset-generator.
3. A tool systemd-factory-reset has been added to query, request,
cancel, complete factory reset operations (via EFI variables). Two of
these are wrapped into units that are plugged into
factory-reset.target and factory-reset-now.target respectively. The
tool also provides a simple Varlink API.
This should make things a lot cleaner, and both be useful as explicit
implementation on UEFI, and as template + hookpoints for alternative
implementations on non-UEFI.
Terminating the plymouth/console agents when the wall agent takes over
can happen asynchronously, after all the pw queries are async anyway and
hence can be seen by both the plymouth/console agents and the wall
agent.
By stopping the two agents with "--no-block" we add a bit of robustness,
since trouble of them exiting won't block the wall agent to start.
This addresses the issue the previous commit fixes in a different way.
Let's make sure that the moment where factory reset is requested is
visible in the TPM PCR state, so that access to secrets is terminated.
This is particulary interesting when the system is booted with
systemd.unit=factory-reset.target on the kernel command line, requesting
a factory reset on the following boot. The preparations done in
userspace should already lose access to the TPM in that case.
storagetm mode means we we are network accessible. let's lock down
access to TPM secrets in this case: let's measure a pcr "phase" string
into PCR 11.
This is good as it means that if we are exploited in this state FDE
secrets protected by TPM are likely to remain protected, since the PCR
values wouldn't allow access.
Let's make the three subsystems more alike, and add remote-*setup.traget
for all three, enable them all three in the presets, and make them
behave in a similar fashion.
This is useful for booting from a freshly downloaded disk image: just
specify
rd.systemd.pull=verify=no,machine,blockdev,raw:image:https://192.168.100.1:8081/image.raw
root=/dev/disk/by-loop-ref/image.raw-part2
on the kernel command line, and we'll download that in the initrd and boot from it.
(note the above disables download-time verification, putting trust in
verity and image policy that this won#t do harm)
Here's a more complete example. From a git checkout do:
ninja -C build && mkosi -f -T serve
and then from another terminal do within the same checkout:
./build/systemd-vmspawn \
--ram=16G \
--register=no \
-n \
-i ./build/mkosi.output/image.raw \
rd.systemd.pull=verify=no,machine,blockdev,raw:image:http://192.168.100.1:8081/image.raw \
root=/dev/disk/by-loop-ref/image.raw-part2 \
rootflags=x-systemd.device-timeout=infinity \
ip=any
This will then boot via the ESP of the specified image, then download
the image via HTTP from the mkosi instance running in the first
terminal, attach it to a loopback block device, and then use its second
partition as root fs, and boot into it.
(this assumes your host is 192.168.100.1, of course)
Note that downloading the full image takes a bit of time (this downloads
it uncompressed after all), hence we turn off the timeout to wait for
the device.
This also introduces a new "imports.target" unit (and associated
"imports-pre.target") between imports are grouped, and which ensure the
imports actually are ordered correctly both on the host and in the
initrd.
This is mostly just a friendly unit wrapper around "systemd-dissect
--attach".
This is useful so that we can automatically attach disk images as
block device at boot.
09fbff57fc introduced new knob
for such functionality. However, that seems unnecessary.
The mount option string is ubiquitous in that all of fstab,
kernel cmdline, credentials, systemd-mount, ... speak it.
And we already have x-systemd.device-bound= that's parsed
by pid1 instead of fstab-generator. It feels hence more natural
for graceful options to be an extension of that, rather than
its own property.
There's also one nice side effect that the setting itself
is now more graceful for systemd versions not supporting
such feature.
systemd-mountfsd so far provided a MountImage() API call for mounting a
disk image and returning a set of mount fds. This complements the API
with a new MountDirectory() API call, that operates on a directory
instead of an image file. Now, what makes this interesting is that it
applies an idmapping from the foreign UID range to the provided target
userns – and in which case unpriveleged operation is allowed (well,
under some conditions: in particular the client must own a parent dir of
the provided path).
This allows container managers to run fully unprivileged from
directories – as long as those directories are owned by the foreign UID
range. Basic operation is like this:
1. acquire a transient userns from systemd-nsresourced with 64K users
2. ask systemd-mountfsd for an idmapped mount of the container dir
matching that userns
3. join the userns and bind the mount fd as root.
Note that we have to drop various sandboxing knobs from the mountfsd
service file for this to work, since the kernel's security checks that
try to ensure than an obstructed /proc/ cannot be circumvented via
mounting a new procfs will otherwise prohibit mountfsd to duplicate the
mounts properly.
File system modules should be something the kernel can autoload
automatically, and according to my testing that works fine, hence let's
drop the explicit deps, in particular as systems usually stick to one fs
for these things, not both.
I inquired bluca about the reason to add it, and didn't remember
anymore, and was fine with me removing this. So let's remove this for
now, should issues arise we can revert this.
mountfsd is supposed to be available during early boot aleady, before
systemd-tmpfiles-setup-dev-early.service completes, hence make sure
loopback devices and DM already work before that.
As suggested by yuwata here:
https://github.com/systemd/systemd/pull/35685#issuecomment-2608157569