In Arch Linux we currently have a half-baked apparmor support,
in particular we cannot link systemd to libapparmor for service
context integration, since that will pull apparmor into base system.
Hence, let's turn this into a dlopen dep.
Ref: https://gitlab.archlinux.org/archlinux/packaging/packages/systemd/-/issues/22
With this we can now do:
systemd-vmspawn -n -i foobar.raw -s io.systemd.boot.entries-extra:particleos-current.conf=$'title ParticleOS Current\nuki-url http://example.com/somedir/uki.efi'
Assuming sd-boot is available inside the ESP of foobar.raw a new item
will show up in the boot menu that allows booting directly into the
specified UKI.
This one is between "efi" and "linux": we'll recognize such entries as
linux, but we'll just invoke them as EFI binaries.
This creates a high-level concept for invoking UKIs via indirection of a
bls type #1 entry, for example to permit invocation from a non-standard
path or for giving entries a different name.
Companion BLS spec PR:
https://github.com/uapi-group/specifications/pull/135
(Let's rename LOADER_UNIFIED_LINUX to LOADER_TYPE2_UKI at the same time
to reduce confusion what is what)
So far our login in gpt-auto-generator when run in the initrd has been
to generate the units that wait for /dev/gpt-auto-root to show up and
mount them only if we have the loader partition EFI variables set. This
is of course not the case for network boots with a UKI kernel, which
means auto-gpt would not work for mounting the rootfs.
What's nasty is that we don't know for sure whether the "rootdisk"
loopback device will shown up eventually, as it needs explicit
configuration by the user via the kernel cmdline, or could be configured
entirely indepdenently. Hence, let's tweak the logic when we wat for
/dev/gpt-auto-root as device to mount: make the gpt auto root logic a
tristate: if root=gpt-auto is specified on the cmdline *definitely*
enable the logic. If root= is specified and set to anyting else,
*definitely* disable the logic. And if root= is not specified check for
the EFI partition vars – as before – to conditionalized things.
Or in other words, you can now boot the same image either via ESP/local
boot or via netboot with a kernelcmdline image like this:
rd.systemd.pull=verify=no,machine,blockdev,bootorigin,raw:rootdisk:image.raw root=gpt-auto rootflags=x-systemd.device-timeout=infinity ip=any
So far the gpt-auto logic only looked for the partition table of devices
that the ESP/XBOOTLDR partition used to boot was on. This works great
for local boots, but is more problematic if we boot a UKI via UEFI HTTP
boot, because there is no ESP in play in that case.
Let's introduce an alternative to communicate the intended default root
disk to cover for this situation: any loopback block device whose
backing file field (i.e. the userspace controlled freeform field we use
for /dev/disk/by-loop-ref/ naming) is set to "rootdisk" will be consider
for gpt-auto will be consider for gpt-auto.
With this in place we should have nice automatic behaviour:
1. If we are booted locally we'll get the ESP/XBOOTLDR data, and derive
the root disk from that.
2. If we are booted via UEFI HTTP boot we expect that the caller makes
the loopback device appear with the right loop-ref identifier, and
then will use that.
Previously, we'd name the import services numerically. Let's instead use
the local target file name, i.e. the object we are creating with these
services locally. That's useful so that we can robustely order against
these service instances, should we need to one day.
This is useful for bind mounting a freshly downloaded and unpacked tar
disk images to /sysroot to mount into.
Specifically, with a kernel command line like this one:
rd.systemd.pull=verify=no,machine,tar:root:http://_gateway:8081/image.tar root=bind:/run/machines/root ip=any
The first parameter downloads the root image, the second one then binds
it to /sysroot so that we can boot into it.
This is useful in particular in the initrd, as this ensures any
downloaded images are not deleted during the initrd→host transition
(where /var/ does not survive, but /run/ does). Might be useful in other
cases too, for example for transiently deployed confexts and such.
This is useful for booting from a freshly downloaded disk image: just
specify
rd.systemd.pull=verify=no,machine,blockdev,raw:image:https://192.168.100.1:8081/image.raw
root=/dev/disk/by-loop-ref/image.raw-part2
on the kernel command line, and we'll download that in the initrd and boot from it.
(note the above disables download-time verification, putting trust in
verity and image policy that this won#t do harm)
Here's a more complete example. From a git checkout do:
ninja -C build && mkosi -f -T serve
and then from another terminal do within the same checkout:
./build/systemd-vmspawn \
--ram=16G \
--register=no \
-n \
-i ./build/mkosi.output/image.raw \
rd.systemd.pull=verify=no,machine,blockdev,raw:image:http://192.168.100.1:8081/image.raw \
root=/dev/disk/by-loop-ref/image.raw-part2 \
rootflags=x-systemd.device-timeout=infinity \
ip=any
This will then boot via the ESP of the specified image, then download
the image via HTTP from the mkosi instance running in the first
terminal, attach it to a loopback block device, and then use its second
partition as root fs, and boot into it.
(this assumes your host is 192.168.100.1, of course)
Note that downloading the full image takes a bit of time (this downloads
it uncompressed after all), hence we turn off the timeout to wait for
the device.
This also introduces a new "imports.target" unit (and associated
"imports-pre.target") between imports are grouped, and which ensure the
imports actually are ordered correctly both on the host and in the
initrd.
This ensures the child process is immediately re-parented to the
manager process which avoids a "Supervising process xxx which is
not our child. We'll most likely not notice when it exits." warning
which can currently happen if the parent systemd-executor parent
process sends the pid namespace child process pidref to the manager
process and the manager process dispatches the child process pidref
before the systemd-executor parent process exits, since at that point
the pid namespace child process's parent will still be the
systemd-executor parent process and not the manager process.
Let's switch things around, and move the internals of safe_fork_full() into
pidref_safe_fork_full() and make safe_fork_full() a trivial wrapper on top
of pidref_safe_fork_full().
This started out as a simple attempt to fix a race in an existing test
case for io.systemd.Machine.Open(). To address it nicely I added some
machinery to varlinkctl and systemd-notify though, and because of that I
refacting reception of sd_notify() messages in various places of our
codebase. So it became much much bigger.
This ports all receivers of sd_notify() messages over to a new common
implementation, except for one: the one in PID1. It's more powerful than
the others, since it accepts fds too. I think we should generalize
notify_recv() to cover that too, it's not that much more work, but for
now I didn't want to add even more refactorings on top, and this can
easily happen later separately, hence I left it out for now.
Sometimes it's interesting to condition units not just on the
installation but on the physical device. Let's make ConditionHost=
useful for that kind of checks, and while we are at it, also allow it to
be used for condition checks on the boot id.
Overloading like this is safe, since UUIDs are globally unique after
all, and hence there should be no conflicts between the namespace of
boot ids, machine ids and product ids.
Finally, relax rules on uuid checking: if the specified string parses
as uuid or id, also check it against the hostname, for setups where
people name hosts after uuids. I wouldn't know why anyone would do that,
but also, why not? shouldn'rt hurt allowing them and should not create
ambiguity conflicts.
This new switch makes it possible to process fds attached to a varlink
reply: if specified it will execute a command line specified by the
caller, will pass the response json as stdin, and any fds acquired as
fd3, fd4, …
So far we relied that the temporary file logic would create the key
files with 0600 mode, but let's set the access mode explicitly:
1. Tighten private key file access from 0600 to 0400, after all we never
want to write it again, it's not a mutable file.
2. Relaxed public key file access mode from 0600 to 0444, after all it's
a public key file, and people should be able to see it if they want
This is useful for propagating the key onto other systems if needed.