mirror of
https://github.com/morgan9e/systemd
synced 2026-04-14 00:14:32 +09:00
doc: add document explaining the 3 key components of the boot and how we find the rootfs
After the network boot PR got merged we can use some more high-level docs I guess.
This commit is contained in:
407
docs/ROOTFS_DISCOVERY.md
Normal file
407
docs/ROOTFS_DISCOVERY.md
Normal file
@@ -0,0 +1,407 @@
|
||||
---
|
||||
title: Boot Components & Root File System Discovery
|
||||
category: Booting
|
||||
layout: default
|
||||
SPDX-License-Identifier: LGPL-2.1-or-later
|
||||
---
|
||||
|
||||
# Boot Components & Root File System Discovery
|
||||
|
||||
The recommended way to boot a [`systemd`](https://systemd.io/) based
|
||||
[UEFI](https://en.wikipedia.org/wiki/UEFI) system consists primarily of three
|
||||
components:
|
||||
|
||||
1. A boot loader,
|
||||
i.e. [`systemd-boot`](https://www.freedesktop.org/software/systemd/man/latest/systemd-boot.html)
|
||||
that provides interactive and programmatic control of what precisely to
|
||||
boot. It takes care of enumerating all possible boot targets (implementing
|
||||
the [Boot Loader
|
||||
Specification](https://uapi-group.org/specifications/specs/boot_loader_specification/),
|
||||
potentially presenting it to the user in a menu, but otherwise picking an
|
||||
item automatically, implementing boot counting and automatic rollback if
|
||||
desired.
|
||||
|
||||
2. A [unified kernel image
|
||||
("UKI")](https://uapi-group.org/specifications/specs/unified_kernel_image/),
|
||||
i.e. an UEFI PE executable that combines
|
||||
[`systemd-stub`](https://www.freedesktop.org/software/systemd/man/latest/systemd-stub.html),
|
||||
a Linux kernel, and an initial RAM disk ("`initrd`") into one. UKIs are
|
||||
self-descriptive: the aforementioned boot loader enumerates these UKIs and
|
||||
automatically extracts all information necessary to determine which menu
|
||||
entries to generate for them. Within the UKI runtime (very early during
|
||||
kernel initialization) the transition from the UEFI firmware world into the
|
||||
Linux world takes place, i.e. it encapsulates the fundamental
|
||||
`ExitBootServices()` UEFI call, that ends PC firmware control, and lets the
|
||||
Linux kernel take over.
|
||||
|
||||
3. A root file system ("`rootfs`"): this is where the regular OS is
|
||||
located. The primary job of the early userspace that is contained in the
|
||||
`initrd` that itself is part of the UKI, is to find, set up and pivot into
|
||||
the `rootfs`.
|
||||
|
||||
> [!NOTE]
|
||||
> The above is how `systemd` upstream recommends a system is put together. However,
|
||||
> distributions differ from this, sometimes massively – in particular when
|
||||
> their focus is on supporting legacy (i.e. non-UEFI) hardware platforms. We
|
||||
> believe the above three components are all that's really necessary for a
|
||||
> robust, simple and comprehensive system, but downstreams might see things
|
||||
> differently. The above however is supposed to be a guideline for distribution
|
||||
> developers.
|
||||
|
||||
## Execution Environments
|
||||
|
||||
Note that these three components are executed within very distinct execution
|
||||
environments, with very different APIs, drivers and file system access:
|
||||
|
||||
| Component | Environment |
|
||||
|-----------|-------------|
|
||||
| Boot Loader: `systemd-boot` | UEFI APIs, simple VFAT file system access |
|
||||
| UKI initially: `systemd-stub` | UEFI APIs, simple VFAT file system access |
|
||||
| UKI finally: `initrd` | Linux APIs, complex storage and file systems |
|
||||
| Root File System: `rootfs` | Linux APIs, full OS functionality |
|
||||
|
||||
## Structure & Auxiliary Resources
|
||||
|
||||
Each of the three components is primarily encapsulated in a single object each:
|
||||
|
||||
1. The boot loader is primarily a single PE UEFI binary, called either
|
||||
`systemd-boot.efi` or `BOOTX64.EFI` depending on context (the latter name
|
||||
contains an architecture identifier, and is different for non-x86-64
|
||||
architectures).
|
||||
|
||||
2. The UKI is primarily a single PE UEFI binary (i.e. a `.efi` file).
|
||||
|
||||
3. The `rootfs` is typically a Linux file system, on a GPT partition table
|
||||
disk. Typically, the `rootfs` is placed within some form of container that
|
||||
ensures security of the file system, i.e. authenticity, confidentiality
|
||||
and integrity via `dm-verity`, `dm-crypt`, `dm-integrity` or a combination
|
||||
thereof.
|
||||
|
||||
While these three objects are generally enough to boot an OS successfully, in
|
||||
many cases some parameterization and modularization of the boot is necessary,
|
||||
hence each of these components is often combined with certain optional,
|
||||
auxiliary resources:
|
||||
|
||||
1. The boot loader can read a configuration file
|
||||
[`loader.conf`](https://www.freedesktop.org/software/systemd/man/latest/loader.conf.html),
|
||||
find additional drivers, or key material for SecureBoot enrollment in the
|
||||
same file system it itself is placed in.
|
||||
|
||||
2. `systemd-stub` can find additional parameters (["system
|
||||
credentials"](https://systemd.io/CREDENTIALS)), configuration
|
||||
(["`confext`"](https://www.freedesktop.org/software/systemd/man/latest/systemd-confext.html)),
|
||||
drivers/firmware ("`sysext`") and other resources ("EFI Addons") placed next
|
||||
to the location it itself is placed in. We typically call these companion
|
||||
resources "sidecars".
|
||||
|
||||
4. The `rootfs` often is a combination of one file system for `/usr/`
|
||||
("`usrfs`") and one for the actually root `/`, and possibly further,
|
||||
auxiliary file systems, for example for `/home/` or `/srv/`.
|
||||
|
||||
> [!NOTE]
|
||||
> Depending on the execution environment the first component (the boot loader)
|
||||
> might be dispensable. Specifically, on disk images intended solely for use in
|
||||
> VMs it might be make sense to tell the firmware to directly boot into a UKI,
|
||||
> as the UKI selection functionality is somewhat redundant there: the VMM's
|
||||
> image selection functionality might supersede it.
|
||||
|
||||
> [!NOTE]
|
||||
> Depending on the execution environment the last component (the `rootfs`)
|
||||
> might also be dispensable. Specifically, for simpler fixed-purpose, stateless
|
||||
> applications it might be sufficient to run everything needed directly from
|
||||
> the `initrd` file system embedded in the UKI, and never transition out of
|
||||
> this. In this case, conceptually the `initrd` is only an `initrd` from kernel
|
||||
> PoV, but is already the `rootfs` from a userspace PoV. We usually call these
|
||||
> types of setups Unified System Image ("USI"), as opposed to UKI.
|
||||
|
||||

|
||||
|
||||
## Automatic Discovery on Disks
|
||||
|
||||
In the most common case all three components and their sidecars are placed on
|
||||
the same disk. Specifically:
|
||||
|
||||
1. The boot loader is placed in the "EFI System Partition" (ESP), typically at
|
||||
the paths `/EFI/BOOT/BOOTX64.EFI` (this is a generic entrypoint binary that
|
||||
the firmware executes when you just point it to a disk to boot without any
|
||||
further details) and `/EFI/systemd/systemd-bootx64.efi` (this is a more
|
||||
specific entrypoint that can be registered persistently in the firmware, to
|
||||
give it an explicit starting point). The ESP is a concept defined by the
|
||||
UEFI specification and is what the firmware initially looks for and
|
||||
mounts. Since VFAT is the only relevant file system type UEFI firmwares have
|
||||
to support the ESP is generally a VFAT file system. The aforementioned
|
||||
auxiliary, optional resources the boot loader may consume are placed in the
|
||||
ESP as well, in particular below the `/loader/` subdirectory.
|
||||
|
||||
2. The UKIs may either be placed in the ESP (below the `/EFI/Linux/`
|
||||
subdirectory), or in the [Extended Boot Loader
|
||||
Partition](https://uapi-group.org/specifications/specs/boot_loader_specification/#the-partitions)
|
||||
("XBOOTLDR"), which can be placed on the same disk as the ESP and is also
|
||||
VFAT. XBOOTLDR is an optional concept and it's only *raison d'être* is that
|
||||
ESPs sometimes are sized too small by vendors, and do not have enough space
|
||||
for multiple UKIs. XBOOTLDR hence serves as a conceptual extension of the
|
||||
size constrained ESP. Sidecars for the UKIs are typically placed in a
|
||||
directory next to the UKI they are for, whose name however is suffixed by
|
||||
`.d/`, i.e. a UKI `foo.efi` has its sidecars in `foo.efi.d/`.
|
||||
|
||||
3. The `rootfs` is placed on the same disk as the ESP/XBOOTLDR, in a partition
|
||||
marked with a special GPT partition type. Various other well-known types of
|
||||
partitions can be placed next to the `rootfs` and are automatically
|
||||
discovered and mounted, see the [Discoverable Partitions
|
||||
Specification](https://uapi-group.org/specifications/specs/discoverable_partitions_specification/)
|
||||
for details.
|
||||
|
||||
In this common case, discovery of all three components and their sidecars is
|
||||
fully automatic. Each component derives automatically where to find its
|
||||
auxiliary resources as well as the next step to transition to, entirely based
|
||||
on the place itself is running from. There's a full chain of automatic
|
||||
discovery in place:
|
||||
|
||||
1. The firmware picks the disk to boot from (possibly by interactive choice of
|
||||
the user), accesses the ESP on it, and invokes the boot loader from it.
|
||||
|
||||
2. The boot loader then looks for UKIs, both on the ESP it was invoked from,
|
||||
and in the XBOOTLDR partition next to it on the same disk.
|
||||
|
||||
3. The UKI's initrd then looks for the `rootfs`, on the same disk the
|
||||
UKI was invoked from, i.e. it looks for a partition marked as root next to
|
||||
the ESP/XBOOTLDR partition. (This information is passed from UKI to
|
||||
userspace via the
|
||||
[`LoaderDevicePartUUID`](https://systemd.io/BOOT_LOADER_INTERFACE) EFI
|
||||
variable.)
|
||||
|
||||
In more complex setups it is possible to specify in more detail where to find
|
||||
each of these resources:
|
||||
|
||||
1. Firmware typically provides a basic boot menu which may be used to choose
|
||||
between various releavant boot loaders/entrypoints on multiple disks. This
|
||||
is sometimes configurable from the firmware setup tool, as well as from
|
||||
userspace via tools such as
|
||||
[`bootctl`](https://www.freedesktop.org/software/systemd/man/latest/bootctl.html),
|
||||
`efibootmgr` or `kernel-bootcfg`.
|
||||
|
||||
2. The `systemd-boot` boot loader may be configured via [`Boot Loader
|
||||
Specification Type #1`](https://uapi-group.org/specifications/specs/boot_loader_specification/)
|
||||
entries to acquire UKIs or similar from other locations.
|
||||
|
||||
3. The `initrd` part of the UKI understands the `root=` (and `mount.usr=`)
|
||||
kernel command line switches to look for the `rootfs`/`usrfs` at a
|
||||
particular place.
|
||||
|
||||
While it is recommended to keep all three components closely together it is
|
||||
possible via these mechanisms to place all three at completely disparate
|
||||
locations, too.
|
||||
|
||||
## Network Boot
|
||||
|
||||
In many cases it is essential to boot an OS from the network instead of a
|
||||
local disk. This can happen at each of these three components:
|
||||
|
||||
1. Many UEFI firmwares support HTTP(S) network boot (usually requires enabling
|
||||
in firmware setup). If this is available it permits downloading a disk image
|
||||
from an HTTP server (the URL can either be configured in the firmware setup,
|
||||
or be acquired inside a DHCP lease). The disk image is then set up as a RAM
|
||||
disk, and then processed much like a regular disk: an ESP is searched for
|
||||
and the `/EFI/BOOT/BOOTX64.EFI` entrypoint binary is invoked.
|
||||
|
||||
2. UKIs can be placed on the same downloaded disk image, within the ESP. If
|
||||
multiple different UKIs shall be made accessible from the same boot menu
|
||||
this would potentially blow up the size of the disk image however, to
|
||||
prohibitive sizes. In order to address this, it is possible to embed Boot
|
||||
Loader Specification Type #1 entry files in the ESP instead, which may carry
|
||||
references to the UKIs to download and invoke once a choice is made. These
|
||||
references can either be full URLs or alternatively simple filenames which
|
||||
are then automatically suffixed to the URL that was used by the firmware
|
||||
to acquire the initial boot disk.
|
||||
|
||||
3. The `rootfs` can be acquired automatically from a networked source too, in a
|
||||
very flexible fashion. For example, the `initrd` contained in the UKI might
|
||||
support NVMe-over-TCP or iSCSI block devices to boot from, supporting the
|
||||
whole Linux storage stack. `systemd` also natively [supports downloading the
|
||||
`rootfs` from HTTP
|
||||
sources](https://www.freedesktop.org/software/systemd/man/latest/systemd-import-generator.html),
|
||||
either in a GPT disk image (specifically:
|
||||
[DDIs](https://uapi-group.org/specifications/specs/discoverable_disk_image/),
|
||||
with `.raw` suffix) or in a `.tar` file, which are placed in system RAM and
|
||||
then booted into (these downloads can be downloaded in compressed form and
|
||||
are automatically decompressed on-the-fly). This of course requires
|
||||
sufficient RAM to be available on the targeted system, and also implies that
|
||||
persistency of modifications of the file system is not available. If this
|
||||
mode is used the URL to acquire the `rootfs` disk image from can be derived
|
||||
automatically from the URL that was used to acquire the UKI itself. (This
|
||||
information is passed from UKI to userspace via the `LoaderDevicePartURL`
|
||||
EFI variable.)
|
||||
|
||||
Similar to the disk-based boot scheme described in the previous section
|
||||
discovery of the boot source can take place fully automatically, with each
|
||||
component taking the boot source choice of the preceeding component into
|
||||
account: the boot loader can automatically download UKIs from the same source
|
||||
it itself was downloaded from. Moreover the `initrd` of the UKI can
|
||||
automatically downloads the `rootfs` from the same source it itself was
|
||||
downloaded from.
|
||||
|
||||
Also, much like in the disk-based boot scheme it is also possible to
|
||||
specifically direct each of the components to a different place instead of the
|
||||
automatic derivation of URLs from the same source. On top of that it is of
|
||||
course possible to mix disk-based and network-based boot: for example place the
|
||||
boot loader on the local disk, but use UKIs and `rootfs` from networked souces;
|
||||
or alternatively place both boot loader and UKIs on the local disk, and only
|
||||
the `rootfs` on a networked source.
|
||||
|
||||
## Trust & Security
|
||||
|
||||
In a modern boot integrity world, all three of the relevant components as well
|
||||
as (most of) their sidecars require cryptographic protection before they can be
|
||||
consumed. Specifically:
|
||||
|
||||
1. The boot loader is typically authenticated by the firmware before invocation
|
||||
via UEFI Secure Boot, i.e. checked against a cryptographic certificate list
|
||||
persistently stored in the firmware. Note that the various auxiliary
|
||||
resources the `systemd-boot` boot loader reads are not individually
|
||||
authenticated (i.e. `loader.conf` as well as Type #1 Boot Loader
|
||||
Specification entries). Because of this they can typically only be used in a
|
||||
very restricted fashion, i.e. configure some UI details as well as menu
|
||||
entries. Some options available are ignored if Secure Boot mode is
|
||||
enabled. Moreover, even if the text strings shown in the menu entries might
|
||||
not be authenticated the binaries that are invoked once they are selected
|
||||
are, as are all their parameters.
|
||||
|
||||
2. The UKIs are also authenticated by the firmware via UEFI Secure Boot, and so
|
||||
are EFI Addons. `confext` and `sysext` sidecars are protected via
|
||||
`dm-verity` and a signature of the root hash is validated against keys in
|
||||
the kernel's keyrings. System credentials are authenticated via secrets
|
||||
stored in the TPM.
|
||||
|
||||
3. Authentication of the `rootfs` and `usrfs` is more variable: depending on
|
||||
setup this is either done via `dm-verity` (either pinned by root hash from
|
||||
the UKI, or authenticated by signature provided to the kernel, checked
|
||||
against the kernel's built-in keyring) or `dm-crypt`+`dm-integrity`
|
||||
(protected by TPM or user provided password/FIDO/PKCS#11), or in the network
|
||||
case at download time via detached signatures (currently only GPG) or via
|
||||
HTTPS certificate validation. Note that by default the automatic discovery
|
||||
mechanism of the `rootfs` and its auxiliary file systems does not insist on
|
||||
cryptographic protection and authentication before use. However, the
|
||||
[`systemd.image_policy=`](https://www.freedesktop.org/software/systemd/man/latest/systemd.image-policy.html)
|
||||
kernel command line switch may be used to control precisely what kind of
|
||||
protection to require from each such partition.
|
||||
|
||||
Note that UEFI Secure Boot is problematic in various ways: it is generally
|
||||
bound to a certificate list maintained centrally by Microsoft, and thus implies
|
||||
a complex (and expensive) code signing bureaucracy, that in many cases is
|
||||
undesirable, particularly in a community Linux world. Moreover, because the
|
||||
certificate list managed by Microsoft is very large its security value is
|
||||
limited: it mostly acts more as denylist of known bad software rather than as
|
||||
allowlist of known good. (If you enroll your own list things are much better,
|
||||
but see below.)
|
||||
|
||||
### Shim
|
||||
|
||||
To make the code signing more palatable to the Linux world the `shim` project
|
||||
has been developed, which is often used as initial component of the OS boot
|
||||
(i.e. the firmware would invoke `shim` as component 0, before the components 1,
|
||||
2, 3 described above). `shim` is primarily relevant for two reasons: it adds
|
||||
a second set of certificates on top of the UEFI Secure Boot list, maintained by
|
||||
the OS vendor, and it optionally provides functionality to maintain a local set
|
||||
of keys ("MOK") in addition to the Microsoft and OS vendor keys.
|
||||
|
||||
### Automatic Enrollment of Secure Boot Certificates
|
||||
|
||||
The `systemd-boot` boot loader also supports automatic enrollment of
|
||||
alternative SecureBoot certificates: if the system is booted in the
|
||||
firmware-provided Secure Boot "Setup Mode", it can automatically enroll
|
||||
certificates placed inside the ESP into the firmware, replacing any existing
|
||||
ones if there are any. This mechanism massively enhances the security value of
|
||||
Secure Boot: you can enroll your own certificates, ensuring that only the
|
||||
software you want shall be allowed to be run on the system, in a very focussed
|
||||
way. However, do note that this mechanism is only suitable if the hardware is
|
||||
known to be compatible with this mechanism, because the Secure Boot certificate
|
||||
list is also used to authenticate firmware extensions provided by certain
|
||||
extension boards of PCs (for example graphics cards). Or in other words:
|
||||
replacing the certficate list with your own might result in unbootable and even
|
||||
bricked systems. Automatic enrolling of Secure Boot certficates is however a
|
||||
really good option if the targeted harwdare is well known to be compatible with
|
||||
this, which is in particular the case in VMs.
|
||||
|
||||
### Measured Boot and TPMs
|
||||
|
||||
Secure Boot is not the only mechanism that can provide boot time integrity
|
||||
guarantees of the OS. Most modern systems are equipped with a TPM security
|
||||
chip. It allows components of the boot to issue "measurements" (i.e. submission
|
||||
of a cryptographic hash) of the next step of the boot process as well as of all
|
||||
inputs they consume to the device, in a fashion that cannot be undone (except
|
||||
if the system is rebooted). The combination of measurements of all such boot
|
||||
components can then later be used to protect secrets the TPM can manage: only
|
||||
if the system is booted in a very specific way such secrets (such as a disk
|
||||
encryption key) can be revealed to the OS. This hence provides a different form
|
||||
of protection: instead of making it *a-priori* impossible to boot or consume
|
||||
untrusted components (as Secure Boot would do it), anything is permitted,
|
||||
however the TPM would never reveal protected secrets to the OS unless the
|
||||
components are trusted, in a *a-posteriori* fashion. This generally provides a
|
||||
more focussed security model (as the list of allowed components and the
|
||||
policies derived thereof are locally maintained instead of world-wide by
|
||||
Microsoft), however, requires more careful management of OS and firmware
|
||||
updates. Moreover, it's more compatible with a TOFU security model ("Trust on
|
||||
first use") rather than a universal trust model.
|
||||
|
||||
`systemd-stub` will measure the sidecars it picks up as well as the individual
|
||||
parts that make up a UKI that are used for boot. `systemd` userspace will also
|
||||
measure various parts of its resources as does the kernel.
|
||||
|
||||
[`systemd-pcrlock`](https://www.freedesktop.org/software/systemd/man/latest/systemd-pcrlock.html),
|
||||
[`systemd-measure`](https://www.freedesktop.org/software/systemd/man/latest/systemd-measure.html),
|
||||
[`systemd-cryptenroll`](https://www.freedesktop.org/software/systemd/man/latest/systemd-cryptenroll.html),
|
||||
[`systemd-cryptsetup`](https://www.freedesktop.org/software/systemd/man/latest/systemd-cryptsetup.html)
|
||||
can be used to manage Measured Boot policies for disk encryption.
|
||||
|
||||
Note that Secure Boot and Measured Boot are not exclusive to each other, they
|
||||
are often used in combination, and can interact (e.g. the Secure Boot
|
||||
certificate lists are measured as part of the boot process).
|
||||
|
||||
### Security of Network Boot
|
||||
|
||||
UEFI HTTP boot comes in two flavours: plain HTTP and HTTPS. The latter
|
||||
typically requires enrollment of TLS server certificates in the system
|
||||
firmware, but provides transport integrity, authenticity and
|
||||
confidentiality. Acquiring the various resources via plain HTTP should
|
||||
generally be sufficient too though, as the key resources acquired this way
|
||||
area generally authenticated before use via other mechanisms, see above.
|
||||
|
||||
## Building
|
||||
|
||||
The `systemd` project provides various tools to build the various components
|
||||
necessary to implement the aforementioned boot process:
|
||||
|
||||
* The `systemd-boot`, `systemd-stub` components are part of the `systemd`
|
||||
source tree.
|
||||
|
||||
* The
|
||||
[`ukify`](https://www.freedesktop.org/software/systemd/man/latest/ukify.html)
|
||||
tool provided by `systemd` can be used to build UKIs, USIs and EFI Addons.
|
||||
|
||||
* The `bootctl` tool provided by `systemd` can be used to install the
|
||||
`systemd-boot` boot loader to the ESP.
|
||||
|
||||
* The
|
||||
[`kernel-install`](https://www.freedesktop.org/software/systemd/man/latest/kernel-install.html)
|
||||
tool provided by `systemd` can be used to install UKIs to the ESP or
|
||||
XBOOTLDR.
|
||||
|
||||
* The
|
||||
[`systemd-repart`](https://www.freedesktop.org/software/systemd/man/latest/systemd-repart.html)
|
||||
tool can be used to generate `confext` and `sysext` images, as well as
|
||||
`rootfs` and `usrfs` images. It can also be used install such images on other
|
||||
disks, or to augment minimal disk images on boot with additional partitions,
|
||||
or grow them.
|
||||
|
||||
* The
|
||||
[`systemd-creds`](https://www.freedesktop.org/software/systemd/man/latest/systemd-creds.html)
|
||||
tool can be used to generate system credential files.
|
||||
|
||||
* The `systemd-cryptsetup`, `systemd-veritysetup` and `systemd-integritysetup`
|
||||
tools can be used to set up cryptographically protected disks at boot.
|
||||
|
||||
* [`mkosi`](https://github.com/systemd/mkosi) is a higher level tool that
|
||||
combines all of the above to build and sign complete OS images from
|
||||
distribution packages, as needed: `initrd` trees, UKIs, USIs,
|
||||
`rootfs`/`usrfs`, whole DDIs, `sysext` images, `.tar` files to boot into and
|
||||
more.
|
||||
2
docs/rootfs-discovery-flow.svg
Normal file
2
docs/rootfs-discovery-flow.svg
Normal file
File diff suppressed because one or more lines are too long
|
After Width: | Height: | Size: 61 KiB |
Reference in New Issue
Block a user