Instead of mounting over, do an atomic swap using mount beneath, if
available. This way assets can be mounted again and again (e.g.:
updates) without leaking mounts.
Currently we spawn services by forking a child process, doing a bunch
of work, and then exec'ing the service executable.
There are some advantages to this approach:
- quick: we immediately have access to all the enourmous amount of
state simply by virtue of sharing the memory with the parent
- easy to refactor and add features
- part of the same binary, will never be out of sync
There are however significant drawbacks:
- doing work after fork and before exec is against glibc's supported
case for several APIs we call
- copy-on-write trap: anytime any memory is touched in either parent
or child, a copy of that page will be triggered
- memory footprint of the child process will be memory footprint of
PID1, but using the cgroup memory limits of the unit
The last issue is especially problematic on resource constrained
systems where hard memory caps are enforced and swap is not allowed.
As soon as PID1 is under load, with no page out due to no swap, and a
service with a low MemoryMax= tries to start, hilarity ensues.
Add a new systemd-executor binary, that is able to receive all the
required state via memfd, deserialize it, prepare the appropriate
data structures and call exec_child.
Use posix_spawn which uses CLONE_VM + CLONE_VFORK, to ensure there is
no copy-on-write (same address space will be used, and parent process
will be frozen, until exec).
The sd-executor binary is pinned by FD on startup, so that we can
guarantee there will be no incompatibilities during upgrades.
This reworks the image discovery logic, and conceptually allows DDIs
that are both confext and sysext to exist. Previously we'd only extract
one type of exension data from a DDI, with this we allow to extract both
if both exist.
This doesn't add support for true "multi-modal" DDIs, that qualify as
various things at once, it just lays some ground work that ensures we at
least can dissect such images.
This reworks 484d26dac1 quite a bit.
This changes systemd-dissect's JSON output, but given the
version with the fields it changes/dops has never been released (as the
above patch was merged post-v254) this shouldn't be an issue.
In TEST-70-TPM2, test systemd-cryptenroll --tpm2-seal-key-handle using the
default (0) as well as the SRK handle (0x81000001), and test using a non-SRK
handle index after creating and persisting a primary key.
In test/test-tpm2, test tpm2_seal() and tpm2_unseal() using default (0), the SRK
handle, and a transient handle.
Otherwise they might leave stuff behind if they don't respond fast
enough to the first SIGTERM and get SIGKILLEd, which then breaks reusing
the unit name further in the test:
[ 2993.620849] H testsuite-82.sh[43]: + systemd-run -p Type=exec -p DefaultDependencies=no -p IgnoreOnIsolate=yes --unit=testsuite-82-nosurvive.service sleep infinity
[ 2993.628686] H systemd[1]: testsuite-82-nosurvive.service: About to execute: /usr/bin/sleep infinity
[ 2993.628886] H systemd[1]: testsuite-82-nosurvive.service: Forked /usr/bin/sleep as 65
[ 2993.629328] H systemd[1]: testsuite-82-nosurvive.service: Changed dead -> start
...
[ 2993.699892] H testsuite-82.sh[43]: + systemctl --no-block --check-inhibitors=yes soft-reboot
[ 2993.704326] H systemd-logind[41]: The system will soft-reboot now!
...
[ 3001.249302] H systemd[1]: Sending SIGKILL to PID 65 (sleep).
...
[ 3001.303158] H testsuite-82.sh[136]: + systemd-notify '--status=Second Boot'
...
[ 3001.409504] H testsuite-82.sh[136]: + systemd-run -p Type=exec --unit=testsuite-82-nosurvive.service sleep infinity
[ 3001.414061] H testsuite-82.sh[165]: Failed to start transient service unit: Unit testsuite-82-nosurvive.service was already loaded or has a fragment file.
Spotted in Ubuntu CI.
In discover_next_boot(), first we find a new boot ID based on the value
stored in the entry object. Then, find the tail (or head when we are going
upwards) entry of the boot based on the _BOOT_ID= field data.
If boot IDs of an entry in the entry object and _BOOT_ID field data
are inconsistent, which may happen on corrupted journal, then previously
discover_next_boot() failed with -ENODATA.
This makes the function check if the two boot IDs in each entry are
consistent, and skip the entry if not.
Fixes the failure of `journalctl -b -1` for 'truncated' journal:
https://github.com/systemd/systemd/pull/29334#issuecomment-1736567951
Doing that in VMs without acceleration is prohibitively expensive (i.e.
20+ seconds in the C8S job). Thankfully, the recent [0] --lines=+n syntax
makes this all quite easy to fix.
[0] 8d6791d2aa
Since the soft-reboot drops the enqueued end.service, we won't shutdown
the test VM if the test fails and have to wait for the watchdog to kill
us (which may take quite a long time). Let's just forcibly kill the
machine instead to save CI resources.
This adds an explicit service for initializing the TPM2 SRK. This is
implicitly also done by systemd-cryptsetup, hence strictly speaking
redundant, but doing this early has the benefit that we can parallelize
this in a nicer way. This also write a copy of the SRK public key in PEM
format to /run/ + /var/lib/, thus pinning the disk image to the TPM.
Making the SRK public key is also useful for allowing easy offline
encryption for a specific TPM.
Sooner or later we should probably grow what this service does, the
above is just the first step. For example, the service should probably
offer the ability to reset the TPM (clear the owner hierarchy?) on a
factory reset, if such a policy is needed. And we might want to install
some default AK (?).
Fixes: #27986
Also see: #22637