Currently we spawn services by forking a child process, doing a bunch
of work, and then exec'ing the service executable.
There are some advantages to this approach:
- quick: we immediately have access to all the enourmous amount of
state simply by virtue of sharing the memory with the parent
- easy to refactor and add features
- part of the same binary, will never be out of sync
There are however significant drawbacks:
- doing work after fork and before exec is against glibc's supported
case for several APIs we call
- copy-on-write trap: anytime any memory is touched in either parent
or child, a copy of that page will be triggered
- memory footprint of the child process will be memory footprint of
PID1, but using the cgroup memory limits of the unit
The last issue is especially problematic on resource constrained
systems where hard memory caps are enforced and swap is not allowed.
As soon as PID1 is under load, with no page out due to no swap, and a
service with a low MemoryMax= tries to start, hilarity ensues.
Add a new systemd-executor binary, that is able to receive all the
required state via memfd, deserialize it, prepare the appropriate
data structures and call exec_child.
Use posix_spawn which uses CLONE_VM + CLONE_VFORK, to ensure there is
no copy-on-write (same address space will be used, and parent process
will be frozen, until exec).
The sd-executor binary is pinned by FD on startup, so that we can
guarantee there will be no incompatibilities during upgrades.
Let's mention that we just need the latest stable release of mkosi,
not the latest git commit. We also split the instructions for building
on the host and the instructions for building with mkosi into two blocks,
as it's not required to build on the host anymore to build with mkosi.
On normal systems, triggering a timeout should be a bug in code or
configuration error, so I do not think we should extend the default
timeout. Also, we should not introduce a 'first class' configuration
option about that. But, making it configurable may be useful for cases
such that "an extremely highly utilized system (lots of OOM kills,
very high CPU utilization, etc)".
Closes#25441.
The tool initially just measured the boot phase, but was subsequently
extended to measure file system and machine IDs, too. At AllSystemsGo
there were request to add more, and make the tool generically
accessible.
Hence, let's rename the binary (but not the pcrphase services), to make
clear the tool is not just measureing the boot phase, but a lot of other
things too.
The tool is located in /usr/lib/ and still relatively new, hence let's
just rename the binary and be done with it, while keeping the unit names
stable.
While we are at it, also move the tool out of src/boot/ and into its own
src/pcrextend/ dir, since it's not really doing boot related stuff
anymore.
ispell made some suggestions which I applied.
Addresses: https://github.com/systemd/systemd/pull/29209#pullrequestreview-1632623460
Also adds a brief paragraph about initrd transitions. (Plymouth really
should start using the fdstore for pinning DRM objects, and stop trying
to survive the initrd→host transition)
There were a couple spelling/grammatical errors in the docs that made
it hard to read and understand parts of this doc. I cleaned up those
errors and reflowed the line breaks to keep to the 80 char limit.
The article "a" goes before consonant sounds and "an" goes before vowel
sounds. This commit changes an to a for UKI, UDP, UTF-8, URL, UUID, U-Label, UI
and USB, since they start with the sound /ˌjuː/.
Since mkosi is now smart enough to drop the caches when the list of
packages changes, let's enable Incremental= mode by default to ensure
a good experience for anyone new to hacking on systemd with mkosi.
ImportCredential= takes a credential name and searches for a matching
credential in all the credential stores we know about it. It supports
globs which are expanded so that all matching credentials are loaded.
The kernel, systemd, and many other things print their version during boot.
sd-boot and sd-stub are also important, so let's print the version if EFI_DEBUG.
(If !EFI_DEBUG, continue to be quiet.)
When updating the docs, I saw that that the text in HACKING.md was out of date.
Instead of trying to update the instructions there, make it shorter and refer
the reader to tools/debug-sd-boot.sh for details.
Before 7cd43e34c5, it was possible to use
SYSTEMD_PROC_CMDLINE=systemd.condition-first-boot to override autodetection.
But now this doesn't work anymore, and it's useful to be able to do that for
testing.
We provide the same stability for all the headers that are public.
Also, mark id128 as portable to other systems. There is really nothing in the
code that would make it hard. It would probably work out-of-the-box.
Let's start moving towards a more involved partitioning setup to
test our stuff more when using mkosi.
The root partition is generated on boot with systemd-repart.
CentOS supports neither erofs nor btrfs so we use squashfs and xfs
instead.
We also enable SecureBoot= locally for additional coverage. This
and the use of verity means users need to run `mkosi genkey` once
to generate the keys necessary to do secure boot and verity.
This implements a minimal subset of #24961, but in a lot more
restrictive way: we only allow one level of subcgroup (as that's enough
to address the no-processes in inner cgroups rule), and does not change
anything about threaded cgroup logic or similar, or make any of this new
behaviour mandatory.
All this does is this: all non-control processes we invoke for a unit
we'll invoke in a subgroup by the specified name.
We'll later port all our current services that use cgroup delegation
over to this, i.e. user@.service, systemd-nspawn@.service and
systemd-udevd.service.