See https://gitlab.gnome.org/GNOME/glib/-/issues/2931 for the changes in
GLib upstream. Using `GMemoryMonitor` is now more compliant with the
systemd recommended approach, but it needs further work to read the
recommended environment variables rather than unconditionally accessing
the per-cgroup PSI kernel file directly.
Signed-off-by: Philip Withnall <pwithnall@gnome.org>
We do this in a separate service (rather than inside of
systemd-tpm2-setup), since we want failures of this measurement to
result in an instant reboot, as for most of our measurements.
Failures to initialize NvPCRs or to allocate an SRK are somewhat OK (and
more likely), as long as this separator clearly communicates where they
must have taken place, if they worked.
This locks down NvPCR initialization a bit more: we'll measure each
initialization of an NvPCR into PCR 9, thus chaining the NvPCRs to the
PCR set. After all NvPCRs are initialized we measure a barrier into PCR
9 as well.
This ensures that later additions of NvPCRs are clearly recognizable and
distinguishable from those done at boot.
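To illustrate why the barrier makes late additions recognizable, here is a minimal sketch of the usual TPM extend operation (new PCR value = hash of old value concatenated with the event digest). The event strings are made up for illustration and are not the actual measured data; SHA-256 and the all-zero start value are assumptions of this sketch.

```python
import hashlib

def extend(pcr: bytes, event: bytes) -> bytes:
    # Standard extend semantics: PCR_new = H(PCR_old || H(event)).
    return hashlib.sha256(pcr + hashlib.sha256(event).digest()).digest()

def measure(events, pcr: bytes = bytes(32)) -> bytes:
    # Replay a sequence of events into a PCR that starts out as all zeroes.
    for event in events:
        pcr = extend(pcr, event)
    return pcr

# Hypothetical event strings, for illustration only.
at_boot = measure([b"nvpcr-init:a", b"nvpcr-init:b", b"barrier"])
late = measure([b"nvpcr-init:a", b"nvpcr-init:b", b"barrier", b"nvpcr-init:c"])
early = measure([b"nvpcr-init:a", b"nvpcr-init:b", b"nvpcr-init:c", b"barrier"])

# The same NvPCR initialized after the barrier yields a different PCR value
# than if it had been initialized before it.
assert late != early
```

Because extension is order-sensitive, a verifier replaying the event log can tell on which side of the barrier each initialization happened.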
Load modules in parallel using a pool of worker threads. The number of
threads is equal to the number of CPUs, with a maximum of 16 (to avoid
too many threads being started during boot on systems with a high
core count, since the number of modules loaded on boot is usually on
the small side).
The number of threads can optionally be specified manually using the
SYSTEMD_MODULES_LOAD_NUM_THREADS environment variable; in this case,
no limit is enforced. If SYSTEMD_MODULES_LOAD_NUM_THREADS is set to 0,
probing happens sequentially.
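The selection logic described above can be sketched roughly as follows. This is only an illustration in Python, not the actual implementation; the function name and the handling of malformed values are assumptions.

```python
import os

MAX_AUTO_THREADS = 16  # cap that applies only to the automatic default

def module_load_threads(environ=None, cpu_count=None):
    """Return the number of worker threads; 0 means sequential probing."""
    environ = os.environ if environ is None else environ
    override = environ.get("SYSTEMD_MODULES_LOAD_NUM_THREADS")
    if override is not None:
        # Manual setting: taken as given, no upper limit is enforced.
        # Rejecting malformed or negative values is an assumption here.
        count = int(override)
        if count < 0:
            raise ValueError("thread count must not be negative")
        return count
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    return min(cpu_count, MAX_AUTO_THREADS)
```

For example, a 64-CPU machine would get 16 workers by default, while setting the variable to 64 would really start 64.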
Co-authored-by: Eric Curtin <ecurtin@redhat.com>
This has been tripping up container manager people. Let's document this
explicitly.
(Note that the container interface could really use some updates, i.e.
it was written before cgroup namespacing was a thing. But I
am too lazy to fix that now, so let's just add this one facet.)
udev's block device locking protocol has one pitfall not even the
example in the documentation got right so far (even though this is
explained in all detail above): udev's rescanning is only triggered when
an fd that is opened for writing is closed. This means that if a
separate locking fd is opened on a block device – one that is maintained
independently of the fd actually used for writing – it must be opened for
writing too, so that closing the lock definitely triggers a rescan. This
matters in cases where the lock fd is kept for longer than the fd used
for writing to disk. (Because otherwise udev might get the
IN_CLOSE_WRITE event, but when it tries to rescan will find the device
locked, and never retry because no IN_CLOSE_WRITE is triggered anymore.)
Let's fix that across the codebase, at 4 places:
1. in makefs (a lock fd is kept, and mkfs then invoked as child, which
uses a different fd, and the lock fd is closed only once the child
died)
2. in udevadm lock (embarrassing!), which is intended to be used to wrap tools
that modify disk contents, very similar to the makefs case. The lock
is also kept until after the tool exited.
3. in storagetm: the kernel nvme-tcp layer writes to the device
directly, we just keep a lock fd.
4. the example in BLOCK_DEVICE_LOCKING.md
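The rule for a separate lock fd can be sketched like this. It is a minimal Python illustration, not the code touched by this change; the helper name is made up, and the choice of extra open flags is an assumption.

```python
import fcntl
import os

def open_lock_fd(path):
    """Open a separate lock fd on `path` and take an exclusive BSD lock."""
    # O_RDWR rather than O_RDONLY: only closing an fd that was opened for
    # writing generates IN_CLOSE_WRITE, so closing this fd prompts a rescan.
    fd = os.open(path, os.O_RDWR | os.O_CLOEXEC | os.O_NONBLOCK)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)
    except BaseException:
        os.close(fd)
        raise
    return fd
```

A caller would keep the returned fd open while some other fd (or a child process) writes to the device, and call `os.close()` on it only once all writing is done.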
Although extremely unlikely, there is a race present in solely checking the
$LISTEN_PID environment variable, due to PID recycling. Fix that by introducing
$LISTEN_PIDFDID, which contains the 64-bit ID of a pidfd for the child process
that is not subject to recycling.
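On the receiving side, the check could look roughly like the sketch below. This is an assumption about how a consumer might combine the two variables, not the actual sd_listen_fds() logic; how the process obtains its own pidfd ID is left out and passed in as a parameter.

```python
def listen_env_is_for_us(environ, own_pid, own_pidfd_id=None):
    """Decide whether the passed file descriptors are meant for this process."""
    pid = environ.get("LISTEN_PID")
    if pid is None or int(pid) != own_pid:
        return False
    pidfd_id = environ.get("LISTEN_PIDFDID")
    if pidfd_id is None or own_pidfd_id is None:
        # Variable unset or ID unavailable: fall back to the PID-only check
        # (assumed fallback behaviour).
        return True
    # The pidfd ID is not recycled, so a mismatch means the variables were
    # set for an earlier process that happened to have the same PID.
    return int(pidfd_id) == own_pidfd_id
```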
By giving priority to --background= we prevent users from opting
out of coloring if an explicit color is chosen by a tool wrapping
one of our own tools. Instead, let's give priority to the environment
variable, so that users can still opt out of coloring even if our
tools are wrapped by another tool that picks a different background.
The environment variable has a high chance of being forwarded to the
invocation of our own tools, which makes it an easy way to disable
color tinting globally if requested by the user.
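The intended precedence can be sketched as follows. The variable name `EXAMPLE_TINT_BACKGROUND` is a placeholder for illustration only, since the message above does not name the variable, and the set of accepted "false" spellings is likewise an assumption.

```python
FALSE_VALUES = {"0", "no", "n", "false", "f", "off"}

def effective_background(cli_background, environ):
    """Return the background color to use, or None for no tinting."""
    # Placeholder variable name, not the real one.
    value = environ.get("EXAMPLE_TINT_BACKGROUND")
    if value is not None and value.strip().lower() in FALSE_VALUES:
        # The user's opt-out wins, even over an explicit --background=
        # passed by a wrapping tool.
        return None
    return cli_background
```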
0x1770 is 6000, not 60000. It looks like 60000 is intended (the next
range starts at 60000 in both decimal and hex), so use that.
1000 to 60000 is 59001 users, as the range is inclusive on both sides.
Similar off-by-one for one of the "unused" ranges. After these changes,
the sizes of the ranges up to and including the "-1" ID sum up to 65536,
as expected.
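The arithmetic above can be double-checked with a few lines of Python (the helper name is only for illustration):

```python
def inclusive_size(first, last):
    # Number of IDs in a range that includes both endpoints.
    return last - first + 1

assert 0x1770 == 6000                         # the old hex value was off by a factor of ten
assert 0xEA60 == 60000                        # the hex value matching decimal 60000
assert inclusive_size(1000, 60000) == 59001   # not 59000
assert inclusive_size(0, 0xFFFF) == 65536     # everything up to and including the "-1" ID
```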
I'm not sure where the size of the unused range after the container UID
range came from, but it is not correct (the "Container UID" and this
reserved range combined would be larger than the "HIC SVNT LEONES" 2^31
to 2^32-2 range...). Fix it.
It is unfortunate that the first half of this table makes more sense in
decimal while the second half makes more sense in hex (which would also
make the size in 65536 chunks easy to obtain): I'm tempted to add a
"sizes in hex" column...
Let's not leak details from src/shared and src/libsystemd into
src/basic, even though you can't actually do anything useful with
just forward declarations from src/shared.
The sd-forward.h header is put in src/libsystemd/sd-common as we
don't have a directory for shared internal headers for libsystemd
yet.
Let's also rename forward.h to basic-forward.h to keep things
self-explanatory.
Let's reduce our attack surface by insisting that XBOOTLDR is vfat when
auto-probing, just like we do for the ESP. Given neither can
realistically be integrity protected (because firmware needs to access
them) let's insist on vfat, which has a much smaller attack surface,
and one we have to accept (for now) anyway, given that the ESP must be
VFAT.
This only applies to auto-probing of course. If people mount things
explicitly via fstab none of this matters. But we really shouldn't
automount a btrfs/xfs/ext4 partition as XBOOTLDR just because it looks
like one, as that would really defeat our otherwise possibly very strict
image policies.
This also introduces a new $SYSTEMD_DISSECT_FSTYPE_<DESIGNATOR>
environment variable that may override this hardcoding. This is
particularly useful in our test cases, since several of them actually do
use ext4 for XBOOTLDR. The tests are updated to make use of the new
variable, both as a mechanism to test this and to keep the tests working.
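As a rough illustration of the override, consider the sketch below. It assumes that the designator is spelled in upper case in the variable name and that the value is a single file system type name; neither detail is spelled out above, and the function names are made up.

```python
# Hardcoded expectations when auto-probing, per the description above.
DEFAULT_FSTYPE = {"esp": "vfat", "xbootldr": "vfat"}

def required_fstype(designator, environ):
    """Return the file system type required for a designator, or None."""
    # Assumed variable naming and value format, see the lead-in.
    override = environ.get("SYSTEMD_DISSECT_FSTYPE_" + designator.upper())
    if override:
        return override
    return DEFAULT_FSTYPE.get(designator)

def may_automount(designator, detected_fstype, environ):
    required = required_fstype(designator, environ)
    return required is None or detected_fstype == required
```

With this, a test that uses ext4 for XBOOTLDR would set `SYSTEMD_DISSECT_FSTYPE_XBOOTLDR=ext4` to keep working.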
Sharing verity volumes is problematic for a variety of reasons, for
example because it might pin the wrong backing device at the wrong time.
Let's hence turn this around: unless verity sharing is enabled, leave it
off, and turn $SYSTEMD_VERITY_SHARING into a true boolean that can be
set both ways.
The primary usecase for verity sharing is RootImage=, where it probably
makes sense to leave on, hence set the flag there.
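The new semantics could be sketched like this. It is a Python illustration only; the accepted spellings mirror systemd's usual boolean parsing, while the function name, the `default` parameter, and the idea that the variable overrides the per-caller default in both directions are assumptions.

```python
TRUE_VALUES = {"1", "yes", "y", "true", "t", "on"}
FALSE_VALUES = {"0", "no", "n", "false", "f", "off"}

def verity_sharing_enabled(environ, default=False):
    """Return whether verity volumes should be shared."""
    raw = environ.get("SYSTEMD_VERITY_SHARING")
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    # Unparsable value: keep the caller's default (assumed behaviour).
    return default
```

A caller such as the RootImage= code path would pass `default=True`, everything else would leave it off.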
This is crucial when putting together installers which install an OS on
a second disk: if verity sharing is always on we might mount the wrong
one of the two disks at the wrong time.
Follow-up to #37024.
This implements (mostly) what was suggested there, except that only a
single UUID is accepted (modifying things to support multiple is a
relatively straightforward change from here).
I'm not really convinced this is the right approach:
* I can't really think of any cases where you'd need to query by
multiple UUIDs (I guess you might want to look up multiple users, but in
that case why aren't there "usernames" or "uids" arrays?)
* If I specify username "foo" and UID 1234 and UID 1234 exists and has
username "bar", I get back the error `ConflictingRecordFound`
* If I specify username "foo" and UUID abcdef... and username "foo"
exists but has UUID 123456..., I get back the error
`NonMatchingRecordFound`
This makes the two ID types behave differently.
Additionally, when querying by `uuid`, the multiplexer will always send
`more: true`, which is fine but a little unexpected.
I do think unifying things through the `UserDBMatch` struct could make
sense, but in that case I think it would make sense to unify all query
types in that way (username, uid, uuid), identify when the filter is for
a single or multiple records, and centralise determination of conflict
vs non matching record errors.
`userdb_by_name`/`userdb_by_uid` could then become helper functions for
the simple case where no additional filtering is needed.
Thoughts?
One other thought: Should the multiplexer just pass through all
parameters, even unknown ones, to the backend services? Even if it
doesn't know how to filter by every property, the backends might, and it
would be useful to allow them to optimise things. (I realise the
disadvantage of this, ofc, is loss of error checking)
Container managers may want to bind mount the root filesystem
somewhere within the container. Security-wise, this is very much not
recommended, but it may be something application containers may want
to do nonetheless.
Ref: https://github.com/flatpak/flatpak/pull/6125#issuecomment-2759378603
Those lists were partially wrong and partially outdated. We should generate
this document automatically, but let's revisit this topic after the conversion
to sphinx. For now, as a stop-gap solution, I generated the lists from
the new 'systemd-analyze transient-settings' command.
Currently, when running "systemctl preset-all --root=xxx" in mkosi
to enable/disable units for initrds, the system presets are used.
The problem with this approach is that the system presets are written
for the system, which is not necessarily ideal for an initrd. We still
want to use the same packages in the initrd that we install in the
system, so let's introduce a separate directory for initrd presets,
from which preset files are picked up when we detect that we're
configuring an initrd (by looking for /etc/initrd-release).
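The detection can be sketched as below. This is a Python illustration, not the actual code; the function names are made up, and the directory name `initrd-preset` is an assumption chosen by analogy with the existing `system-preset`.

```python
import os

def is_initrd(root="/"):
    # An initrd tree is recognized by the presence of /etc/initrd-release.
    return os.path.exists(os.path.join(root, "etc/initrd-release"))

def preset_dir_name(root="/"):
    # Pick the preset directory flavour for the tree being configured.
    return "initrd-preset" if is_initrd(root) else "system-preset"
```

With `--root=xxx`, the check applies to the tree below that root rather than to the running system.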
We also introduce a systemd preset file for the initrd, which is based on
the system one, except with all the stuff unnecessary for the initrd removed.
[1] says:
> Since 0.60.0 the name argument is optional and defaults to the basename of
> the first output
We specify >= 0.62 as the supported version, so drop the duplicate name in all cases
where it is the same as outputs[0], i.e. almost all cases.
[1] https://mesonbuild.com/Reference-manual_functions.html#custom_target