I was looking into a question posed in one of the Fedora discussion threads:
is it OK for a package to assume that files in different directories under /usr
are always on the same mount point? rpmlint emits a warning if a package has
files that are hardlinked between directories, i.e. rpmlint thinks that this
is not the case. But in practice, our systems are like this and our tooling
generally doesn't expect a part of /usr to be separated out. I looked at the
MOUNT_REQUIREMENTS document, but it doesn't answer this question clearly.
It was clearly written with the assumption that e.g. "/usr/" or "/var/" are one
mount point, so when it is "mounted", all of it is available. But the document
also talks about submounts being pulled in through requirements on specific
units, which requires some mounts not to be mounted all at once, so the reader
is left without any direct answer to this question.
This rewrite makes the following changes:
- rename "generally three categories of requirements" to
"three general categories of mount points" because we're categorizing
mount points, not requirements.
- always repeat the category name in further mentions,
e.g. "2/early" instead of just "2" so the reader doesn't have to jump
back to the table when reading.
- mention that it is OK for a mount point to be not split out
- say that submount which is "conceptually separate" may be mounted
later.
- say "ephemeral system" instead of "stateless system" and split out
the description of those systems into a separate paragraph and clearly
state that they are an exception that skips the requirements listed in
this document.
- be consistent in specifying the boundary before which each category must
have been mounted. Previously, cat. 1 was described as "before transisition"
and cat. 2 was described as "during early boot", which created the additional
problem that later we needed to contradict this saying that "must be mounted
during early boot" doesn't actually mean that and this can be done ealier.
If we say "before end of early boot", we avoid this awkwardness.
See https://gitlab.gnome.org/GNOME/glib/-/issues/2931 for the changes in
GLib upstream. Using `GMemoryMonitor` is now more compliant with the
systemd recommended approach, but it needs further work to read the
recommended environment variables rather than unconditionally accessing
the per-cgroup PSI kernel file directly.
Signed-off-by: Philip Withnall <pwithnall@gnome.org>
We do this in a separate service (rather than inside of
systemd-tpm2-setup), since we want failures of this measurement to
result in an instant reboot, like for most our measurements.
Failures to initialize nvpcrs, or allocate an SRK are somewhat OK (and
more likely), as long as this separator communicates clearly where they
have to have taken place, if they worked.
This locks down NvPCR initilization a bit more: we'll measure each
initialization of an NvPCR into PCR 9, thus chaining the NvPCRs to the
PCR set. After all NvPCRs are initialized we measure a barrier into PCR
9 as well.
This ensures that later additions of NvPCRs are clearly recognizable and
distuingishable from those done at boot.
Load modules in parallel using a pool of worker threads. The number of
threads is equal to the number of CPUs, with a maximum of 16 (to avoid
too many threads being started during boot on systems with many an high
core count, since the number of modules loaded on boot is usually on
the small side).
The number of threads can optionally be specified manually using the
SYSTEMD_MODULES_LOAD_NUM_THREADS environment variable; in this case,
no limit is enforced. If SYSTEMD_MODULES_LOAD_NUM_THREADS is set to 0,
probing happens sequentially.
Co-authored-by: Eric Curtin <ecurtin@redhat.com>
This has been tripping up container manager people. let's document this
explicitly.
(Note that the container interface could really use some updates, i.e.
it was written before a time where cgroup namespacing was a thing. But I
am too lazy to fix that now, so let's just add this once facet.)
udev's block device locking protocol has one pitfall not even the
example in the documentation got right so far (even though this is
explained in all detail above): udev's rescanning is only triggered when
an fd that is opened for writing is closed. This means that if a
separate locking fd is opened on a block device – one that is maintained
independently of the fd actually used for writing – it must be opened for
writing too, so that closing the lock definitely triggers a rescan. This
matters in cases where the lock fd is kept for longer than the fd used
for writing to disk. (Because otherwise udev might get the
IN_CLOSE_WRITE event, but when it tries to rescan will find the device
locked, and never retry because no IN_CLOSE_WRITE is triggred anymore.)
Let's fix that across the codebase, at 4 places:
1. in makefs (a lock fd is kept, and mkfs then invoked as child, which
uses a different fd, and the lock fd is closed only once the child
died)
2. in udevadm lock (embarassing!): which is intended to be used to wrap tools
that modify disk contents, very similar to the makefs case. The lock
is also kept until after the tool exited.
3. In storagetm: the kernel nvme-tcp layer writes to the device
directly, we just keep a lock fd.
4. the example in BLOCK_DEVICE_LOCKING.md
Although extremely unlikely, there is a race present in solely checking the
$LISTEN_PID environment variable, due to PID recycling. Fix that by introducing
$LISTEN_PIDFDID, which contains the 64-bit ID of a pidfd for the child process
that is not subject to recycling.
By giving priority to --background= we prevent users from opting
out of coloring if an explicit color is chosen by a tool wrapping
one of our own tools. Instead, let's give priority to the environment
variable, so that even if our tools are wrapped by another tool with
a different background, users can still opt out of coloring just by
setting the environment variable, which has a high chance of being
forwarded to the invocation of our own tools which makes it easy to
use to disable color tinting globally if requested by the user.
0x1770 is 6000, not 60000. It looks like 60000 is intended (the next
range starts at 60000 in both decimal and hex), so use that.
1000 to 60000 is 59001 users, as the range is inclusive on both sides.
Similar off-by-one for one of the "unused" ranges. After these changes,
the sizes of the ranges up to and including the "-1" ID sum up to 65536,
as expected.
I'm not sure where the size of the unused range after the container UID
range came from, but it is not correct (the "Container UID" and this
reserved range combined would be larger than the "HIC SVNT LEONES" 2^31
to 2^32-2 range...). Fix it.
It is unfortunate that the first half of this table makes more sense in
decimal while the second half makes more sense in hex (which would also
make the size in 65536 chunks easy to obtain): I'm tempted to add a
"sizes in hex" column...
Let's not leak details from src/shared and src/libsystemd into
src/basic, even though you can't actually do anything useful with
just forward declarations from src/shared.
The sd-forward.h header is put in src/libsystemd/sd-common as we
don't have a directory for shared internal headers for libsystemd
yet.
Let's also rename forward.h to basic-forward.h to keep things
self-explanatory.
Let's reduce our attack surface by insisting that XBOOTLDR is vfat when
auto-probing, just like we do for the ESP. Given neither can
realistically be integrity protected (because firmware needs to access
them) let's insist on a vfat which has a much smaller attack surface,
and one we have to accept (for now) anyway, given that the ESP must be
VFAT.
This only applies to auto-probing of course. If people mount things
explicitly via fstab none of this matters. But we really shouldn't
automount a btrfs/xfs/ext4 partition as XBOOTLDR just because it looks
like one, as that would really defeat our otherwise possibly very strict
image policies.
This also introduces a new env var $SYSTEMD_DISSECT_FSTYPE_<DESIGNATOR>
environment variable that may override this hardcoding. This is in
particular useful in our testcases, since various actually do use ext4
as XBOOTLDR case. The tests are updated to make use of the new env var,
both as a mechanism to test this and to keep the tests working.
Sharing verity volumes is problematic for a veriety of reasons, for
example because it might pin the wrong backing device at the wrong time.
Let's hence turn this around: unless verity sharing is enabled, leave it
off, and turn $SYSTEMD_VERITY_SHARING into a true boolean that can be
set both ways.
The primary usecase for verity sharing is RootImage=, where it probably
makes sense to leave on, hence set the flag there.
This is crucial when putting together installers which install an OS on
a second disk: if verity sharing is always on we might mount the wrong
of the two disks at the wrong time.
Followon to #37024.
This implements (mostly) what was suggested there, except that only a
single UUID is accepted (modifying things to support multiple is a
relatively straightforward change from here)
I'm not really convinced this is the right approach:
* I can't really think of any cases where you'd need to query by
multiple UUIDs (I guess you might want to lookup multiple users, but in
that case why aren't there "usernames" or "uids" arrays?)
* If I specify username "foo" and UID 1234 and UID 1234 exists and has
username "bar", I get back the error `ConflictingRecordFound`
* If I specify username "foo" and UUID abcdef... and username "foo"
exists but has UUID 123456..., I get back the error
`NonMatchingRecordFound`
This makes the two ID types behave differently.
Additionally, when querying by `uuid`, the multiplexer will always sends
`more: true`, which is fine but a little unexpected.
I do think unifying things through the `UserDBMatch` struct could make
sense, but in that case I think it would make sense to unify all query
types in that way (username, uid, uuid), identify when the filter is for
a single or multiple records, and centralise determination of conflict
vs non matching record errors.
`userdb_by_name`/`userdb_by_uid` could then become helper functions for
the simple case where no additional filtering is needed.
Thoughts?
One other thought: Should the multiplexer just pass through all
parameters, even unknown ones, to the backend services? Even if it
doesn't know how to filter by every property, the backends might, and it
would be useful to allow them to optimise things. (I realise the
disadvantage of this, ofc, is loss of error checking)
Container managers may want to bind mount the root filesystem
somewhere within the container. Security-wise, this is very much not
recommended, but it may be something application containers may want
to do nonetheless.
Ref: https://github.com/flatpak/flatpak/pull/6125#issuecomment-2759378603