Background: Fedora/RHEL are switching to sysusers.d metadata for
creation of users and groups for system users defined by packages
(https://fedoraproject.org/wiki/Changes/RPMSuportForSystemdSysusers).
Packages carry sysusers files. During package installation, rpm calls an
program to execute on this config. This program may either be
/usr/lib/rpm/sysusers.sh which calls useradd/groupadd, or
/usr/bin/systemd-sysusers. To match the functionality provided by
useradd/groupadd from the shadow-utils project, systemd-sysusers must
emit audit events so that it provides a drop-in replacement.
systemd-sysuers will emit audit events AUDIT_ADD_USER/AUDIT_ADD_GROUP
when adding users and groups. The operation "names" are copied from
shadow-utils, so the format of the events that is generated on success
should be identical. On failure, things are more complicated. We write
the whole file at once, once, so we first generate "success" messages
for each entry, then we try to write the files, and if things fail, we
generate failure messages to all entries that we failed to write.
Background: Fedora/RHEL are switching to sysusers.d metadata for creation of
users and groups for system users defined by packages
(https://fedoraproject.org/wiki/Changes/RPMSuportForSystemdSysusers).
Packages carry sysusers files. During package installation, rpm calls an
program to execute on this config. This program may either be
/usr/lib/rpm/sysusers.sh which calls useradd/groupadd, or
/usr/bin/systemd-sysusers. To match the functionality provided by
useradd/groupadd from the shadow-utils project, systemd-sysusers must emit
audit events so that it provides a drop-in replacement.
systemd-sysuers will emit audit events AUDIT_ADD_USER/AUDIT_ADD_GROUP when
adding users and groups. The operation "names" are copied from shadow-utils in
Fedora (which has a patch to change them from the upstream version), so the
format of the events that is generated on success should be identical.
The helper code is shared between sysusers and utmp-wtmp. I changed the
audit_fd variable to be unconditional. This way we can avoid ugly iffdefery
every time the variable would be used. The cost is that 4 bytes of unused
storage might be present. This is negligible, and the compiler might even be
able to optimize that away if it inlines things.
The arm confidential compute architecture (CCA) provides a platform design for
confidential VMs running in a new realm context.
This can be detected by the existence of a platform device exported for the
arm-cca-guest driver, which provides attestation services via the realm
services interface (RSI) to the Realm Management Monitor (RMM).
Like the other methods systemd uses to detect Confidential VM's, checking
the sysfs entry suggests that this is a confidential VM and should only be
used for informative purposes, or to trigger further attestation.
Like the s390 detection logic, the sysfs path being checked is not labeled
as ABI, and may change in the future. It was chosen because its
directly tied to the kernel's detection of the realm service interface rather
to the Trusted Security Module (TSM) which is what is being triggered by the
device entry. The TSM module has a provider string of 'arm-cca-guest' which
could also be used, but that (IMHO) doesn't currently provide any additional
benefit except that it can fail of the module isn't loaded.
More information can be found here:
https://developer.arm.com/documentation/den0125/0300
Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
(This is useful for the test case added in the next commit, where it's
kinda nice being able to use pidref_safe_fork_full() and acquiring a
pidref of the child in the child in one go. There's no other value in
this than a bit of synctactic sugar for that test. But otoh thre's no
good reason to prohibit FORK_WAIT use like this, hence either way, this
commit should be a good thing.)
This is establish the basic concepts for #35685, in the hope to get this
merged first.
This defines a special, fixed 64K UID range that is supposed to be used
by directory container images on disk, that is mapped to a dynamic UID
range at runtime (via idmapped mounts).
This enables a world where each container can run with a dynamic UID
range, but this in no way leaks onto the disk, thus making supposedly
dynamic, transient UID range assignments persistent.
This is infrastructure later used for the primary part of #35685: unpriv
container execution with directory images inside user's home dirs, that
are assigned to this special "foreign UID range".
This PR only defines the ranges, synthesizes NSS records for them via
userdb, and then exposes them in a new "systemd-dissect --shift" command
that can re-chown a container directory tree into this range (and in
fact any range).
This comes with docs. But no tests. There are tests in #35685 that cover
all this, but they are more comprehensive and also test nspawn's hook-up
with this, hence are excluded from this PR.
This makes sure when we are blocking signals in preparation for fork()
we'll not temporarily unblock any signals previously set, by mistake.
It's safe for us to block more, but not to unblock signals already
blocked. Fix that.
Fixes: #35470
This makes the UID range configurable via build time options, but of
course it really shouldn't be changed. The default range I picked is
outside even of IPAs current (ridiculously large) allocation ranges,
hence hopefully minimizes conflicts.
Avoid showing the files on the ESP (i.e. a FAT formatted volume) as
executable by removing the execute permission from them.
IMO this makes the colored output of `ls` more sensible since the file
system will be mounted with `noexec` anyway.
Add a `fstype_can_fmask_dmask` function that checks if a file system
type can use the `fmask` and `dmask` mount options.
This replaces `fstype_can_umask` since it was only used in
`partition_pick_mount_options` which only cares about the file system
support for fmask & dmask now.
It somewhat reduces the coverage of the feature since there are more file
systems that support umask as opposed to those supporting dmask & dmask,
but it should not be much of an issue since fmask & dmask are supported
by vfat, exfat and ntfs3.
This makes sure the the codepath that derives an nsfd from a pid works
the same for the pidfd case and the non-pidfd case: if we can verify
that /proc/ is mounted but the /proc/$PID/ns/ files are missing, we can
assume the ns type is not supported by the kernel. Hence return the same
ENOPKG error in this case as we already do in the pidfd ioctl based
codepath.
As requested, a list of kernel version to feature mapping
for kernels older than minimum baseline is also included,
in order to ease potential backport work.
This is something I think we should have added a long time ago: a
flavour of open() that safely ensures the inode we are opening is a
regular file, before we open it. It does this by means of pinning the
inode via O_PATH first, and after verification actually opening it.
This ports some code over to this, but sooner or later we should
probably use this a lot more, so that we don't accidentally open weird
stuff such as device nodes or pipes, where we should not.
Supersedes #35308 (cherry-picked one commit and replaced the rest)
(I left a few comments that's folded by GitHub. Please make sure to
check them too.)