In containers securityfs is typically not mounted. Our lsm-bpf code
so far detected this situation and claimed the kernel was lacking
lsm-bpf support. That isn't quite true though: the kernel might very well
support it. This made boots of systemd in systemd-nspawn a bit ugly,
because of the misleading log message at boot.
Let's improve things, and make clearer what is going on.
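
The distinction can be sketched as a three-way check. This is a minimal
illustration, not the actual systemd code: the helper name
`lsm_bpf_status` and its return strings are hypothetical. It only assumes
that securityfs, when mounted, exposes the comma-separated list of active
LSMs in a file (conventionally `/sys/kernel/security/lsm`).

```python
def lsm_bpf_status(lsm_path="/sys/kernel/security/lsm"):
    """Return "supported", "unsupported" or "unknown" (hypothetical helper)."""
    try:
        with open(lsm_path, "r", encoding="ascii") as f:
            active = f.read().strip().split(",")
    except FileNotFoundError:
        # securityfs is probably not mounted (typical in containers), so we
        # cannot tell. Do not claim the kernel lacks lsm-bpf support.
        return "unknown"
    return "supported" if "bpf" in active else "unsupported"
```

The point of the third state is that the log message can say "cannot
determine" instead of "not supported" when the file is simply absent.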
This establishes the basic concepts for #35685, in the hope of getting this
merged first.
This defines a special, fixed 64K UID range that is supposed to be used
by directory container images on disk, that is mapped to a dynamic UID
range at runtime (via idmapped mounts).
This enables a world where each container can run with a dynamic UID
range, but this in no way leaks onto the disk, thus making supposedly
dynamic, transient UID range assignments persistent.
This is infrastructure later used for the primary part of #35685:
unprivileged container execution with directory images inside users' home
directories, which are assigned to this special "foreign UID range".
This PR only defines the ranges, synthesizes NSS records for them via
userdb, and then exposes them in a new "systemd-dissect --shift" command
that can re-chown a container directory tree into this range (and in
fact any range).
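
The arithmetic behind such a shift can be sketched as follows. This is an
illustration only, assuming that a shift keeps each UID's offset within its
64K range and replaces the range base. The name `shift_uid` and the
alignment check are hypothetical and not taken from systemd-dissect.

```python
UID_RANGE_SIZE = 0x10000  # 64K UIDs per range, as described above

def shift_uid(uid, target_base):
    """Map a UID into the 64K range starting at target_base, keeping its offset."""
    if target_base % UID_RANGE_SIZE != 0:
        # Assumption of this sketch: ranges start at 64K-aligned bases.
        raise ValueError("target base must be 64K-aligned")
    return target_base + (uid % UID_RANGE_SIZE)
```

With a purely illustrative base of `0x20000`, UID 0 maps to `0x20000` and
UID 1000 maps to `0x20000 + 1000`, whichever range the file was owned by
before.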
This comes with docs. But no tests. There are tests in #35685 that cover
all this, but they are more comprehensive and also test nspawn's hook-up
with this, hence are excluded from this PR.
If we unexpectedly disconnect from the bus, systemd would end up dropping
the list of subscribers, which breaks the ability of clients like logind
to monitor the state of units.
Stash the list of subscribers into the deserialized state in the event
of a disconnect so that when we recover we can renew the broken
subscriptions.
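
The stash-and-renew idea can be sketched like this. It is a toy model with
hypothetical names (`SubscriptionTracker`, `on_disconnect`,
`on_reconnect`), not the manager's actual data structures or
serialization format.

```python
class SubscriptionTracker:
    """Toy model: keep subscriber names across a bus disconnect."""

    def __init__(self):
        self.subscribers = set()  # live subscriptions on the current connection
        self.stashed = set()      # names saved while the bus is gone

    def subscribe(self, client):
        self.subscribers.add(client)

    def on_disconnect(self):
        # Instead of dropping the list, stash it for later.
        self.stashed |= self.subscribers
        self.subscribers.clear()

    def on_reconnect(self):
        # Renew the subscriptions that were broken by the disconnect.
        self.subscribers |= self.stashed
        self.stashed.clear()
```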
---
Fixes: #8672 #26744
This makes the UID range configurable via build time options, but of
course it really shouldn't be changed. The default range I picked is
outside even of IPA's current (ridiculously large) allocation ranges,
hence hopefully minimizes conflicts.
We enforce quite strict rules on naming the userns we assign UID ranges to
for users. So strict that they are hard to get right for clients. Hence,
let's optionally mangle provided strings so that they work for us.
This should make it much easier to work with the API, as something
reasonable happens regardless of what kind of garbage a client sets as
name.
Mangling the name is opt-in for clients, so that a client can either keep
tight control of the name, or go "fire and forget".
Avoid showing the files on the ESP (i.e. a FAT formatted volume) as
executable by removing the execute permission from them.
IMO this makes the colored output of `ls` more sensible since the file
system will be mounted with `noexec` anyway.
Add a `fstype_can_fmask_dmask` function that checks if a file system
type can use the `fmask` and `dmask` mount options.
This replaces `fstype_can_umask` since it was only used in
`partition_pick_mount_options` which only cares about the file system
support for fmask & dmask now.
It somewhat reduces the coverage of the feature since there are more file
systems that support umask as opposed to those supporting fmask & dmask,
but it should not be much of an issue since fmask & dmask are supported
by vfat, exfat and ntfs3.
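
A minimal sketch of the check, plus the permission arithmetic behind
dropping the execute bit. The set of file systems is the one named above.
The helper `effective_mode` and the mask values used in the usage note are
illustrative assumptions, not the options systemd actually passes.

```python
# File system types named above as supporting fmask= and dmask=.
FMASK_DMASK_FSTYPES = frozenset({"vfat", "exfat", "ntfs3"})

def fstype_can_fmask_dmask(fstype):
    """True if the file system type accepts the fmask/dmask mount options."""
    return fstype in FMASK_DMASK_FSTYPES

def effective_mode(mask):
    """Permission bits left over after applying a umask-style mask."""
    return 0o777 & ~mask
```

For example, an assumed `fmask=0177` leaves regular files at `0600` (no
execute bit), while an assumed `dmask=0077` leaves directories at `0700`,
so they stay traversable.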
Let's optionally mangle any passed name on the server side so that it is
useful for identifying a userns, if it isn't suitable for that
right away. This mostly means truncating it if too long.
It's just too nasty to leave this to the client side, since they'd have
to understand the precise rules for naming userns then.
While we are at it, add full Varlink IDL comments.
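
What server-side mangling could look like, as a rough sketch. The precise
naming rules are not spelled out here, so the allowed character set, the
length limit `NAME_MAX` and the function name are all hypothetical. Only
the "truncate if too long" part comes from the text above.

```python
import re

NAME_MAX = 16  # hypothetical limit, not the real one

def mangle_userns_name(name, max_len=NAME_MAX):
    """Turn an arbitrary string into something usable as a userns name."""
    # Assumed rule: replace anything outside a conservative character set.
    mangled = re.sub(r"[^A-Za-z0-9_-]", "_", name)
    # Truncate if too long, as described above.
    mangled = mangled[:max_len]
    # Assumed rule: never return an empty name.
    return mangled or "_"
```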
This is something I think we should have added a long time ago: a
flavour of open() that safely ensures the inode we are opening is a
regular file, before we open it. It does this by means of pinning the
inode via O_PATH first, and after verification actually opening it.
This ports some code over to this, but sooner or later we should
probably use this a lot more, so that we don't accidentally open weird
stuff such as device nodes or pipes, where we should not.
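
The pin-verify-reopen pattern can be sketched as follows. This is a
Linux-only illustration, not the helper added by this change: the name
`open_regular` is hypothetical, and reopening through `/proc/self/fd/`
assumes procfs is mounted.

```python
import errno
import os
import stat

def open_regular(path, flags=os.O_RDONLY):
    """Open path only if it is a regular file; return a new file descriptor."""
    # Step 1: pin the inode. O_PATH does not really open it, so this cannot
    # block on a FIFO or trigger side effects on a device node.
    pin = os.open(path, os.O_PATH | os.O_CLOEXEC)
    try:
        # Step 2: verify the type of the pinned inode.
        if not stat.S_ISREG(os.fstat(pin).st_mode):
            raise OSError(errno.EBADFD, "not a regular file", path)
        # Step 3: actually open the same inode via the pinned descriptor.
        return os.open(f"/proc/self/fd/{pin}", flags | os.O_CLOEXEC | os.O_NOCTTY)
    finally:
        os.close(pin)
```

Because the check and the open both operate on the pinned inode, the file
cannot be swapped for something else between the two steps.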
This corrects the final sequence of the ConEmu progress reporting. By
mistake we sent two trailing semicolons (";;") where only one was
expected. The terminals I tested this with didn't care, but Ghostty
apparently does. Let's fix things and generate the closing sequence as
per doc:
https://conemu.github.io/en/AnsiEscapeCodes.html#ConEmu_specific_OSC
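
For illustration, the sequences can be built like this. The sketch follows
my reading of the linked documentation (`ESC ] 9 ; 4 ; st ; pr ST`). The
function name is hypothetical, and using `ESC \` as string terminator is an
assumption of this sketch.

```python
ESC = "\x1b"
OSC = ESC + "]"    # operating system command introducer
ST = ESC + "\\"    # string terminator (assumed; BEL is also commonly accepted)

def conemu_progress(state, value=None):
    """Build a ConEmu progress sequence: OSC 9 ; 4 ; state ; value ST."""
    payload = "" if value is None else str(value)
    return f"{OSC}9;4;{state};{payload}{ST}"
```

`conemu_progress(0)` yields the closing sequence with exactly one trailing
semicolon after the state, and `conemu_progress(1, 42)` reports 42%.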
Supersedes #35308 (cherry-picked one commit and replaced the rest)
(I left a few comments that are folded by GitHub. Please make sure to
check them too.)
- Make fd_is_namespace() take NamespaceType
- Drop support for kernels without NS_GET_NSTYPE (< 4.11)
- Port is_our_namespace() to namespace_open_by_type()
(preparation for later commits, where the latter
would go by pidfd if available, avoiding procfs)
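
A rough sketch of a type check based on NS_GET_NSTYPE. It is not the
systemd implementation: the real function takes a NamespaceType, while this
illustration takes a CLONE_NEW* constant directly. The numeric values are
the ones from the Linux UAPI headers as far as I know, so treat them as
assumptions.

```python
import errno
import fcntl

NS_GET_NSTYPE = 0xB703     # _IO(0xb7, 0x3) from <linux/nsfs.h>
CLONE_NEWUSER = 0x10000000
CLONE_NEWNET = 0x40000000

def fd_is_namespace(fd, nstype):
    """True if fd refers to a namespace of the given CLONE_NEW* type."""
    try:
        # The ioctl returns the CLONE_NEW* constant of the namespace.
        return fcntl.ioctl(fd, NS_GET_NSTYPE) == nstype
    except OSError as e:
        if e.errno == errno.ENOTTY:
            # Not a namespace fd at all.
            return False
        raise
```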
tpm2_parse_pcr_argument_to_mask() is supposed to parse a PCR mask
string, and uses the full blown tpm2_parse_pcr_argument() call at its
core, which parses more than just a mask, i.e. values and algorithms
too. This is very confusing at times, because commands such as
"systemd-cryptenroll --tpm2-device=auto
--tpm2-public-key-pcrs=1:sha1=09dbdbc7f6cdd8029cc90b57a915c19a0ac21bce"
suggest enrollment with a specific algorithm and hash value, but this is
not in fact what happens: both are entirely ignored.
That this was accepted this way was more an accident than intended,
which is already visible in the fact that the extensive test case entirely
ignores the fact that strings like this are accepted.
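
A strict mask-only parser might look like the sketch below. It assumes the
intent is to reject anything beyond plain PCR indexes, that indexes are
separated by "+" or ",", and that there are 24 PCRs. The function name is
hypothetical.

```python
TPM2_PCRS_MAX = 24  # assumed number of PCRs

def parse_pcr_mask(s):
    """Parse e.g. "0+7" into a bit mask; reject algorithms and hash values."""
    mask = 0
    for item in s.replace(",", "+").split("+"):
        # Anything like "1:sha1=..." fails here instead of being ignored.
        if not (item.isascii() and item.isdigit()):
            raise ValueError(f"not a plain PCR index: {item!r}")
        index = int(item)
        if index >= TPM2_PCRS_MAX:
            raise ValueError(f"PCR index out of range: {index}")
        mask |= 1 << index
    return mask
```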
If TPM2_ALG_ERROR (aka "0") is specified as algorithm in
tpm2_pcr_values_to_mask() we'll simply match all algorithms. This allows
us to shorten tpm2_parse_pcr_argument_to_mask() a bit. The function
accepts but ignores a hash algorithm specification currently, hence this
should not really have much effect.
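
The wildcard behaviour can be sketched as follows. This is a simplified
model in which PCR values are reduced to hypothetical `(index, algorithm)`
pairs. The algorithm IDs in the usage note are the TCG-assigned ones as I
remember them, so treat them as assumptions.

```python
TPM2_ALG_ERROR = 0  # doubles as "match any algorithm" in this sketch

def pcr_values_to_mask(values, algorithm):
    """Build a PCR bit mask from (index, algorithm) pairs."""
    mask = 0
    for index, alg in values:
        # TPM2_ALG_ERROR acts as a wildcard, otherwise the algorithm must match.
        if algorithm in (TPM2_ALG_ERROR, alg):
            mask |= 1 << index
    return mask
```

With values `[(1, 0x4), (7, 0xB)]` (SHA-1 and SHA-256), asking for `0xB`
selects only PCR 7, while asking for `TPM2_ALG_ERROR` selects both.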
This turns systemd-ask-password into a small Varlink service, so that
there's a standard IPC way to ask for a password. It mostly directly
exposes the functionality of the Varlink service.
This new field allows specification of an fd on which the password
prompt logic will look for POLLHUP events, and if one is seen will abort
the query.
The use case for this is that when we query for a password on behalf of a
Varlink client we can abort the query automatically if the client dies.
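
The hang-up detection itself can be sketched with plain poll(). This is an
illustration with a hypothetical helper name. In the real prompt logic the
fd would presumably be watched alongside the other event sources rather
than polled on its own.

```python
import select

def hangup_seen(fd, timeout_ms=0):
    """True if poll() reports POLLHUP on fd within the timeout."""
    poller = select.poll()
    # POLLHUP is reported regardless of the requested events.
    poller.register(fd, select.POLLHUP)
    return any(ev & select.POLLHUP for _, ev in poller.poll(timeout_ms))
```

With a pipe, for example, the read end reports POLLHUP once the write end
has been closed, which is the signal used to abort the query.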