systemd.service(5)’s documentation of `ExecCondition=` uses “failed” with
respect to the unit active state.
In particular the unit won’t be considered failed when `ExecCondition=`’s
command exits with a status of 1 through 254 (inclusive). It will however, when
it exits with 255 or abnormally (e.g. timeout, killed by a signal, etc.).
The table “Defined $SERVICE_RESULT values” in systemd.exec(5) uses “failed”
however rather with respect to the condition.
Tests seem to have shown that, if the exit status of the `ExecCondition=`
command is one of 1 through 254 (inclusive), `$SERVICE_RESULT` will be
`exec-condition`, if it is 255, `$SERVICE_RESULT` will be `exit-code` (but
`$EXIT_CODE` and `$EXIT_STATUS` will be empty or unset), if it’s killed because
of `SIGKILL`, `$SERVICE_RESULT` will `signal` and if it times out,
`$SERVICE_RESULT` will be `timeout`.
This commit clarifies the table at least for the case of an exit status of 1
through 254 (inclusive).
The others (signal, timeout and 255 are probably also still ambiguous (e.g.
`signal` uses “A service process”, which could be considered as the actual
service process only).
Signed-off-by: Christoph Anton Mitterer <mail@christoph.anton.mitterer.name>
This allows a service to reuse the user namespace created for an
existing service, similarly to NetworkNamespacePath=. The configuration
is the initial user namespace (e.g. ID mapping) is preserved.
Although extremely unlikely, there is a race present in solely checking the
$LISTEN_PID environment variable, due to PID recycling. Fix that by introducing
$LISTEN_PIDFDID, which contains the 64-bit ID of a pidfd for the child process
that is not subject to recycling.
Needed to implement support for RootHashSignature=/RootVerity=/RootHash=
and friends when going through mountfsd, for example with user units,
so that system and user units provide the same features at the same
level
RootDirectory= and other options already implicitly enable PrivateUsers=
since 6ef721cbc7 if they are set in user
units, so that they can work out of the box.
Now with mountfsd support we can do the same for the images settings,
so enable them and document them.
let's skip over credentials we cannot decode when they are found with
ImportCredential=. When installing an OS on some disk and using that
disk on a different machine than assumed we'll otherwise end up with a
broken boot, because the credentials cannot be decoded when starting
systemd-firstboot. Let's handle this somewhat gracefully.
This leaves handling for LoadCredential=/SetCredential= as it is (i.e.
failure to decrypt results in service failure), because it is a lot more
explicit and focussed as opposed to ImportCredentials= which looks
everywhere, uses globs and so on and is hence very vague and unfocussed.
Fixes: #34740
Add a small paragraph explaining how BPF token works, how it's being
created and its relationship between the BPF filesystem.
Move all the relevant documentation in the PrivateBPF= section and let
point all the BPFDelegate* options to that one.
closes#37602, see there for extra motivation and considered
alternatives.
On typical systems, only few services need to create SUID/SGID files.
This often is limited to the user explicitly setting suid/sgid, the
`systemd-tmpfiles*` services, and the package manager. Allowing a
default to globally restrict creation of suid/sgid files makes it easier
to apply this restriction precisely.
## testing done
- built on aarch64-linux and x86_64-linux
- ran a VM test on x86_64-linux, checking for:
- VM system boots successfully
- defaults apply (both `yes`, `no`, and undefined)
- systemd tmpfiles can set suid/sgid on journal log path
- Other services explicitly defining `RestrictSUIDSGID=no` can create
suid files
Add a new option PrivateBPF= to mount a new instance of bpffs within a
namespace.
PrivateBPF= can be set to "no" to use the host bpffs in readonly mode
and "yes" to do a new mount.
The mount is done with the new fsopen()/fsmount() API because in future
we'll hook some commands between the two calls.
Also, if a device ACL list is defined, also go via IPC (instead of
trying to patch it, as before).
The outcome is that the tighter rules continue to apply when configured.
Fixes: #35959
`ExtensionImages=` and `ExtensionDirectories=` now let you specify
vpick-named extensions; however, since they just get set up once when
the service is started, you can't see newer versions without restarting
the service entirely. Here, also reload confext extensions when you
reload a service. This allows you to deploy a new version of some
configuration and have it picked up at reload time without interruption
to your workload.
Right now, we would only reload confext extensions and leave the sysext
ones behind, since it didn't seem prudent to swap out what is likely
program code at reload. This is made possible by only going for the
`SYSTEMD_CONFEXT_HIERARCHIES` overlays (which only contains `/etc`).
This PR:
- Adjusts `service.c` to also refresh extensions when needed.
- Adds integration tests to check that a confext reload actually
occurred.
- Adds to the `systemd.exec` man pages to document this behavior.
This is a follow up to #24864 and #31364. Thank you to @bluca and
@goenkam for help in getting this up.
The existing text grew organically as features were added and was
not very organized. Reorder it and break into paragraphs grouped
by topic. The description of the :errno syntax is replaced by a short
reference to the SystemCallErrorNumber= setting. This makes the
text shorter and makes it easier to explain how the two settings combine.
Currently LogLevelMax= can only be used to decrease the maximum log level
for a unit but not to increase it. Let's make sure the latter works as
well, so LogLevelMax=debug can be used to enable debug logging for specific
units without enabling debug logging globally.
This delegates one or more namespaces to the service. Concretely,
this setting influences in which order we unshare namespaces. Delegated
namespaces are unshared *after* the user namespace is unshared. Other
namespaces are unshared *before* the user namespace is unshared.
Fixes#35369
Let's bump the kernel baseline a bit to 4.3 and thus require ambient
caps.
This allows us to remove support for a variety of special casing, most
importantly the ExecStart=!! hack.
Recently, PrivateUsers=identity was added to support mapping the first
65536 UIDs/GIDs from parent to the child namespace and mapping the other
UID/GIDs to the nobody user.
However, there are use cases where users have UIDs/GIDs > 65536 and need
to do a similar identity mapping. Moreover, in some of those cases,
users want a full identity mapping from 0 -> UID_MAX.
To support this, we add PrivateUsers=full that does identity mapping for
all available UID/GIDs.
Note to differentiate ourselves from the init user namespace, we need to
set up the uid_map/gid_map like:
```
0 0 1
1 1 UINT32_MAX - 1
```
as the init user namedspace uses `0 0 UINT32_MAX` and some applications
- like systemd itself - determine if its a non-init user namespace based
on uid_map/gid_map files.
Note systemd will remove this heuristic in running_in_userns() in
version 258 (https://github.com/systemd/systemd/pull/35382) and uses
namespace inode. But some users may be running a container image with
older systemd < 258 so we keep this hack until version 259 for version
N-1 compatibility.
In addition to mapping the whole UID/GID space, we also set
/proc/pid/setgroups to "allow". While we usually set "deny" to avoid
security issues with dropping supplementary groups
(https://lwn.net/Articles/626665/), this ends up breaking dbus-broker
when running /sbin/init in full OS containers.
Fixes: #35168Fixes: #35425
When trying to run dbus-broker in a systemd unit with PrivateUsers=full,
we see dbus-broker fails with EPERM at `util_audit_drop_permissions`.
The root cause is dbus-broker calls the setgroups() system call and this
is disallowed via systemd's implementation of PrivateUsers= by setting
/proc/pid/setgroups = deny. This is done to remediate potential privilege
escalation vulnerabilities in user namespaces where an attacker can remove
supplementary groups and gain access to resources where those groups are
restricted.
However, for OS-like containers, setgroups() is a pretty common API and
disabling it is not feasible. So we allow setgroups() by setting
/proc/pid/setgroups to allow in PrivateUsers=full. Note security conscious
users can still use SystemCallFilter= to disable setgroups() if they want
to specifically prevent this system call.
Fixes: #35425
This PR allows an option for systemd exec units to enable UTS namespaces
but not restrict changing hostname via seccomp. Thus, units can change
hostname without affecting the host. This is useful for OS-like
containers running as units where they should have freedom to change
their container hostname if they want, but not the host's hostname.
Fixes: #30348
RestrictNamespaces= would accept "time" but would not actually apply
seccomp filters e.g. systemd-run -p RestrictNamespaces=time unshare -T true
should fail but it succeeded.
This commit actually enables time namespace seccomp filtering.
This allows an option for systemd exec units to enable UTS namespaces
but not restrict changing hostname via seccomp. Thus, units can change
hostname without affecting the host.
Fixes: #30348
Recently, PrivateUsers=identity was added to support mapping the first
65536 UIDs/GIDs from parent to the child namespace and mapping the other
UID/GIDs to the nobody user.
However, there are use cases where users have UIDs/GIDs > 65536 and need
to do a similar identity mapping. Moreover, in some of those cases, users
want a full identity mapping from 0 -> UID_MAX.
Note to differentiate ourselves from the init user namespace, we need to
set up the uid_map/gid_map like:
```
0 0 1
1 1 UINT32_MAX - 1
```
as the init user namedspace uses `0 0 UINT32_MAX` and some applications -
like systemd itself - determine if its a non-init user namespace based on
uid_map/gid_map files. Note systemd will remove this heuristic in
running_in_userns() in version 258 and uses namespace inode. But some users
may be running a container image with older systemd < 258 so we keep this
hack until version 259.
To support this, we add PrivateUsers=full that does identity mapping for
all available UID/GIDs.
Fixes: #35168
In the troff output, this doesn't seem to make any difference. But in the
html output, the whitespace is sometimes preserved, creating an additional
gap before the following content. Drop it everywhere to avoid this.