680 Commits

Author SHA1 Message Date
Lennart Poettering
fc3adbbbcb man: always prefix links to uapi specs with their UAPI.XY spec number
Let's try to establish the spec numbers, by mentioning them in most doc
links.

Follow-up for: https://github.com/uapi-group/specifications/pull/187
2025-11-23 18:09:11 +01:00
Christoph Anton Mitterer
07f4718242 man: clarify what “failed” means
systemd.service(5)’s documentation of `ExecCondition=` uses “failed” with
respect to the unit active state.
In particular the unit won’t be considered failed when `ExecCondition=`’s
command exits with a status of 1 through 254 (inclusive). It will, however, when
it exits with 255 or abnormally (e.g. timeout, killed by a signal, etc.).

The table “Defined $SERVICE_RESULT values” in systemd.exec(5), however, uses
“failed” with respect to the condition.

Tests seem to have shown that, if the exit status of the `ExecCondition=`
command is one of 1 through 254 (inclusive), `$SERVICE_RESULT` will be
`exec-condition`, if it is 255, `$SERVICE_RESULT` will be `exit-code` (but
`$EXIT_CODE` and `$EXIT_STATUS` will be empty or unset), if it’s killed because
of `SIGKILL`, `$SERVICE_RESULT` will be `signal` and if it times out,
`$SERVICE_RESULT` will be `timeout`.
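
The observations above can be summarised as a small lookup. This is only a sketch of the tested behaviour, not systemd code; the function name and parameters are made up for illustration.

```python
from typing import Optional


def service_result_for_exec_condition(exit_status: Optional[int] = None,
                                      killed_by_signal: bool = False,
                                      timed_out: bool = False) -> Optional[str]:
    """Map an ExecCondition= outcome to the $SERVICE_RESULT seen in the tests.

    Hypothetical helper: it only encodes the cases listed in this commit message.
    """
    if timed_out:
        return "timeout"
    if killed_by_signal:
        return "signal"
    if exit_status is None or exit_status == 0:
        # Condition passed (or no status given): the result then depends on
        # the rest of the unit's run, which is not covered here.
        return None
    if 1 <= exit_status <= 254:
        return "exec-condition"
    if exit_status == 255:
        return "exit-code"
    raise ValueError("exit status out of range 0..255")
```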

This commit clarifies the table at least for the case of an exit status of 1
through 254 (inclusive).
The others (signal, timeout and 255) are probably also still ambiguous (e.g.
`signal` uses “A service process”, which could be read as the actual
service process only).

Signed-off-by: Christoph Anton Mitterer <mail@christoph.anton.mitterer.name>
2025-11-06 10:47:06 +01:00
Quentin Deslandes
79dd24cf14 core: Add UserNamespacePath=
This allows a service to reuse the user namespace created for an
existing service, similarly to NetworkNamespacePath=. The configuration
of the initial user namespace (e.g. ID mapping) is preserved.
2025-11-04 10:55:04 +01:00
Luca Boccassi
e84aa21af8 man: RootImageOptions= is only supported for system services right now
Support via mountfsd is being worked on but will take more time,
so fix the documentation to be correct in the meantime.

Follow-up for fad01f798d
2025-10-22 17:22:03 +01:00
Daniel Foster
c7a444a9c1 tree-wide: extend $LISTEN_FDS protocol with $LISTEN_PIDFDID
Although extremely unlikely, there is a race present in solely checking the
$LISTEN_PID environment variable, due to PID recycling. Fix that by introducing
$LISTEN_PIDFDID, which contains the 64-bit ID of a pidfd for the child process
that is not subject to recycling.
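
A receiver-side check could look roughly like the sketch below. It is not the sd_listen_fds() implementation: the function name is hypothetical, the decimal encoding of $LISTEN_PIDFDID is an assumption, and so is the fallback to the plain PID check when the variable is absent or the caller cannot determine its own pidfd ID.

```python
from typing import Mapping, Optional


def listen_fds_are_for_us(env: Mapping[str, str],
                          my_pid: int,
                          my_pidfd_id: Optional[int]) -> bool:
    """Return True if the $LISTEN_* variables in env are addressed to this process.

    my_pidfd_id is the 64-bit pidfd ID of the calling process, obtained by the
    caller (how to obtain it is outside the scope of this sketch).
    """
    try:
        if int(env["LISTEN_PID"]) != my_pid:
            return False
    except (KeyError, ValueError):
        return False

    raw_id = env.get("LISTEN_PIDFDID")
    if raw_id is None or my_pidfd_id is None:
        # Assumption: without a pidfd ID on either side, fall back to the
        # PID comparison alone, as before this change.
        return True

    try:
        return int(raw_id) == my_pidfd_id
    except ValueError:
        return False
```

With a recycled PID the first comparison still passes, but the second one rejects the stale environment.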
2025-10-22 09:34:14 +02:00
Luca Boccassi
fad01f798d dissect: add support for verity-protected bare filesystems via mountfsd
Needed to implement support for RootHashSignature=/RootVerity=/RootHash=
and friends when going through mountfsd, for example with user units,
so that system and user units provide the same features at the same
level
2025-10-16 16:22:33 +01:00
Luca Boccassi
68b476a298 core: also enable PrivateUsers= for user services when using images via mountfsd
RootDirectory= and other options already implicitly enable PrivateUsers=
since 6ef721cbc7 if they are set in user
units, so that they can work out of the box.
Now with mountfsd support we can do the same for the images settings,
so enable them and document them.
2025-10-16 12:58:59 +01:00
Lennart Poettering
4be269563d core: if we cannot decode a TPM credential skip over it for ImportCredential=
Let's skip over credentials we cannot decode when they are found with
ImportCredential=. When installing an OS on some disk and using that
disk on a different machine than assumed we'll otherwise end up with a
broken boot, because the credentials cannot be decoded when starting
systemd-firstboot. Let's handle this somewhat gracefully.

This leaves handling for LoadCredential=/SetCredential= as it is (i.e.
failure to decrypt results in service failure), because it is a lot more
explicit and focussed as opposed to ImportCredential= which looks
everywhere, uses globs and so on and is hence very vague and unfocussed.

Fixes: #34740
2025-09-18 22:11:57 +02:00
Yu Watanabe
369f311686 man: fix typo
Follow-up for 7aefb194e7.
2025-07-11 14:11:04 +09:00
Matteo Croce
7aefb194e7 man/systemd.exec: explain how BPF token works
Add a small paragraph explaining how the BPF token works, how it is
created and its relationship with the BPF filesystem.
Move all the relevant documentation to the PrivateBPF= section and
point all the BPFDelegate* options to it.
2025-07-10 21:40:07 +02:00
Yu Watanabe
f436c64e61 man: fix typo
Follow-up for 7baf403430.
2025-07-10 14:02:00 +09:00
Yu Watanabe
1cf5b39d64 core: add 'DefaultRestrictSUIDSGID' config option (#38126)
Closes #37602; see there for extra motivation and considered
alternatives.

On typical systems, only a few services need to create SUID/SGID files.
This often is limited to the user explicitly setting suid/sgid, the
`systemd-tmpfiles*` services, and the package manager. Allowing a
default to globally restrict creation of suid/sgid files makes it easier
to apply this restriction precisely.
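
The intended precedence can be sketched as follows. This is an illustration only, not the manager's code; the function name is hypothetical, and `None` stands for "not set".

```python
from typing import Optional


def effective_restrict_suid_sgid(unit_setting: Optional[bool],
                                 manager_default: Optional[bool]) -> bool:
    """Resolve RestrictSUIDSGID= for one unit (hypothetical helper).

    An explicit per-unit RestrictSUIDSGID= wins, then DefaultRestrictSUIDSGID=,
    and finally the built-in default of "no".
    """
    if unit_setting is not None:
        return unit_setting
    if manager_default is not None:
        return manager_default
    return False
```

A unit that sets `RestrictSUIDSGID=no` therefore keeps the ability to create SUID/SGID files even when the global default is `yes`.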

## testing done
- built on aarch64-linux and x86_64-linux
- ran a VM test on x86_64-linux, checking for:
    - VM system boots successfully
    - defaults apply (`yes`, `no`, and undefined)
    - systemd tmpfiles can set suid/sgid on journal log path
- Other services explicitly defining `RestrictSUIDSGID=no` can create
suid files
2025-07-10 13:30:07 +09:00
Matteo Croce
7baf403430 man/systemd.exec: update documentation for PrivateBPF=
Add a short description about what PrivateBPF=yes does
and how it can be useful.
2025-07-10 01:57:14 +02:00
Grimmauld
0316fb8219 core: document 'DefaultRestrictSUIDSGID'
2025-07-09 21:45:46 +02:00
Matteo Croce
ea9826eb94 core: add options to delegate BPFFS token creation
Add four new options BPFDelegate{Commands,Maps,Programs,Attachments}=
in order to delegate to a BPFFS instance the permission to create tokens.

The value is a list of options taken from:
https://github.com/torvalds/linux/blob/v6.14/include/uapi/linux/bpf.h#L922-L1121
The special value "any" allows every possible value.

More information about BPF tokens here:
https://lwn.net/Articles/947173/
2025-07-08 22:35:29 +02:00
Matteo Croce
3a47437fc9 core: Introduce PrivateBPF= to mount a private BPFFS
Add a new option PrivateBPF= to mount a new instance of bpffs within a
namespace.
PrivateBPF= can be set to "no" to use the host bpffs in readonly mode
and "yes" to do a new mount.
The mount is done with the new fsopen()/fsmount() API because in future
we'll hook some commands between the two calls.
2025-07-08 22:33:28 +02:00
Andres Beltran
26c6f3271a core: add quota support for State, Cache, and Log exec directories
2025-07-07 17:28:47 +00:00
Lennart Poettering
2be3a06bb2 core: when PrivateDevices= is enabled and we need to decrypt TPM2 credentials, go via IPC
Also, if a device ACL list is defined, also go via IPC (instead of
trying to patch it, as before).

The outcome is that the tighter rules continue to apply when configured.

Fixes: #35959
2025-06-24 22:16:01 +02:00
Anton Ryzhov
bd02e15710 man/systemd-creds: fix documentation typo in systemd.exec.xml
2025-06-03 07:42:44 +09:00
Zbigniew Jędrzejewski-Szmek
b082968d19 man: better tags, more links, minor grammar and formatting improvements
Closes https://github.com/systemd/systemd/issues/35751.
2025-05-28 15:35:53 +02:00
Luca Boccassi
6946eed3fa core: Also refresh confext extensions when reloading notify-reload service (#33995)
`ExtensionImages=` and `ExtensionDirectories=` now let you specify
vpick-named extensions; however, since they just get set up once when
the service is started, you can't see newer versions without restarting
the service entirely. Here, also reload confext extensions when you
reload a service. This allows you to deploy a new version of some
configuration and have it picked up at reload time without interruption
to your workload.

Right now, we would only reload confext extensions and leave the sysext
ones behind, since it didn't seem prudent to swap out what is likely
program code at reload. This is made possible by only going for the
`SYSTEMD_CONFEXT_HIERARCHIES` overlays (which only contains `/etc`).

This PR:
- Adjusts `service.c` to also refresh extensions when needed. 
- Adds integration tests to check that a confext reload actually
occurred.
- Adds to the `systemd.exec` man pages to document this behavior.

This is a follow up to #24864 and #31364. Thank you to @bluca and
@goenkam for help in getting this up.
2025-05-20 11:27:34 +01:00
maia x.
67ecc2c7fe man: document confext reload behavior for ExtensionDirectories/Images
2025-05-19 13:36:21 +01:00
Lennart Poettering
bfb1f9e2c9 core: pass the socket cookie to invoked per-connection service instances as $SO_COOKIE env var
The socket cookie is just too useful for identifying connections, let's
emphasize this a bit and pass it as environment variable.
2025-05-15 09:45:32 +02:00
Lennart Poettering
3bdcd994cd man: correct version information when $REMOTE_ADDR/$REMOTE_PORT were added
This was in commit 3b1c524154, i.e. in the
v220 cycle.
2025-05-15 09:45:19 +02:00
Yu Watanabe
8ac5b047fc man/systemd.exec: update documents for PrivateTmp=
2025-05-11 03:33:02 +09:00
Zbigniew Jędrzejewski-Szmek
2dc4e87849 man/systemd.exec: reword description of RestrictAddressFamilies=
The text is reordered and broken into more paragraphs.
A recommendation to combine RestrictAddressFamilies= with
SystemCallFilter=@service is added.
2025-05-06 21:14:03 +02:00
Zbigniew Jędrzejewski-Szmek
802d23fcfb man/systemd.exec: reword description of SystemCallFilter=
The existing text grew organically as features were added and was
not very organized. Reorder it and break into paragraphs grouped
by topic. The description of the :errno syntax is replaced by a short
reference to the SystemCallErrorNumber= setting. This makes the
text shorter and makes it easier to explain how the two settings combine.
2025-05-06 21:14:03 +02:00
Yu Watanabe
4db8663b81 tree-wide: fix typo
2025-04-27 10:36:12 +09:00
Daan De Meyer
ba77798bba unit: Make sure individual unit maximum log level always takes priority
Currently LogLevelMax= can only be used to decrease the maximum log level
for a unit but not to increase it. Let's make sure the latter works as
well, so LogLevelMax=debug can be used to enable debug logging for specific
units without enabling debug logging globally.
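
As a rough model of the intended behaviour (not the actual logging code; names are hypothetical and the whole thing is a simplification), the per-unit setting replaces the global maximum level instead of only capping it:

```python
from typing import Optional

# syslog priorities: a lower number means a more severe message
LEVELS = {"emerg": 0, "alert": 1, "crit": 2, "err": 3,
          "warning": 4, "notice": 5, "info": 6, "debug": 7}


def effective_max_level(manager_max: str, unit_log_level_max: Optional[str]) -> str:
    """The unit's LogLevelMax=, if set, takes priority over the global level."""
    return unit_log_level_max if unit_log_level_max is not None else manager_max


def should_log(message_level: str, manager_max: str,
               unit_log_level_max: Optional[str] = None) -> bool:
    """Decide whether a message about a unit passes the level filter."""
    limit = effective_max_level(manager_max, unit_log_level_max)
    return LEVELS[message_level] <= LEVELS[limit]
```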
2025-04-23 14:46:12 +02:00
Mike Yuan
32b69b190b core: delegate mountns implicitly when any of pidns/cgns/netns is in use
2025-03-30 18:57:18 +02:00
NetSysFire
1f0e4af329 systemd.exec(5): RestrictAddressFamilies: mention address_families(7)
2025-03-11 00:00:55 +09:00
Daan De Meyer
8234cd9989 core: Add DelegateNamespaces=
This delegates one or more namespaces to the service. Concretely,
this setting influences in which order we unshare namespaces. Delegated
namespaces are unshared *after* the user namespace is unshared. Other
namespaces are unshared *before* the user namespace is unshared.
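
The ordering rule can be illustrated with a small sketch. It assumes a user namespace is being set up at all, and the function name and the short namespace names are only for illustration:

```python
from typing import Iterable, List


def unshare_order(needed: Iterable[str], delegated: Iterable[str]) -> List[str]:
    """Order namespaces around the user namespace (hypothetical helper).

    Non-delegated namespaces come first, then "user", then the delegated ones,
    which end up owned by the new user namespace.
    """
    delegated_set = set(delegated)
    others = [ns for ns in needed if ns != "user"]
    before = [ns for ns in others if ns not in delegated_set]
    after = [ns for ns in others if ns in delegated_set]
    return before + ["user"] + after
```

For example, with mount, network and PID namespaces needed and only the network namespace delegated, the network namespace is the one unshared after the user namespace.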

Fixes #35369
2025-03-01 13:54:58 +01:00
Lennart Poettering
7933e971ce pid1: pass pidfdids to invoked services in $MAINPIDFDID and $MANAGERPIDFDID
2025-01-20 21:51:40 +01:00
Lennart Poettering
8af1b296cb pid1: when a password is requested during PAMName= processing, query it via the ask-password logic
2025-01-18 11:45:44 +00:00
Michal Sekletar
f1a0f311e6 man: adjust description of PrivateUsers= so it is in line with reality
When the option is not available, the unit will not even start, so there is
no security risk.

Fixes #34983
2024-12-29 14:38:00 +09:00
Jan Engelhardt
c592ebdf4f man: grammar fixes for introductory adverbs/phrases
2024-12-25 17:24:38 +01:00
Jan Engelhardt
44855c77a1 man: expand word contractions
For written text, contractions are not normally used.
2024-12-25 17:00:31 +01:00
Jan Engelhardt
82ea392a99 man: grammar fixes for "regardless"
2024-12-25 17:00:31 +01:00
Lennart Poettering
4103bf9f2f man: document the new per-user credstore paths
(And some other minor tweaks)
2024-12-20 17:52:07 +01:00
Lennart Poettering
00a415fc8f tree-wide: remove support for kernels lacking ambient caps
Let's bump the kernel baseline a bit to 4.3 and thus require ambient
caps.

This allows us to remove support for a variety of special casing, most
importantly the ExecStart=!! hack.
2024-12-17 17:34:46 +01:00
Yu Watanabe
e76fcd0e40 core: make ProtectHostname= optionally take a hostname
Closes #35623.
2024-12-16 23:55:44 +09:00
Luca Boccassi
6dfd290031 core: Add PrivateUsers=full (#35183)
Recently, PrivateUsers=identity was added to support mapping the first
65536 UIDs/GIDs from parent to the child namespace and mapping the other
UID/GIDs to the nobody user.

However, there are use cases where users have UIDs/GIDs > 65536 and need
to do a similar identity mapping. Moreover, in some of those cases,
users want a full identity mapping from 0 -> UID_MAX.

To support this, we add PrivateUsers=full that does identity mapping for
all available UID/GIDs.

Note that to differentiate ourselves from the init user namespace, we need to
set up the uid_map/gid_map like:
```
0 0 1
1 1 UINT32_MAX - 1
```

as the init user namespace uses `0 0 UINT32_MAX` and some applications
- like systemd itself - determine whether they run in a non-init user
namespace based on the uid_map/gid_map files.
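
To illustrate why the split into two extents matters, here is a simplified sketch. The check below is a stand-in for the kind of heuristic described above, not the actual running_in_userns() code, and both function names are hypothetical.

```python
UINT32_MAX = 2**32 - 1


def full_identity_map() -> str:
    """Build the two-extent identity mapping shown above (inside, outside, count)."""
    extents = [(0, 0, 1), (1, 1, UINT32_MAX - 1)]
    return "".join(f"{inside} {outside} {count}\n"
                   for inside, outside, count in extents)


def looks_like_init_userns(map_text: str) -> bool:
    """Simplified heuristic: a single `0 0 UINT32_MAX` extent means init userns."""
    rows = [tuple(int(field) for field in line.split())
            for line in map_text.splitlines() if line.strip()]
    return rows == [(0, 0, UINT32_MAX)]
```

Both maps cover the same ID range, but only the single-extent form matches the heuristic, so a map-based check still sees the service as running in a non-init user namespace.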

Note that systemd will remove this heuristic in running_in_userns() in
version 258 (https://github.com/systemd/systemd/pull/35382) and use the
namespace inode instead. But some users may be running a container image
with an older systemd (< 258), so we keep this hack until version 259 for
version N-1 compatibility.

In addition to mapping the whole UID/GID space, we also set
/proc/pid/setgroups to "allow". While we usually set "deny" to avoid
security issues with dropping supplementary groups
(https://lwn.net/Articles/626665/), this ends up breaking dbus-broker
when running /sbin/init in full OS containers.

Fixes: #35168
Fixes: #35425
2024-12-13 12:25:13 +00:00
Ryan Wilson
2665425176 core: Set /proc/pid/setgroups to allow for PrivateUsers=full
When trying to run dbus-broker in a systemd unit with PrivateUsers=full,
we see dbus-broker fails with EPERM at `util_audit_drop_permissions`.

The root cause is dbus-broker calls the setgroups() system call and this
is disallowed via systemd's implementation of PrivateUsers= by setting
/proc/pid/setgroups = deny. This is done to remediate potential privilege
escalation vulnerabilities in user namespaces where an attacker can remove
supplementary groups and gain access to resources where those groups are
restricted.

However, for OS-like containers, setgroups() is a pretty common API and
disabling it is not feasible. So we allow setgroups() by setting
/proc/pid/setgroups to allow in PrivateUsers=full. Note security conscious
users can still use SystemCallFilter= to disable setgroups() if they want
to specifically prevent this system call.

Fixes: #35425
2024-12-12 11:36:10 +00:00
Yu Watanabe
627d1a9ac1 core: Add ProtectHostname=private (#35447)
This PR allows an option for systemd exec units to enable UTS namespaces
but not restrict changing hostname via seccomp. Thus, units can change
hostname without affecting the host. This is useful for OS-like
containers running as units where they should have freedom to change
their container hostname if they want, but not the host's hostname.

Fixes: #30348
2024-12-11 10:17:25 +09:00
Ryan Wilson
219a6dbbf3 core: Fix time namespace in RestrictNamespaces=
RestrictNamespaces= would accept "time" but would not actually apply
seccomp filters, e.g. systemd-run -p RestrictNamespaces=time unshare -T true
should fail but succeeded.

This commit actually enables time namespace seccomp filtering.
2024-12-10 20:55:26 +01:00
Ryan Wilson
cf48bde7ae core: Add ProtectHostname=private
This allows an option for systemd exec units to enable UTS namespaces
but not restrict changing hostname via seccomp. Thus, units can change
hostname without affecting the host.

Fixes: #30348
2024-12-06 13:34:04 -08:00
Ryan Wilson
705cc82938 core: Add PrivateUsers=full
Recently, PrivateUsers=identity was added to support mapping the first
65536 UIDs/GIDs from parent to the child namespace and mapping the other
UID/GIDs to the nobody user.

However, there are use cases where users have UIDs/GIDs > 65536 and need
to do a similar identity mapping. Moreover, in some of those cases, users
want a full identity mapping from 0 -> UID_MAX.

Note that to differentiate ourselves from the init user namespace, we need to
set up the uid_map/gid_map like:
```
0 0 1
1 1 UINT32_MAX - 1
```

as the init user namespace uses `0 0 UINT32_MAX` and some applications -
like systemd itself - determine whether they run in a non-init user namespace
based on the uid_map/gid_map files. Note that systemd will remove this
heuristic in running_in_userns() in version 258 and use the namespace inode
instead. But some users may be running a container image with an older
systemd (< 258), so we keep this hack until version 259.

To support this, we add PrivateUsers=full that does identity mapping for
all available UID/GIDs.

Fixes: #35168
2024-12-05 10:34:32 -08:00
Septatrix
5857f31c2c man: clarify wording regarding MONITOR_* envs
2024-12-06 03:01:19 +09:00
Zbigniew Jędrzejewski-Szmek
fe45f8dc9b man: drop whitespace from final <programlisting> lines
In the troff output, this doesn't seem to make any difference. But in the
html output, the whitespace is sometimes preserved, creating an additional
gap before the following content. Drop it everywhere to avoid this.
2024-11-08 14:14:36 +01:00
Lennart Poettering
b711737096 man: document that PrivateTmp= is unaffected by ProtectSystem=strict
Fixes: #33130
2024-11-05 22:57:51 +01:00