Commit Graph

672 Commits

Author SHA1 Message Date
Yu Watanabe
369f311686 man: fix typo
Follow-up for 7aefb194e7.
2025-07-11 14:11:04 +09:00
Matteo Croce
7aefb194e7 man/systemd.exec: explain how BPF token works
Add a small paragraph explaining how BPF token works, how it's being
created and its relationship between the BPF filesystem.
Move all the relevant documentation in the PrivateBPF= section and let
point all the BPFDelegate* options to that one.
2025-07-10 21:40:07 +02:00
Yu Watanabe
f436c64e61 man: fix typo
Follow-up for 7baf403430.
2025-07-10 14:02:00 +09:00
Yu Watanabe
1cf5b39d64 core: add 'DefaultRestrictSUIDSGID' config option (#38126)
closes #37602, see there for extra motivation and considered
alternatives.

On typical systems, only few services need to create SUID/SGID files.
This often is limited to the user explicitly setting suid/sgid, the
`systemd-tmpfiles*` services, and the package manager. Allowing a
default to globally restrict creation of suid/sgid files makes it easier
to apply this restriction precisely.

## testing done
- built on aarch64-linux and x86_64-linux
- ran a VM test on x86_64-linux, checking for:
    - VM system boots successfully
    - defaults apply (both `yes`, `no`, and undefined)
    - systemd tmpfiles can set suid/sgid on journal log path
- Other services explicitly defining `RestrictSUIDSGID=no` can create
suid files
2025-07-10 13:30:07 +09:00
Matteo Croce
7baf403430 man/systemd.exec: update documentation for PrivateBPF=
Add a short description about what PrivateBPF=yes does
and how it can be useful.
2025-07-10 01:57:14 +02:00
Grimmauld
0316fb8219 core: document 'DefaultRestrictSUIDSGID' 2025-07-09 21:45:46 +02:00
Matteo Croce
ea9826eb94 core: add options to delegate BPFFS token creation
Add four new options BPFDelegate{Commands,Maps,Programs,Attachments}=
in order to delegate to a BPFFS instance the permission to create tokens.

The value is a list of options taken from:
https://github.com/torvalds/linux/blob/v6.14/include/uapi/linux/bpf.h#L922-L1121
The special value "any" means to allow every possible values.

More informations about BPF tokens here:
https://lwn.net/Articles/947173/
2025-07-08 22:35:29 +02:00
Matteo Croce
3a47437fc9 core: Introduce PrivateBPF= to mount a private BPFFS
Add a new option PrivateBPF= to mount a new instance of bpffs within a
namespace.
PrivateBPF= can be set to "no" to use the host bpffs in readonly mode
and "yes" to do a new mount.
The mount is done with the new fsopen()/fsmount() API because in future
we'll hook some commands between the two calls.
2025-07-08 22:33:28 +02:00
Andres Beltran
26c6f3271a core: add quota support for State, Cache, and Log exec directories 2025-07-07 17:28:47 +00:00
Lennart Poettering
2be3a06bb2 core: when PrivateDevices= is enabled and we need to decrypt TPM2 credentials, go via IPC
Also, if a device ACL list is defined, also go via IPC (instead of
trying to patch it, as before).

The outcome is that the tighter rules continue to apply when configured.

Fixes: #35959
2025-06-24 22:16:01 +02:00
Anton Ryzhov
bd02e15710 man/systemd-creds: fix documentation typo in systemd.exec.xml 2025-06-03 07:42:44 +09:00
Zbigniew Jędrzejewski-Szmek
b082968d19 man: better tags, more links, minor grammar and formatting improvements
Closes https://github.com/systemd/systemd/issues/35751.
2025-05-28 15:35:53 +02:00
Luca Boccassi
6946eed3fa core: Also refresh confext extensions when reloading notify-reload service (#33995)
`ExtensionImages=` and `ExtensionDirectories=` now let you specify
vpick-named extensions; however, since they just get set up once when
the service is started, you can't see newer versions without restarting
the service entirely. Here, also reload confext extensions when you
reload a service. This allows you to deploy a new version of some
configuration and have it picked up at reload time without interruption
to your workload.

Right now, we would only reload confext extensions and leave the sysext
ones behind, since it didn't seem prudent to swap out what is likely
program code at reload. This is made possible by only going for the
`SYSTEMD_CONFEXT_HIERARCHIES` overlays (which only contains `/etc`).

This PR:
- Adjusts `service.c` to also refresh extensions when needed. 
- Adds integration tests to check that a confext reload actually
occurred.
- Adds to the `systemd.exec` man pages to document this behavior.

This is a follow up to #24864 and #31364. Thank you to @bluca and
@goenkam for help in getting this up.
2025-05-20 11:27:34 +01:00
maia x.
67ecc2c7fe man: document confext reload behavior for ExtensionDirectories/Images 2025-05-19 13:36:21 +01:00
Lennart Poettering
bfb1f9e2c9 core: pass the socket cookie to invoked per-connection service instances as $SO_COOKIE env var
The socket cookie is just too useful for identifying connections, let's
emphasize this a bit and pass it as environment variable.
2025-05-15 09:45:32 +02:00
Lennart Poettering
3bdcd994cd man: correct version information when $REMOTE_ADDR/$REMOTE_PORT where added
This was in commit 3b1c524154, i.e. in the
v220 cycle.
2025-05-15 09:45:19 +02:00
Yu Watanabe
8ac5b047fc man/systemd.exec: update documents for PrivateTmp= 2025-05-11 03:33:02 +09:00
Zbigniew Jędrzejewski-Szmek
2dc4e87849 man/systemd.exec: reword description of RestrictAddressFamilies=
The text is reordered and broken into more paragraphs.
A recommendation to combine RestrictAddressFamilies= with
SystemCallFilter=@service is added.
2025-05-06 21:14:03 +02:00
Zbigniew Jędrzejewski-Szmek
802d23fcfb man/systemd.exec: reword description of SystemCallFilter=
The existing text grew organically as features were added and was
not very organized. Reorder it and break into paragraphs grouped
by topic. The description of the :errno syntax is replaced by a short
reference to the SystemCallErrorNumber= setting. This makes the
text shorter and makes it easier to explain how the two settings combine.
2025-05-06 21:14:03 +02:00
Yu Watanabe
4db8663b81 tree-wide: fix typo 2025-04-27 10:36:12 +09:00
Daan De Meyer
ba77798bba unit: Make sure individual unit maximum log level always takes priority
Currently LogLevelMax= can only be used to decrease the maximum log level
for a unit but not to increase it. Let's make sure the latter works as
well, so LogLevelMax=debug can be used to enable debug logging for specific
units without enabling debug logging globally.
2025-04-23 14:46:12 +02:00
Mike Yuan
32b69b190b core: delegate mountns implicitly when any of pidns/cgns/netns is in use 2025-03-30 18:57:18 +02:00
NetSysFire
1f0e4af329 systemd.exec(5): RestrictAddressFamilies: mention address_families(7) 2025-03-11 00:00:55 +09:00
Daan De Meyer
8234cd9989 core: Add DelegateNamespaces=
This delegates one or more namespaces to the service. Concretely,
this setting influences in which order we unshare namespaces. Delegated
namespaces are unshared *after* the user namespace is unshared. Other
namespaces are unshared *before* the user namespace is unshared.

Fixes #35369
2025-03-01 13:54:58 +01:00
Lennart Poettering
7933e971ce pid1: pass pidfdids to invoked services in $MAINPIDFDID and $MANAGERPIDFDID 2025-01-20 21:51:40 +01:00
Lennart Poettering
8af1b296cb pid1: when a password is requested during PAMName= processing, query it via the ask-password logic 2025-01-18 11:45:44 +00:00
Michal Sekletar
f1a0f311e6 man: adjust description of PrivateUsers= so it is in line with reality
When the option is not available unit will not even start so there is
no security risk.

Fixes #34983
2024-12-29 14:38:00 +09:00
Jan Engelhardt
c592ebdf4f man: grammar fixes for introductory adverbs/phrases 2024-12-25 17:24:38 +01:00
Jan Engelhardt
44855c77a1 man: expand word contractions
For written text, contractions are not normally used.
2024-12-25 17:00:31 +01:00
Jan Engelhardt
82ea392a99 man: grammar fixes for "regardless" 2024-12-25 17:00:31 +01:00
Lennart Poettering
4103bf9f2f man: document the new per-use credstore paths
(And some other minor tweaks)
2024-12-20 17:52:07 +01:00
Lennart Poettering
00a415fc8f tree-wide: remove support for kernels lacking ambient caps
Let's bump the kernel baseline a bit to 4.3 and thus require ambient
caps.

This allows us to remove support for a variety of special casing, most
importantly the ExecStart=!! hack.
2024-12-17 17:34:46 +01:00
Yu Watanabe
e76fcd0e40 core: make ProtectHostname= optionally take a hostname
Closes #35623.
2024-12-16 23:55:44 +09:00
Luca Boccassi
6dfd290031 core: Add PrivateUsers=full (#35183)
Recently, PrivateUsers=identity was added to support mapping the first
65536 UIDs/GIDs from parent to the child namespace and mapping the other
UID/GIDs to the nobody user.

However, there are use cases where users have UIDs/GIDs > 65536 and need
to do a similar identity mapping. Moreover, in some of those cases,
users want a full identity mapping from 0 -> UID_MAX.

To support this, we add PrivateUsers=full that does identity mapping for
all available UID/GIDs.

Note to differentiate ourselves from the init user namespace, we need to
set up the uid_map/gid_map like:
```
0 0 1
1 1 UINT32_MAX - 1
```

as the init user namedspace uses `0 0 UINT32_MAX` and some applications
- like systemd itself - determine if its a non-init user namespace based
on uid_map/gid_map files.

Note systemd will remove this heuristic in running_in_userns() in
version 258 (https://github.com/systemd/systemd/pull/35382) and uses
namespace inode. But some users may be running a container image with
older systemd < 258 so we keep this hack until version 259 for version
N-1 compatibility.

In addition to mapping the whole UID/GID space, we also set
/proc/pid/setgroups to "allow". While we usually set "deny" to avoid
security issues with dropping supplementary groups
(https://lwn.net/Articles/626665/), this ends up breaking dbus-broker
when running /sbin/init in full OS containers.

Fixes: #35168
Fixes: #35425
2024-12-13 12:25:13 +00:00
Ryan Wilson
2665425176 core: Set /proc/pid/setgroups to allow for PrivateUsers=full
When trying to run dbus-broker in a systemd unit with PrivateUsers=full,
we see dbus-broker fails with EPERM at `util_audit_drop_permissions`.

The root cause is dbus-broker calls the setgroups() system call and this
is disallowed via systemd's implementation of PrivateUsers= by setting
/proc/pid/setgroups = deny. This is done to remediate potential privilege
escalation vulnerabilities in user namespaces where an attacker can remove
supplementary groups and gain access to resources where those groups are
restricted.

However, for OS-like containers, setgroups() is a pretty common API and
disabling it is not feasible. So we allow setgroups() by setting
/proc/pid/setgroups to allow in PrivateUsers=full. Note security conscious
users can still use SystemCallFilter= to disable setgroups() if they want
to specifically prevent this system call.

Fixes: #35425
2024-12-12 11:36:10 +00:00
Yu Watanabe
627d1a9ac1 core: Add ProtectHostname=private (#35447)
This PR allows an option for systemd exec units to enable UTS namespaces
but not restrict changing hostname via seccomp. Thus, units can change
hostname without affecting the host. This is useful for OS-like
containers running as units where they should have freedom to change
their container hostname if they want, but not the host's hostname.

Fixes: #30348
2024-12-11 10:17:25 +09:00
Ryan Wilson
219a6dbbf3 core: Fix time namespace in RestrictNamespaces=
RestrictNamespaces= would accept "time" but would not actually apply
seccomp filters e.g. systemd-run -p RestrictNamespaces=time unshare -T true
should fail but it succeeded.

This commit actually enables time namespace seccomp filtering.
2024-12-10 20:55:26 +01:00
Ryan Wilson
cf48bde7ae core: Add ProtectHostname=private
This allows an option for systemd exec units to enable UTS namespaces
but not restrict changing hostname via seccomp. Thus, units can change
hostname without affecting the host.

Fixes: #30348
2024-12-06 13:34:04 -08:00
Ryan Wilson
705cc82938 core: Add PrivateUsers=full
Recently, PrivateUsers=identity was added to support mapping the first
65536 UIDs/GIDs from parent to the child namespace and mapping the other
UID/GIDs to the nobody user.

However, there are use cases where users have UIDs/GIDs > 65536 and need
to do a similar identity mapping. Moreover, in some of those cases, users
want a full identity mapping from 0 -> UID_MAX.

Note to differentiate ourselves from the init user namespace, we need to
set up the uid_map/gid_map like:
```
0 0 1
1 1 UINT32_MAX - 1
```

as the init user namedspace uses `0 0 UINT32_MAX` and some applications -
like systemd itself - determine if its a non-init user namespace based on
uid_map/gid_map files. Note systemd will remove this heuristic in
running_in_userns() in version 258 and uses namespace inode. But some users
may be running a container image with older systemd < 258 so we keep this
hack until version 259.

To support this, we add PrivateUsers=full that does identity mapping for
all available UID/GIDs.

Fixes: #35168
2024-12-05 10:34:32 -08:00
Septatrix
5857f31c2c man: clarify wording regarding MONITOR_* envs 2024-12-06 03:01:19 +09:00
Zbigniew Jędrzejewski-Szmek
fe45f8dc9b man: drop whitespace from final <programlisting> lines
In the troff output, this doesn't seem to make any difference. But in the
html output, the whitespace is sometimes preserved, creating an additional
gap before the following content. Drop it everywhere to avoid this.
2024-11-08 14:14:36 +01:00
Lennart Poettering
b711737096 man: document that PrivateTmp= is unaffected by ProtectSystem=strict
Fixes: #33130
2024-11-05 22:57:51 +01:00
Lennart Poettering
ecbe9ae5a0 man: don't claim SELinuxContext= only worked in the system service manager
Fixes: #34840
2024-11-05 22:42:38 +01:00
Daan De Meyer
406f177501 core: Introduce PrivatePIDs=
This new setting allows unsharing the pid namespace in a unit. Because
you have to fork to get a process into a pid namespace, we fork in
systemd-executor to get into the new pid namespace. The parent then
sends the pid of the child process back to the manager and exits while
the child process continues on with the rest of exec_invoke() and then
executes the actual payload.

Communicating the child pid is done via a new pidref socket pair that is
set up on manager startup.

We unshare the PID namespace right before the mount namespace so we
mount procfs correctly. Note PrivatePIDs=yes always implies MountAPIVFS=yes
to mount procfs.

When running unprivileged in a user session, user namespace is set up first
to allow for PID namespace to be unshared. However, when running in
privileged mode, we unshare the user namespace last to ensure the user
namespace does not own the PID namespace and cannot break out of the sandbox.

Note we disallow Type=forking services from using PrivatePIDs=yes since the
init proess inside the PID namespace must not exit for other processes in
the namespace to exist.

Note Daan De Meyer did the original work for this commit with Ryan Wilson
addressing follow-ups.

Co-authored-by: Daan De Meyer <daan.j.demeyer@gmail.com>
2024-11-05 05:32:02 -08:00
Andres Beltran
eae5127246 core: add id-mapped mount support for Exec directories 2024-11-01 18:45:28 +00:00
Luca Boccassi
890bdd1d77 core: add read-only flag for exec directories
When an exec directory is shared between services, this allows one of the
service to be the producer of files, and the other the consumer, without
letting the consumer modify the shared files.
This will be especially useful in conjunction with id-mapped exec directories
so that fully sandboxed services can share directories in one direction, safely.
2024-11-01 10:46:55 +00:00
Ryan Wilson
cd58b5a135 cgroup: Add support for ProtectControlGroups= private and strict
This commit adds two settings private and strict to
the ProtectControlGroups= property. Private will unshare the cgroup
namespace and mount a read-write private cgroup2 filesystem at /sys/fs/cgroup.
Strict does the same except the mount is read-only. Since the unit is
running in a cgroup namespace, the new root of /sys/fs/cgroup is the unit's
own cgroup.

We also add a new dbus property ProtectControlGroupsEx which accepts strings
instead of boolean. This will allow users to use private/strict via dbus
and systemd-run in addition to service files.

Note private and strict fall back to no and yes respectively if the kernel
doesn't support cgroup2 or system is not using unified hierarchy.

Fixes: #34634
2024-10-28 08:37:36 -07:00
Yu Watanabe
edd3f4d9b7 core: drop implicit support of PrivateUsers=off
Follow-up for fa693fdc7e.

The documentation says the option takes a boolean or one of the "self"
and "identity". But the parser uses private_users_from_string() which
also accepts "off". Let's drop the implicit support of "off".
2024-10-09 05:39:54 +09:00
Jason Yundt
dfb3155419 man: document ShowStatus and SetShowStatus()
SetShowStatus() was added in order to fix #11447. Recently, I ran into
the exact same problem that OP was experiencing in #11447. I wasn’t able
to figure out how to deal with the problem until I found #11447, and it
took me a while to find #11447.

This commit takes what I learned from reading #11447 and adds it to the
documentation. Hopefully, this will make it easier for other people who
run into the same problem in the future.
2024-09-18 10:11:55 +02:00
Daan De Meyer
fa693fdc7e core: Add support for PrivateUsers=identity
This configures an indentity mapping similar to
systemd-nspawn --private-users=identity.
2024-09-09 18:31:01 +02:00