systemd

mirror of https://github.com/morgan9e/systemd synced 2026-04-15 00:47:10 +09:00

Author	SHA1	Message	Date
Yu Watanabe	369f311686	man: fix typo Follow-up for `7aefb194e7`.	2025-07-11 14:11:04 +09:00
Matteo Croce	7aefb194e7	man/systemd.exec: explain how BPF token works Add a small paragraph explaining how the BPF token works, how it is created, and its relationship with the BPF filesystem. Move all the relevant documentation into the PrivateBPF= section and point all the BPFDelegate* options to it.	2025-07-10 21:40:07 +02:00
Yu Watanabe	f436c64e61	man: fix typo Follow-up for `7baf403430`.	2025-07-10 14:02:00 +09:00
Yu Watanabe	1cf5b39d64	core: add 'DefaultRestrictSUIDSGID' config option (#38126) closes #37602, see there for extra motivation and considered alternatives. On typical systems, only few services need to create SUID/SGID files. This often is limited to the user explicitly setting suid/sgid, the `systemd-tmpfiles*` services, and the package manager. Allowing a default to globally restrict creation of suid/sgid files makes it easier to apply this restriction precisely. ## testing done - built on aarch64-linux and x86_64-linux - ran a VM test on x86_64-linux, checking for: - VM system boots successfully - defaults apply (both `yes`, `no`, and undefined) - systemd tmpfiles can set suid/sgid on journal log path - Other services explicitly defining `RestrictSUIDSGID=no` can create suid files	2025-07-10 13:30:07 +09:00
Matteo Croce	7baf403430	man/systemd.exec: update documentation for PrivateBPF= Add a short description about what PrivateBPF=yes does and how it can be useful.	2025-07-10 01:57:14 +02:00
Grimmauld	0316fb8219	core: document 'DefaultRestrictSUIDSGID'	2025-07-09 21:45:46 +02:00
Matteo Croce	ea9826eb94	core: add options to delegate BPFFS token creation Add four new options BPFDelegate{Commands,Maps,Programs,Attachments}= in order to delegate to a BPFFS instance the permission to create tokens. The value is a list of options taken from: https://github.com/torvalds/linux/blob/v6.14/include/uapi/linux/bpf.h#L922-L1121 The special value "any" allows every possible value. More information about BPF tokens here: https://lwn.net/Articles/947173/	2025-07-08 22:35:29 +02:00
Matteo Croce	3a47437fc9	core: Introduce PrivateBPF= to mount a private BPFFS Add a new option PrivateBPF= to mount a new instance of bpffs within a namespace. PrivateBPF= can be set to "no" to use the host bpffs in readonly mode and "yes" to do a new mount. The mount is done with the new fsopen()/fsmount() API because in future we'll hook some commands between the two calls.	2025-07-08 22:33:28 +02:00
Andres Beltran	26c6f3271a	core: add quota support for State, Cache, and Log exec directories	2025-07-07 17:28:47 +00:00
Lennart Poettering	2be3a06bb2	core: when PrivateDevices= is enabled and we need to decrypt TPM2 credentials, go via IPC Also, if a device ACL list is defined, also go via IPC (instead of trying to patch it, as before). The outcome is that the tighter rules continue to apply when configured. Fixes: #35959	2025-06-24 22:16:01 +02:00
Anton Ryzhov	bd02e15710	man/systemd-creds: fix documentation typo in systemd.exec.xml	2025-06-03 07:42:44 +09:00
Zbigniew Jędrzejewski-Szmek	b082968d19	man: better tags, more links, minor grammar and formatting improvements Closes https://github.com/systemd/systemd/issues/35751.	2025-05-28 15:35:53 +02:00
Luca Boccassi	6946eed3fa	core: Also refresh confext extensions when reloading notify-reload service (#33995) `ExtensionImages=` and `ExtensionDirectories=` now let you specify vpick-named extensions; however, since they just get set up once when the service is started, you can't see newer versions without restarting the service entirely. Here, also reload confext extensions when you reload a service. This allows you to deploy a new version of some configuration and have it picked up at reload time without interruption to your workload. Right now, we would only reload confext extensions and leave the sysext ones behind, since it didn't seem prudent to swap out what is likely program code at reload. This is made possible by only going for the `SYSTEMD_CONFEXT_HIERARCHIES` overlays (which only contains `/etc`). This PR: - Adjusts `service.c` to also refresh extensions when needed. - Adds integration tests to check that a confext reload actually occurred. - Adds to the `systemd.exec` man pages to document this behavior. This is a follow up to #24864 and #31364. Thank you to @bluca and @goenkam for help in getting this up.	2025-05-20 11:27:34 +01:00
maia x.	67ecc2c7fe	man: document confext reload behavior for ExtensionDirectories/Images	2025-05-19 13:36:21 +01:00
Lennart Poettering	bfb1f9e2c9	core: pass the socket cookie to invoked per-connection service instances as $SO_COOKIE env var The socket cookie is just too useful for identifying connections, let's emphasize this a bit and pass it as environment variable.	2025-05-15 09:45:32 +02:00
Lennart Poettering	3bdcd994cd	man: correct version information when $REMOTE_ADDR/$REMOTE_PORT were added This was in commit `3b1c524154`, i.e. in the v220 cycle.	2025-05-15 09:45:19 +02:00
Yu Watanabe	8ac5b047fc	man/systemd.exec: update documents for PrivateTmp=	2025-05-11 03:33:02 +09:00
Zbigniew Jędrzejewski-Szmek	2dc4e87849	man/systemd.exec: reword description of RestrictAddressFamilies= The text is reordered and broken into more paragraphs. A recommendation to combine RestrictAddressFamilies= with SystemCallFilter=@service is added.	2025-05-06 21:14:03 +02:00
Zbigniew Jędrzejewski-Szmek	802d23fcfb	man/systemd.exec: reword description of SystemCallFilter= The existing text grew organically as features were added and was not very organized. Reorder it and break into paragraphs grouped by topic. The description of the :errno syntax is replaced by a short reference to the SystemCallErrorNumber= setting. This makes the text shorter and makes it easier to explain how the two settings combine.	2025-05-06 21:14:03 +02:00
Yu Watanabe	4db8663b81	tree-wide: fix typo	2025-04-27 10:36:12 +09:00
Daan De Meyer	ba77798bba	unit: Make sure individual unit maximum log level always takes priority Currently LogLevelMax= can only be used to decrease the maximum log level for a unit but not to increase it. Let's make sure the latter works as well, so LogLevelMax=debug can be used to enable debug logging for specific units without enabling debug logging globally.	2025-04-23 14:46:12 +02:00
Mike Yuan	32b69b190b	core: delegate mountns implicitly when any of pidns/cgns/netns is in use	2025-03-30 18:57:18 +02:00
NetSysFire	1f0e4af329	systemd.exec(5): RestrictAddressFamilies: mention address_families(7)	2025-03-11 00:00:55 +09:00
Daan De Meyer	8234cd9989	core: Add DelegateNamespaces= This delegates one or more namespaces to the service. Concretely, this setting influences in which order we unshare namespaces. Delegated namespaces are unshared after the user namespace is unshared. Other namespaces are unshared before the user namespace is unshared. Fixes #35369	2025-03-01 13:54:58 +01:00
Lennart Poettering	7933e971ce	pid1: pass pidfdids to invoked services in $MAINPIDFDID and $MANAGERPIDFDID	2025-01-20 21:51:40 +01:00
Lennart Poettering	8af1b296cb	pid1: when a password is requested during PAMName= processing, query it via the ask-password logic	2025-01-18 11:45:44 +00:00
Michal Sekletar	f1a0f311e6	man: adjust description of PrivateUsers= so it is in line with reality When the option is not available unit will not even start so there is no security risk. Fixes #34983	2024-12-29 14:38:00 +09:00
Jan Engelhardt	c592ebdf4f	man: grammar fixes for introductory adverbs/phrases	2024-12-25 17:24:38 +01:00
Jan Engelhardt	44855c77a1	man: expand word contractions For written text, contractions are not normally used.	2024-12-25 17:00:31 +01:00
Jan Engelhardt	82ea392a99	man: grammar fixes for "regardless"	2024-12-25 17:00:31 +01:00
Lennart Poettering	4103bf9f2f	man: document the new per-use credstore paths (And some other minor tweaks)	2024-12-20 17:52:07 +01:00
Lennart Poettering	00a415fc8f	tree-wide: remove support for kernels lacking ambient caps Let's bump the kernel baseline a bit to 4.3 and thus require ambient caps. This allows us to remove support for a variety of special casing, most importantly the ExecStart=!! hack.	2024-12-17 17:34:46 +01:00
Yu Watanabe	e76fcd0e40	core: make ProtectHostname= optionally take a hostname Closes #35623.	2024-12-16 23:55:44 +09:00
Luca Boccassi	6dfd290031	core: Add PrivateUsers=full (#35183) Recently, PrivateUsers=identity was added to support mapping the first 65536 UIDs/GIDs from parent to the child namespace and mapping the other UID/GIDs to the nobody user. However, there are use cases where users have UIDs/GIDs > 65536 and need to do a similar identity mapping. Moreover, in some of those cases, users want a full identity mapping from 0 -> UID_MAX. To support this, we add PrivateUsers=full that does identity mapping for all available UID/GIDs. Note to differentiate ourselves from the init user namespace, we need to set up the uid_map/gid_map as two lines, `0 0 1` and `1 1 UINT32_MAX - 1`, as the init user namespace uses `0 0 UINT32_MAX` and some applications - like systemd itself - determine if it's a non-init user namespace based on uid_map/gid_map files. Note systemd will remove this heuristic in running_in_userns() in version 258 (https://github.com/systemd/systemd/pull/35382) and use the namespace inode instead. But some users may be running a container image with older systemd < 258 so we keep this hack until version 259 for version N-1 compatibility. In addition to mapping the whole UID/GID space, we also set /proc/pid/setgroups to "allow". While we usually set "deny" to avoid security issues with dropping supplementary groups (https://lwn.net/Articles/626665/), this ends up breaking dbus-broker when running /sbin/init in full OS containers. Fixes: #35168 Fixes: #35425	2024-12-13 12:25:13 +00:00
Ryan Wilson	2665425176	core: Set /proc/pid/setgroups to allow for PrivateUsers=full When trying to run dbus-broker in a systemd unit with PrivateUsers=full, we see dbus-broker fails with EPERM at `util_audit_drop_permissions`. The root cause is dbus-broker calls the setgroups() system call and this is disallowed via systemd's implementation of PrivateUsers= by setting /proc/pid/setgroups = deny. This is done to remediate potential privilege escalation vulnerabilities in user namespaces where an attacker can remove supplementary groups and gain access to resources where those groups are restricted. However, for OS-like containers, setgroups() is a pretty common API and disabling it is not feasible. So we allow setgroups() by setting /proc/pid/setgroups to allow in PrivateUsers=full. Note security conscious users can still use SystemCallFilter= to disable setgroups() if they want to specifically prevent this system call. Fixes: #35425	2024-12-12 11:36:10 +00:00
Yu Watanabe	627d1a9ac1	core: Add ProtectHostname=private (#35447) This PR allows an option for systemd exec units to enable UTS namespaces but not restrict changing hostname via seccomp. Thus, units can change hostname without affecting the host. This is useful for OS-like containers running as units where they should have freedom to change their container hostname if they want, but not the host's hostname. Fixes: #30348	2024-12-11 10:17:25 +09:00
Ryan Wilson	219a6dbbf3	core: Fix time namespace in RestrictNamespaces= RestrictNamespaces= would accept "time" but would not actually apply seccomp filters e.g. systemd-run -p RestrictNamespaces=time unshare -T true should fail but it succeeded. This commit actually enables time namespace seccomp filtering.	2024-12-10 20:55:26 +01:00
Ryan Wilson	cf48bde7ae	core: Add ProtectHostname=private This allows an option for systemd exec units to enable UTS namespaces but not restrict changing hostname via seccomp. Thus, units can change hostname without affecting the host. Fixes: #30348	2024-12-06 13:34:04 -08:00
Ryan Wilson	705cc82938	core: Add PrivateUsers=full Recently, PrivateUsers=identity was added to support mapping the first 65536 UIDs/GIDs from parent to the child namespace and mapping the other UID/GIDs to the nobody user. However, there are use cases where users have UIDs/GIDs > 65536 and need to do a similar identity mapping. Moreover, in some of those cases, users want a full identity mapping from 0 -> UID_MAX. Note to differentiate ourselves from the init user namespace, we need to set up the uid_map/gid_map as two lines, `0 0 1` and `1 1 UINT32_MAX - 1`, as the init user namespace uses `0 0 UINT32_MAX` and some applications - like systemd itself - determine if it's a non-init user namespace based on uid_map/gid_map files. Note systemd will remove this heuristic in running_in_userns() in version 258 and use the namespace inode instead. But some users may be running a container image with older systemd < 258 so we keep this hack until version 259. To support this, we add PrivateUsers=full that does identity mapping for all available UID/GIDs. Fixes: #35168	2024-12-05 10:34:32 -08:00
Septatrix	5857f31c2c	man: clarify wording regarding MONITOR_* envs	2024-12-06 03:01:19 +09:00
Zbigniew Jędrzejewski-Szmek	fe45f8dc9b	man: drop whitespace from final <programlisting> lines In the troff output, this doesn't seem to make any difference. But in the html output, the whitespace is sometimes preserved, creating an additional gap before the following content. Drop it everywhere to avoid this.	2024-11-08 14:14:36 +01:00
Lennart Poettering	b711737096	man: document that PrivateTmp= is unaffected by ProtectSystem=strict Fixes: #33130	2024-11-05 22:57:51 +01:00
Lennart Poettering	ecbe9ae5a0	man: don't claim SELinuxContext= only worked in the system service manager Fixes: #34840	2024-11-05 22:42:38 +01:00
Daan De Meyer	406f177501	core: Introduce PrivatePIDs= This new setting allows unsharing the pid namespace in a unit. Because you have to fork to get a process into a pid namespace, we fork in systemd-executor to get into the new pid namespace. The parent then sends the pid of the child process back to the manager and exits while the child process continues on with the rest of exec_invoke() and then executes the actual payload. Communicating the child pid is done via a new pidref socket pair that is set up on manager startup. We unshare the PID namespace right before the mount namespace so we mount procfs correctly. Note PrivatePIDs=yes always implies MountAPIVFS=yes to mount procfs. When running unprivileged in a user session, user namespace is set up first to allow for PID namespace to be unshared. However, when running in privileged mode, we unshare the user namespace last to ensure the user namespace does not own the PID namespace and cannot break out of the sandbox. Note we disallow Type=forking services from using PrivatePIDs=yes since the init process inside the PID namespace must not exit for other processes in the namespace to exist. Note Daan De Meyer did the original work for this commit with Ryan Wilson addressing follow-ups. Co-authored-by: Daan De Meyer <daan.j.demeyer@gmail.com>	2024-11-05 05:32:02 -08:00
Andres Beltran	eae5127246	core: add id-mapped mount support for Exec directories	2024-11-01 18:45:28 +00:00
Luca Boccassi	890bdd1d77	core: add read-only flag for exec directories When an exec directory is shared between services, this allows one of the service to be the producer of files, and the other the consumer, without letting the consumer modify the shared files. This will be especially useful in conjunction with id-mapped exec directories so that fully sandboxed services can share directories in one direction, safely.	2024-11-01 10:46:55 +00:00
Ryan Wilson	cd58b5a135	cgroup: Add support for ProtectControlGroups= private and strict This commit adds two settings private and strict to the ProtectControlGroups= property. Private will unshare the cgroup namespace and mount a read-write private cgroup2 filesystem at /sys/fs/cgroup. Strict does the same except the mount is read-only. Since the unit is running in a cgroup namespace, the new root of /sys/fs/cgroup is the unit's own cgroup. We also add a new dbus property ProtectControlGroupsEx which accepts strings instead of boolean. This will allow users to use private/strict via dbus and systemd-run in addition to service files. Note private and strict fall back to no and yes respectively if the kernel doesn't support cgroup2 or system is not using unified hierarchy. Fixes: #34634	2024-10-28 08:37:36 -07:00
Yu Watanabe	edd3f4d9b7	core: drop implicit support of PrivateUsers=off Follow-up for `fa693fdc7e`. The documentation says the option takes a boolean or one of the "self" and "identity". But the parser uses private_users_from_string() which also accepts "off". Let's drop the implicit support of "off".	2024-10-09 05:39:54 +09:00
Jason Yundt	dfb3155419	man: document ShowStatus and SetShowStatus() SetShowStatus() was added in order to fix #11447. Recently, I ran into the exact same problem that OP was experiencing in #11447. I wasn’t able to figure out how to deal with the problem until I found #11447, and it took me a while to find #11447. This commit takes what I learned from reading #11447 and adds it to the documentation. Hopefully, this will make it easier for other people who run into the same problem in the future.	2024-09-18 10:11:55 +02:00
Daan De Meyer	fa693fdc7e	core: Add support for PrivateUsers=identity This configures an identity mapping similar to systemd-nspawn --private-users=identity.	2024-09-09 18:31:01 +02:00


672 Commits