systemd

mirror of https://github.com/morgan9e/systemd synced 2026-04-14 08:25:20 +09:00

Author	SHA1	Message	Date
Lennart Poettering	00a415fc8f	tree-wide: remove support for kernels lacking ambient caps Let's bump the kernel baseline a bit to 4.3 and thus require ambient caps. This allows us to remove support for a variety of special casing, most importantly the ExecStart=!! hack.	2024-12-17 17:34:46 +01:00
Yu Watanabe	e76fcd0e40	core: make ProtectHostname= optionally take a hostname Closes #35623.	2024-12-16 23:55:44 +09:00
Yu Watanabe	0d298a771a	core/exec-invoke: fix ProtectHostname= value in log message Follow-up for `cf48bde7ae`.	2024-12-16 23:55:44 +09:00
Daan De Meyer	18bb30c3b2	core: Bind mount notify socket to /run/host/notify in sandboxed units (#35573) To be able to run systemd in a Type=notify transient unit, the notify socket can't be bind mounted to /run/systemd/notify as systemd in the transient unit wants to use that as its own notify socket which conflicts with systemd on the host. Instead, for sandboxed units, let's bind mount the notify socket to /run/host/notify as documented in the container interface. Since we don't guarantee a stable location for the notify socket and insist users use $NOTIFY_SOCKET to get its path, this is safe to do.	2024-12-13 13:48:07 +00:00
Daan De Meyer	284dd31e9d	core: Bind mount notify socket to /run/host/notify in sandboxed units To be able to run systemd in a Type=notify transient unit, the notify socket can't be bind mounted to /run/systemd/notify as systemd in the transient unit wants to use that as its own notify socket which conflicts with systemd on the host. Instead, for sandboxed units, let's bind mount the notify socket to /run/host/notify as documented in the container interface. Since we don't guarantee a stable location for the notify socket and insist users use $NOTIFY_SOCKET to get its path, this is safe to do.	2024-12-13 13:37:02 +01:00
Daan De Meyer	5575bf5fac	core/namespace: several fixes for recently merged PRs (#35580) Fixes #35546. Fixes #35566.	2024-12-13 12:34:11 +00:00
Luca Boccassi	6dfd290031	core: Add PrivateUsers=full (#35183) Recently, PrivateUsers=identity was added to support mapping the first 65536 UIDs/GIDs from parent to the child namespace and mapping the other UID/GIDs to the nobody user. However, there are use cases where users have UIDs/GIDs > 65536 and need to do a similar identity mapping. Moreover, in some of those cases, users want a full identity mapping from 0 -> UID_MAX. To support this, we add PrivateUsers=full that does identity mapping for all available UID/GIDs. Note to differentiate ourselves from the init user namespace, we need to set up the uid_map/gid_map like: ``` 0 0 1 1 1 UINT32_MAX - 1 ``` as the init user namespace uses `0 0 UINT32_MAX` and some applications - like systemd itself - determine if it's a non-init user namespace based on uid_map/gid_map files. Note systemd will remove this heuristic in running_in_userns() in version 258 (https://github.com/systemd/systemd/pull/35382) and uses namespace inode. But some users may be running a container image with older systemd < 258 so we keep this hack until version 259 for version N-1 compatibility. In addition to mapping the whole UID/GID space, we also set /proc/pid/setgroups to "allow". While we usually set "deny" to avoid security issues with dropping supplementary groups (https://lwn.net/Articles/626665/), this ends up breaking dbus-broker when running /sbin/init in full OS containers. Fixes: #35168 Fixes: #35425	2024-12-13 12:25:13 +00:00
Ryan Wilson	2665425176	core: Set /proc/pid/setgroups to allow for PrivateUsers=full When trying to run dbus-broker in a systemd unit with PrivateUsers=full, we see dbus-broker fails with EPERM at `util_audit_drop_permissions`. The root cause is dbus-broker calls the setgroups() system call and this is disallowed via systemd's implementation of PrivateUsers= by setting /proc/pid/setgroups = deny. This is done to remediate potential privilege escalation vulnerabilities in user namespaces where an attacker can remove supplementary groups and gain access to resources where those groups are restricted. However, for OS-like containers, setgroups() is a pretty common API and disabling it is not feasible. So we allow setgroups() by setting /proc/pid/setgroups to allow in PrivateUsers=full. Note security conscious users can still use SystemCallFilter= to disable setgroups() if they want to specifically prevent this system call. Fixes: #35425	2024-12-12 11:36:10 +00:00
Yu Watanabe	2e6025b1b1	core/namespace: use ProtectHostname in NamespaceParameters To make the type of NamespaceParameters.protect_hostname consistent with the one in ExecContext. Addresses https://github.com/systemd/systemd/pull/35447#discussion_r1880372452. Fixes #35566.	2024-12-12 19:33:34 +09:00
Daan De Meyer	15816441ca	namespace: Rename notify_socket to host_notify_socket Preparation for next commit.	2024-12-11 19:08:38 +00:00
Ryan Wilson	cf48bde7ae	core: Add ProtectHostname=private This allows an option for systemd exec units to enable UTS namespaces but not restrict changing hostname via seccomp. Thus, units can change hostname without affecting the host. Fixes: #30348	2024-12-06 13:34:04 -08:00
Ryan Wilson	6746f28854	core: Migrate ProtectHostname to use enum vs boolean Migrating ProtectHostname to enum will set the stage for adding more properties like ProtectHostname=private in future commits. In addition, we add ProtectHostnameEx property to dbus API which uses string instead of boolean.	2024-12-06 13:33:49 -08:00
Ryan Wilson	705cc82938	core: Add PrivateUsers=full Recently, PrivateUsers=identity was added to support mapping the first 65536 UIDs/GIDs from parent to the child namespace and mapping the other UID/GIDs to the nobody user. However, there are use cases where users have UIDs/GIDs > 65536 and need to do a similar identity mapping. Moreover, in some of those cases, users want a full identity mapping from 0 -> UID_MAX. Note to differentiate ourselves from the init user namespace, we need to set up the uid_map/gid_map like: ``` 0 0 1 1 1 UINT32_MAX - 1 ``` as the init user namespace uses `0 0 UINT32_MAX` and some applications - like systemd itself - determine if it's a non-init user namespace based on uid_map/gid_map files. Note systemd will remove this heuristic in running_in_userns() in version 258 and uses namespace inode. But some users may be running a container image with older systemd < 258 so we keep this hack until version 259. To support this, we add PrivateUsers=full that does identity mapping for all available UID/GIDs. Fixes: #35168	2024-12-05 10:34:32 -08:00
Mike Yuan	b718b86e1b	core/exec-invoke: suppress placeholder home only in build_environment() Currently, get_fixed_user() employs USER_CREDS_SUPPRESS_PLACEHOLDER, meaning home path is set to NULL if it's empty or root. However, the path is also used for applying WorkingDirectory=~, and we'd spuriously use the invoking user's home as fallback even if User= is changed in that case. Let's instead delegate such suppression to build_environment(), so that home is properly initialized for usage at other steps. shell doesn't actually suffer from such problem, but it's changed too for consistency. Alternative to #34789	2024-11-19 00:38:18 +01:00
Mike Yuan	d911778877	core/exec-invoke: minor cleanup for apply_working_directory() error handling Assign exit_status at the same site where error log is emitted, for readability.	2024-11-19 00:38:18 +01:00
Mike Yuan	eea9d3eb10	basic/user-util: split out placeholder suppression from USER_CREDS_CLEAN into its own flag No functional change, preparation for later commits.	2024-11-19 00:38:18 +01:00
Ivan Kruglov	c0589b0227	use report_errno_and_exit() in src/core/exec-invoke.c	2024-11-06 11:18:38 +01:00
Daan De Meyer	406f177501	core: Introduce PrivatePIDs= This new setting allows unsharing the pid namespace in a unit. Because you have to fork to get a process into a pid namespace, we fork in systemd-executor to get into the new pid namespace. The parent then sends the pid of the child process back to the manager and exits while the child process continues on with the rest of exec_invoke() and then executes the actual payload. Communicating the child pid is done via a new pidref socket pair that is set up on manager startup. We unshare the PID namespace right before the mount namespace so we mount procfs correctly. Note PrivatePIDs=yes always implies MountAPIVFS=yes to mount procfs. When running unprivileged in a user session, user namespace is set up first to allow for PID namespace to be unshared. However, when running in privileged mode, we unshare the user namespace last to ensure the user namespace does not own the PID namespace and cannot break out of the sandbox. Note we disallow Type=forking services from using PrivatePIDs=yes since the init process inside the PID namespace must not exit for other processes in the namespace to exist. Note Daan De Meyer did the original work for this commit with Ryan Wilson addressing follow-ups. Co-authored-by: Daan De Meyer <daan.j.demeyer@gmail.com>	2024-11-05 05:32:02 -08:00
Daan De Meyer	89fdca7168	exec-invoke: Add debug logging for setup_private_users()	2024-11-04 09:19:36 -08:00
Andres Beltran	eae5127246	core: add id-mapped mount support for Exec directories	2024-11-01 18:45:28 +00:00
Luca Boccassi	890bdd1d77	core: add read-only flag for exec directories When an exec directory is shared between services, this allows one of the service to be the producer of files, and the other the consumer, without letting the consumer modify the shared files. This will be especially useful in conjunction with id-mapped exec directories so that fully sandboxed services can share directories in one direction, safely.	2024-11-01 10:46:55 +00:00
Lennart Poettering	2ef87de9d3	core: add EXEC_DIRECTORY_TYPE_SHALL_CHOWN() helper Let's make ConfigurationDirectory= a bit less "special-casey", by hiding the fact that it's the only per-service dir we do not do chown()ing for inside of a new EXEC_DIRECTORY_TYPE_SHALL_CHOWN() helper.	2024-10-30 13:33:29 +01:00
Ryan Wilson	cd58b5a135	cgroup: Add support for ProtectControlGroups= private and strict This commit adds two settings private and strict to the ProtectControlGroups= property. Private will unshare the cgroup namespace and mount a read-write private cgroup2 filesystem at /sys/fs/cgroup. Strict does the same except the mount is read-only. Since the unit is running in a cgroup namespace, the new root of /sys/fs/cgroup is the unit's own cgroup. We also add a new dbus property ProtectControlGroupsEx which accepts strings instead of boolean. This will allow users to use private/strict via dbus and systemd-run in addition to service files. Note private and strict fall back to no and yes respectively if the kernel doesn't support cgroup2 or system is not using unified hierarchy. Fixes: #34634	2024-10-28 08:37:36 -07:00
Ryan Wilson	5fe2923828	core: Refactor ProtectControlGroups= to use enum vs bool This commit refactors ProtectControlGroups= from using a boolean in the dbus/execute backend to using an enum. There is no functional change but this will allow adding new non-boolean values (e.g. strict, private) a la ProtectHome.	2024-10-28 06:42:53 -07:00
Ryan Wilson	141dfbe537	core: Add RootDirectory= path to error message if directory does not exist A colleague reported when RootDirectory= does not exist, systemd reports an error like: ``` Failed to set up mount namespacing: No such file or directory ``` Unfortunately, with large spec files, it can be hard to diagnose which path systemd is talking about. Thus, to make the error message more helpful and similar to mount error messages, we add the root directory/image path into the error message like: ``` Failed to set up mount namespacing: /tmp/thisdoesnotexist: No such file or directory ```	2024-10-26 15:33:30 -07:00
Ryan Wilson	e73c042be6	core/execute: Rename error_path -> reterr_path/ret_path per coding guidelines This is a non-functional change to ensure error_path used to print out the offending mount causing an error follows coding guidelines.	2024-10-26 15:28:49 -07:00
Yu Watanabe	7354936ef7	core/cgroup: rename CGROUP_PRESSURE_WATCH_ON/OFF -> CGROUP_PRESSURE_WATCH_YES/NO No functional change, but let's print yes/no rather than on/off in systemd-analyze. Similar to `2e8a581b9c` and `edd3f4d9b7`. (Note, the commit messages of those commits are wrong, as parse_boolean() supports on/off anyway.)	2024-10-27 03:04:35 +09:00
Lennart Poettering	e4b4d9cc7a	core: make sure that if PAMName= is set we always do the full user changing even if no user is specified explicitly When PAMName= is set this should be enough to go through our entire user changing story, so that PAM is definitely run, and environment variables definitely pulled in and so on. Previously, it would happen that under some circumstances we might not do this when transitioning from root to root itself even though PAM was enabled. Fixes: #34682	2024-10-24 22:37:00 +02:00
Łukasz Stelmach	20bbf5ee4c	core: don't forget about fallback_smack_process_label Call setup_smack() also when only fallback_smack_process_label is set. Fixes: `75689fb2d4`	2024-10-24 03:24:29 +09:00
Yu Watanabe	2e8a581b9c	core: drop implicit support of PrivateTmp=off Follow-up for `0e551b04ef`. Similar to the previous commit, but for PrivateTmp=.	2024-10-09 08:11:42 +09:00
Yu Watanabe	edd3f4d9b7	core: drop implicit support of PrivateUsers=off Follow-up for `fa693fdc7e`. The documentation says the option takes a boolean or one of the "self" and "identity". But the parser uses private_users_from_string() which also accepts "off". Let's drop the implicit support of "off".	2024-10-09 05:39:54 +09:00
Ryan Wilson	3543456f84	Add ExtraFileDescriptor property to StartTransientUnit dbus API This adds the ExtraFileDescriptor property to StartTransient dbus API with format "a(hs)" - array of (file descriptor, name) pairs. The FD will be passed to the unit via sd_notify like Socket and OpenFile. systemctl show also shows ExtraFileDescriptorName for these transient units. We only show the name passed to dbus as the FD numbers will change once passed over the unix socket and are duplicated, so it's confusing to display the numbers. We do not add this functionality for systemd-run or general systemd service units as it is not useful for general systemd services. Arguably, it could be useful for systemd-run in bash scripts but we prefer to be cautious and not expose the API yet. Fixes: #34396	2024-10-07 09:01:48 -07:00
Mike Yuan	3f8999a76e	fs-util: rename laccess to access_nofollow In order to distinguish it from libc function naming.	2024-10-05 01:30:43 +02:00
Daan De Meyer	fa693fdc7e	core: Add support for PrivateUsers=identity This configures an identity mapping similar to systemd-nspawn --private-users=identity.	2024-09-09 18:31:01 +02:00
Lennart Poettering	41902bacc3	Merge pull request #34256 from YHNdnzj/pid1-followup core: follow-ups for recent PRs	2024-09-05 17:01:10 +02:00
Mike Yuan	7a9f0125bb	core: rename BindJournalSockets= to BindLogSockets= Addresses https://github.com/systemd/systemd/pull/32487#issuecomment-2328465309	2024-09-04 21:44:25 +02:00
Mike Yuan	7583859ba8	core/exec-invoke: use bind_mount_add() where appropriate	2024-09-04 21:44:24 +02:00
Daan De Meyer	b1cfa93080	copy: Introduce COPY_NOCOW_AFTER and use it when copying images When dealing with copying COW images, we have to make a tradeoff: - Either we don't touch the NOCOW bit on the copied file COW and get an instant copy because we're able to reflink, but we might get reduced performance if the source file was COW as COW files and lots of random writes don't play well together. - Or we force NOCOW for the copied file, which means we have to do a full copy as reflinking from COW files to NOCOW files or vice versa is not supported. In exec-invoke.c, we've opted for the first option. In nspawn.c and discover-image.c, we've opted for the second option. In nspawn, this applies to the --ephemeral option to make ephemeral copies. In discover-image.c, this applies to cloning images into /var/lib/machines. Both these features might be used to run many machines of the same original image. We really don't want to force a full copy onto users in these scenarios when they're expecting reflink behavior, leading to them running out of disk space. Instead, degraded performance in their machines is a much less severe issue, which they will discover on their own if it affects them, at which point they can make their original image NOCOW at which point they'll get both the reflinks and better performance. Given the above reasoning, let's switch nspawn.c and discover-image.c to use COPY_NOCOW_AFTER as well instead of enabling NOCOW upfront and forcing a copy if the original source image is COW.	2024-09-04 19:23:16 +02:00
Daan De Meyer	519216b71f	Revert "tree-wide: Don't explicitly disable copy-on-write when copying images" Let's still try to disable COW after copying. It won't do much, but it doesn't hurt either. See https://github.com/systemd/systemd/pull/33825/files#r1727288871. This reverts commit `42e9288180`.	2024-09-04 18:49:05 +02:00
Mike Yuan	368a3071e9	core: introduce BindJournalSockets= Closes #32478	2024-09-03 21:04:50 +02:00
Lennart Poettering	9c0aee7cbb	exec-invoke: remove redundant empty lines	2024-08-27 16:20:23 +02:00
Luca Boccassi	7d8bbfbe08	service: add 'debug' option to RestartMode= One of the major pain points of managing fleets of headless nodes is that when something fails at startup, unless debug level was already enabled (which usually isn't, as it's a firehose), one needs to manually enable it and pray the issue can be reproduced, which often is really hard and time consuming, just to get extra info. Usually the extra log messages are enough to triage an issue. This new option makes it so that when a service fails and is restarted due to Restart=, log level for that unit is set to debug, so that all setup code in pid1 and sd-executor logs at debug level, and also a new DEBUG_INVOCATION=1 env var is passed to the service itself, so that it knows it should start with a higher log level. Once the unit succeeds or reaches the rate limit the original level is restored.	2024-08-27 12:24:45 +01:00
Ivan Shapovalov	b73c86c695	core/exec-invoke: document calling setpriority() after sched_setattr() Fixes: `711a157738` ("core/exec-invoke: call setpriority() after sched_setattr()")	2024-08-22 04:25:29 +09:00
Lennart Poettering	300b7e7620	tree-wide: use isatty_safe() more	2024-08-20 11:11:53 +02:00
Ivan Shapovalov	711a157738	core/exec-invoke: call setpriority() after sched_setattr() The nice value is part of struct sched_attr, and consequently invoking sched_setattr() after setpriority() would clobber the nice value with the default (as we are not setting it in struct sched_attr). It would be best to combine both calls, but for now simply invoke setpriority() after sched_setattr() to make sure Nice= remains effective when used together with CPUSchedulingPolicy=.	2024-08-10 19:09:14 +02:00
Mike Yuan	3386f66200	cgroup-setup: drop unused cg_migrate_callback for cg_attach_everywhere() While at it, move the typedef from cgroup-util to -setup.	2024-08-02 14:47:39 +02:00
Łukasz Stelmach	18d51ec876	Revert "execute: Call capability_ambient_set_apply even if ambient set is 0" With ambient capabilities being dropped at the start of process managers (both system and user) as well as systemd-executor it isn't necessary to drop them here. Moreover, at this point also the inheritable set can be preserved. This makes it possible to assign a user session manager inheritable capabilities which combined with file capabilities (ei sets) of service executables enable running user services with capabilities but only when started by the manager. This reverts commit `943800f4e7`.	2024-07-31 11:09:58 +02:00
Daan De Meyer	42e9288180	tree-wide: Don't explicitly disable copy-on-write when copying images Since the copy helpers now copy file attributes as well, let's not explicitly disable copy-on-write anymore when we copy an image. If the source already has copy-on-write disabled, the copy will have it disabled as well. Otherwise, the copy will also have copy-on-write enabled. This makes sure that reflinks always work as reflink is only supported if both source and target are copy-on-write or both source and target are not copy-on-write.	2024-07-25 11:56:07 +02:00
Lennart Poettering	e846854172	execute: add FIXME comment As requested by @YHNdnzj: https://github.com/systemd/systemd/pull/33707#discussion_r1684055699	2024-07-19 18:59:01 +02:00
Lennart Poettering	e2d66781ee	exec-invoke: use EBADF where appropriate	2024-07-19 11:44:04 +02:00
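
The uid_map/gid_map layout described in the PrivateUsers=full commits above (`6dfd290031`, `705cc82938`) can be sketched as follows. This is a minimal illustration, not systemd's code: the function names are hypothetical, and `looks_like_init_userns()` is only a simplified stand-in for the kind of map-based heuristic the commit message mentions. It shows why the full identity mapping is split into two entries rather than written as the single `0 0 UINT32_MAX` entry used by the init user namespace.

```python
# Sketch of the two-entry identity map described for PrivateUsers=full.
# All names here are hypothetical and chosen for illustration only.

UINT32_MAX = 2**32 - 1  # 4294967295

# Each entry is (first ID inside namespace, first ID outside, count).
INIT_USERNS_MAP = [(0, 0, UINT32_MAX)]


def full_identity_map():
    """Return the split identity map: '0 0 1' plus '1 1 UINT32_MAX - 1'."""
    return [(0, 0, 1), (1, 1, UINT32_MAX - 1)]


def format_map(entries):
    """Render entries in the line format used by uid_map/gid_map files."""
    return "".join(f"{inside} {outside} {count}\n"
                   for inside, outside, count in entries)


def looks_like_init_userns(entries):
    """Simplified assumption: treat a single '0 0 UINT32_MAX' entry as init."""
    return entries == INIT_USERNS_MAP


def mapped_id_count(entries):
    """Total number of IDs covered by the map."""
    return sum(count for _, _, count in entries)


if __name__ == "__main__":
    print(format_map(full_identity_map()), end="")
```

Both maps cover the same number of IDs; only the split into two entries distinguishes the PrivateUsers=full map from the init user namespace map under this simplified check.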


134 Commits