Commit Graph

134 Commits

Author SHA1 Message Date
Lennart Poettering
00a415fc8f tree-wide: remove support for kernels lacking ambient caps
Let's bump the kernel baseline a bit to 4.3 and thus require ambient
caps.

This allows us to remove support for a variety of special casing, most
importantly the ExecStart=!! hack.
2024-12-17 17:34:46 +01:00
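A minimal sketch of checking a running kernel against the new 4.3 baseline, assuming the release string starts with the usual `major.minor` pair. `kernel_at_least` is a hypothetical helper for illustration, not something systemd provides.

```python
import platform
import re


def kernel_at_least(release, major, minor):
    """Return True if a release string such as '6.8.0-45-generic' is >= major.minor."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= (major, minor)


if __name__ == "__main__":
    # Ambient capabilities require at least the 4.3 baseline named above.
    print(kernel_at_least(platform.release(), 4, 3))
```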
Yu Watanabe
e76fcd0e40 core: make ProtectHostname= optionally take a hostname
Closes #35623.
2024-12-16 23:55:44 +09:00
Yu Watanabe
0d298a771a core/exec-invoke: fix ProtectHostname= value in log message
Follow-up for cf48bde7ae.
2024-12-16 23:55:44 +09:00
Daan De Meyer
18bb30c3b2 core: Bind mount notify socket to /run/host/notify in sandboxed units (#35573)
To be able to run systemd in a Type=notify transient unit, the notify
socket can't be bind mounted to /run/systemd/notify as systemd in the
transient unit wants to use that as its own notify socket which
conflicts with systemd on the host.

Instead, for sandboxed units, let's bind mount the notify socket to
/run/host/notify as documented in the container interface. Since we
don't guarantee a stable location for the notify socket and insist users
use $NOTIFY_SOCKET to get its path, this is safe to do.
2024-12-13 13:48:07 +00:00
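Because the socket location is not stable, a client should take the path from $NOTIFY_SOCKET rather than hard-coding it. A minimal sketch of such a client follows; `notify` is a hypothetical helper written for illustration, and real services would normally use sd_notify() from libsystemd.

```python
import os
import socket


def notify(state, environ=os.environ):
    """Send a state string such as 'READY=1' to the socket named by $NOTIFY_SOCKET.

    Returns False if the variable is unset, True once the datagram was sent.
    """
    path = environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    # A leading '@' denotes a Linux abstract-namespace socket.
    address = "\0" + path[1:] if path.startswith("@") else path
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), address)
    return True
```

The same code works whether the socket is bind mounted at /run/systemd/notify or /run/host/notify, since only the environment variable is consulted.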
Daan De Meyer
284dd31e9d core: Bind mount notify socket to /run/host/notify in sandboxed units
To be able to run systemd in a Type=notify transient unit, the notify
socket can't be bind mounted to /run/systemd/notify as systemd in the
transient unit wants to use that as its own notify socket which conflicts
with systemd on the host.

Instead, for sandboxed units, let's bind mount the notify socket to
/run/host/notify as documented in the container interface. Since we don't
guarantee a stable location for the notify socket and insist users use
$NOTIFY_SOCKET to get its path, this is safe to do.
2024-12-13 13:37:02 +01:00
Daan De Meyer
5575bf5fac core/namespace: several fixes for recently merged PRs (#35580)
Fixes #35546.
Fixes #35566.
2024-12-13 12:34:11 +00:00
Luca Boccassi
6dfd290031 core: Add PrivateUsers=full (#35183)
Recently, PrivateUsers=identity was added to support mapping the first
65536 UIDs/GIDs from parent to the child namespace and mapping the other
UID/GIDs to the nobody user.

However, there are use cases where users have UIDs/GIDs > 65536 and need
to do a similar identity mapping. Moreover, in some of those cases,
users want a full identity mapping from 0 -> UID_MAX.

To support this, we add PrivateUsers=full that does identity mapping for
all available UID/GIDs.

Note to differentiate ourselves from the init user namespace, we need to
set up the uid_map/gid_map like:
```
0 0 1
1 1 UINT32_MAX - 1
```

as the init user namespace uses `0 0 UINT32_MAX` and some applications
- like systemd itself - determine if it's a non-init user namespace based
on uid_map/gid_map files.

Note systemd will remove this heuristic in running_in_userns() in
version 258 (https://github.com/systemd/systemd/pull/35382) and use the
namespace inode instead. But some users may be running a container image
with older systemd < 258 so we keep this hack until version 259 for
version N-1 compatibility.

In addition to mapping the whole UID/GID space, we also set
/proc/pid/setgroups to "allow". While we usually set "deny" to avoid
security issues with dropping supplementary groups
(https://lwn.net/Articles/626665/), this ends up breaking dbus-broker
when running /sbin/init in full OS containers.

Fixes: #35168
Fixes: #35425
2024-12-13 12:25:13 +00:00
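The two-line mapping and the heuristic it is meant to defeat can be sketched as follows. All function names here are hypothetical, and `looks_like_init_userns` only approximates the uid_map-based check described in the message, not the actual running_in_userns() code.

```python
UINT32_MAX = 2**32 - 1


def full_identity_map():
    """Entries (inside, outside, count): root on its own, then every other ID."""
    return [(0, 0, 1), (1, 1, UINT32_MAX - 1)]


def format_map(entries):
    """Render entries in the text form written to uid_map/gid_map."""
    return "".join(f"{inside} {outside} {count}\n" for inside, outside, count in entries)


def parse_map(text):
    """Parse uid_map/gid_map text into a list of integer triples."""
    return [
        tuple(int(field) for field in line.split())
        for line in text.splitlines()
        if line.strip()
    ]


def looks_like_init_userns(text):
    """Approximate heuristic: a single '0 0 UINT32_MAX' entry means the init namespace."""
    return parse_map(text) == [(0, 0, UINT32_MAX)]
```

Both layouts cover the same UINT32_MAX IDs; splitting off the first entry only changes what a map-based heuristic sees.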
Ryan Wilson
2665425176 core: Set /proc/pid/setgroups to allow for PrivateUsers=full
When trying to run dbus-broker in a systemd unit with PrivateUsers=full,
we see dbus-broker fails with EPERM at `util_audit_drop_permissions`.

The root cause is dbus-broker calls the setgroups() system call and this
is disallowed via systemd's implementation of PrivateUsers= by setting
/proc/pid/setgroups = deny. This is done to remediate potential privilege
escalation vulnerabilities in user namespaces where an attacker can remove
supplementary groups and gain access to resources where those groups are
restricted.

However, for OS-like containers, setgroups() is a pretty common API and
disabling it is not feasible. So we allow setgroups() by setting
/proc/pid/setgroups to allow in PrivateUsers=full. Note security conscious
users can still use SystemCallFilter= to disable setgroups() if they want
to specifically prevent this system call.

Fixes: #35425
2024-12-12 11:36:10 +00:00
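The resulting policy can be summarized in a small sketch. `setgroups_policy` is a hypothetical helper, and the mode names are the PrivateUsers= values mentioned in this log; it models the decision only and does not write to /proc.

```python
KNOWN_MODES = ("no", "self", "identity", "full")


def setgroups_policy(private_users):
    """Value written to /proc/<pid>/setgroups for a PrivateUsers= mode, or None."""
    if private_users not in KNOWN_MODES:
        raise ValueError(f"unknown PrivateUsers= mode: {private_users!r}")
    if private_users == "no":
        return None  # no user namespace is set up, so nothing is written
    if private_users == "full":
        return "allow"  # OS-like containers need a working setgroups()
    return "deny"  # default hardening for the other mappings
```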
Yu Watanabe
2e6025b1b1 core/namespace: use ProtectHostname in NamespaceParameters
To make the type of NamespaceParameters.protect_hostname consistent
with the one in ExecContext.

Addresses https://github.com/systemd/systemd/pull/35447#discussion_r1880372452.
Fixes #35566.
2024-12-12 19:33:34 +09:00
Daan De Meyer
15816441ca namespace: Rename notify_socket to host_notify_socket
Preparation for next commit.
2024-12-11 19:08:38 +00:00
Ryan Wilson
cf48bde7ae core: Add ProtectHostname=private
This allows an option for systemd exec units to enable UTS namespaces
but not restrict changing hostname via seccomp. Thus, units can change
hostname without affecting the host.

Fixes: #30348
2024-12-06 13:34:04 -08:00
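A sketch of how the three values differ, assuming that "yes" keeps its existing behaviour of both unsharing the UTS namespace and filtering hostname changes. The enum and helper names are illustrative and do not mirror the C definitions.

```python
from enum import Enum


class ProtectHostname(Enum):
    NO = "no"
    YES = "yes"
    PRIVATE = "private"


def unshares_uts_namespace(mode):
    """Both 'yes' and 'private' give the unit its own UTS namespace."""
    return mode is not ProtectHostname.NO


def filters_hostname_syscalls(mode):
    """Only 'yes' additionally blocks hostname changes via seccomp."""
    return mode is ProtectHostname.YES
```

With `ProtectHostname("private")` the unit can change its hostname, but the change stays inside its own namespace.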
Ryan Wilson
6746f28854 core: Migrate ProtectHostname to use enum vs boolean
Migrating ProtectHostname to enum will set the stage for adding more
properties like ProtectHostname=private in future commits.

In addition, we add ProtectHostnameEx property to dbus API which uses
string instead of boolean.
2024-12-06 13:33:49 -08:00
Ryan Wilson
705cc82938 core: Add PrivateUsers=full
Recently, PrivateUsers=identity was added to support mapping the first
65536 UIDs/GIDs from parent to the child namespace and mapping the other
UID/GIDs to the nobody user.

However, there are use cases where users have UIDs/GIDs > 65536 and need
to do a similar identity mapping. Moreover, in some of those cases, users
want a full identity mapping from 0 -> UID_MAX.

Note to differentiate ourselves from the init user namespace, we need to
set up the uid_map/gid_map like:
```
0 0 1
1 1 UINT32_MAX - 1
```

as the init user namespace uses `0 0 UINT32_MAX` and some applications -
like systemd itself - determine if it's a non-init user namespace based on
uid_map/gid_map files. Note systemd will remove this heuristic in
running_in_userns() in version 258 and use the namespace inode instead. But
some users may be running a container image with older systemd < 258 so we
keep this hack until version 259.

To support this, we add PrivateUsers=full that does identity mapping for
all available UID/GIDs.

Fixes: #35168
2024-12-05 10:34:32 -08:00
Mike Yuan
b718b86e1b core/exec-invoke: suppress placeholder home only in build_environment()
Currently, get_fixed_user() employs USER_CREDS_SUPPRESS_PLACEHOLDER,
meaning home path is set to NULL if it's empty or root. However,
the path is also used for applying WorkingDirectory=~, and we'd
spuriously use the invoking user's home as fallback even if
User= is changed in that case.

Let's instead delegate such suppression to build_environment(),
so that home is properly initialized for usage at other steps.
The shell field doesn't actually suffer from this problem, but it's
changed too for consistency.

Alternative to #34789
2024-11-19 00:38:18 +01:00
Mike Yuan
d911778877 core/exec-invoke: minor cleanup for apply_working_directory() error handling
Assign exit_status at the same site where error log is emitted,
for readability.
2024-11-19 00:38:18 +01:00
Mike Yuan
eea9d3eb10 basic/user-util: split out placeholder suppression from USER_CREDS_CLEAN into its own flag
No functional change, preparation for later commits.
2024-11-19 00:38:18 +01:00
Ivan Kruglov
c0589b0227 use report_errno_and_exit() in src/core/exec-invoke.c
2024-11-06 11:18:38 +01:00
Daan De Meyer
406f177501 core: Introduce PrivatePIDs=
This new setting allows unsharing the pid namespace in a unit. Because
you have to fork to get a process into a pid namespace, we fork in
systemd-executor to get into the new pid namespace. The parent then
sends the pid of the child process back to the manager and exits while
the child process continues on with the rest of exec_invoke() and then
executes the actual payload.

Communicating the child pid is done via a new pidref socket pair that is
set up on manager startup.

We unshare the PID namespace right before the mount namespace so we
mount procfs correctly. Note PrivatePIDs=yes always implies MountAPIVFS=yes
to mount procfs.

When running unprivileged in a user session, user namespace is set up first
to allow for PID namespace to be unshared. However, when running in
privileged mode, we unshare the user namespace last to ensure the user
namespace does not own the PID namespace and cannot break out of the sandbox.

Note we disallow Type=forking services from using PrivatePIDs=yes since the
init process inside the PID namespace must not exit for other processes in
the namespace to exist.

Note Daan De Meyer did the original work for this commit with Ryan Wilson
addressing follow-ups.

Co-authored-by: Daan De Meyer <daan.j.demeyer@gmail.com>
2024-11-05 05:32:02 -08:00
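The fork-and-report handshake can be sketched as below. This is a simplified model under stated assumptions: it sends a plain integer over an ordinary socket pair rather than a pidref, it does not actually unshare a PID namespace (which needs privileges), and `spawn_via_intermediate` is a hypothetical name.

```python
import os
import socket
import struct


def spawn_via_intermediate(payload):
    """Fork twice; the intermediate reports the grandchild's PID and exits."""
    manager_end, executor_end = socket.socketpair()
    intermediate = os.fork()
    if intermediate == 0:
        # Intermediate process, playing the executor's role in this sketch.
        status = 1
        try:
            manager_end.close()
            # A real implementation would unshare the PID namespace here, so
            # that the next fork lands inside the new namespace.
            child = os.fork()
            if child == 0:
                # Grandchild: carries on and runs the actual payload.
                executor_end.close()
                try:
                    status = int(payload() or 0)
                finally:
                    os._exit(status)
            executor_end.sendall(struct.pack("!i", child))
            status = 0
        finally:
            os._exit(status)
    # Manager side: read the reported PID, then reap the intermediate.
    executor_end.close()
    data = manager_end.recv(4)
    manager_end.close()
    os.waitpid(intermediate, 0)
    if len(data) != 4:
        raise RuntimeError("intermediate exited without reporting a PID")
    return struct.unpack("!i", data)[0]
```

The extra fork is needed because unsharing a PID namespace only affects children created afterwards, never the calling process itself.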
Daan De Meyer
89fdca7168 exec-invoke: Add debug logging for setup_private_users()
2024-11-04 09:19:36 -08:00
Andres Beltran
eae5127246 core: add id-mapped mount support for Exec directories
2024-11-01 18:45:28 +00:00
Luca Boccassi
890bdd1d77 core: add read-only flag for exec directories
When an exec directory is shared between services, this allows one of the
service to be the producer of files, and the other the consumer, without
letting the consumer modify the shared files.
This will be especially useful in conjunction with id-mapped exec directories
so that fully sandboxed services can share directories in one direction, safely.
2024-11-01 10:46:55 +00:00
Lennart Poettering
2ef87de9d3 core: add EXEC_DIRECTORY_TYPE_SHALL_CHOWN() helper
Let's make ConfigurationDirectory= a bit less "special-casey", by hiding
the fact that it's the only per-service dir we do not do chown()ing for
inside of a new EXEC_DIRECTORY_TYPE_SHALL_CHOWN() helper.
2024-10-30 13:33:29 +01:00
Ryan Wilson
cd58b5a135 cgroup: Add support for ProtectControlGroups= private and strict
This commit adds two settings private and strict to
the ProtectControlGroups= property. Private will unshare the cgroup
namespace and mount a read-write private cgroup2 filesystem at /sys/fs/cgroup.
Strict does the same except the mount is read-only. Since the unit is
running in a cgroup namespace, the new root of /sys/fs/cgroup is the unit's
own cgroup.

We also add a new dbus property ProtectControlGroupsEx which accepts strings
instead of boolean. This will allow users to use private/strict via dbus
and systemd-run in addition to service files.

Note private and strict fall back to no and yes respectively if the kernel
doesn't support cgroup2 or system is not using unified hierarchy.

Fixes: #34634
2024-10-28 08:37:36 -07:00
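The fallback rule can be sketched as follows. The helper names are hypothetical, and checking for `cgroup.controllers` is a common heuristic for detecting the unified hierarchy rather than the check systemd itself performs.

```python
import os

VALID_SETTINGS = ("no", "yes", "private", "strict")


def has_unified_hierarchy(root="/sys/fs/cgroup"):
    """Heuristic: the root of a cgroup2 mount exposes a cgroup.controllers file."""
    return os.path.exists(os.path.join(root, "cgroup.controllers"))


def effective_setting(setting, unified):
    """Apply the documented fallback when cgroup2 is not available."""
    if setting not in VALID_SETTINGS:
        raise ValueError(f"unknown ProtectControlGroups= value: {setting!r}")
    if unified:
        return setting
    # Without cgroup2, private degrades to no and strict degrades to yes.
    return {"private": "no", "strict": "yes"}.get(setting, setting)
```

A caller would typically combine the two: `effective_setting("strict", has_unified_hierarchy())`.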
Ryan Wilson
5fe2923828 core: Refactor ProtectControlGroups= to use enum vs bool
This commit refactors ProtectControlGroups= from using a boolean
in the dbus/execute backend to using an enum. There is no functional
change but this will allow adding new non-boolean values (e.g. strict,
private) a la ProtectHome=.
2024-10-28 06:42:53 -07:00
Ryan Wilson
141dfbe537 core: Add RootDirectory= path to error message if directory does not exist
A colleague reported when RootDirectory= does not exist, systemd reports an error like:
```
Failed to set up mount namespacing: No such file or directory
```

Unfortunately, with large spec files, it can be hard to diagnose which path systemd is talking
about. Thus, to make the error message more helpful and similar to mount error messages, we add
the root directory/image path into the error message like:
```
Failed to set up mount namespacing: /tmp/thisdoesnotexist: No such file or directory
```
2024-10-26 15:33:30 -07:00
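The change in message shape amounts to inserting the offending path between the prefix and the errno text. A minimal sketch, with `describe_mount_failure` as a hypothetical helper:

```python
import errno
import os


def describe_mount_failure(err, path=None):
    """Build the log message, naming the offending path when one is known."""
    prefix = "Failed to set up mount namespacing"
    reason = os.strerror(err)
    if path:
        return f"{prefix}: {path}: {reason}"
    return f"{prefix}: {reason}"


if __name__ == "__main__":
    print(describe_mount_failure(errno.ENOENT, "/tmp/thisdoesnotexist"))
```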
Ryan Wilson
e73c042be6 core/execute: Rename error_path -> reterr_path/ret_path per coding guidelines
This is a non-functional change to ensure error_path used to print out the
offending mount causing an error follows coding guidelines.
2024-10-26 15:28:49 -07:00
Yu Watanabe
7354936ef7 core/cgroup: rename CGROUP_PRESSURE_WATCH_ON/OFF -> CGROUP_PRESSURE_WATCH_YES/NO
No functional change, but let's print yes/no rather than on/off in systemd-analyze.

Similar to 2e8a581b9c and
edd3f4d9b7.
(Note, the commit messages of those commits are wrong, as
 parse_boolean() supports on/off anyway.)
2024-10-27 03:04:35 +09:00
Lennart Poettering
e4b4d9cc7a core: make sure that if PAMName= is set we always do the full user changing even if no user is specified explicitly
When PAMName= is set this should be enough to go through our entire user
changing story, so that PAM is definitely run, and environment variables
definitely pulled in and so on.

Previously, it would happen that under some circumstances we might not do
this when transitioning from root to root itself even though PAM was
enabled.

Fixes: #34682
2024-10-24 22:37:00 +02:00
Łukasz Stelmach
20bbf5ee4c core: don't forget about fallback_smack_process_label
Call setup_smack() also when only fallback_smack_process_label is set.

Fixes: 75689fb2d4
2024-10-24 03:24:29 +09:00
Yu Watanabe
2e8a581b9c core: drop implicit support of PrivateTmp=off
Follow-up for 0e551b04ef.

Similar to the previous commit, but for PrivateTmp=.
2024-10-09 08:11:42 +09:00
Yu Watanabe
edd3f4d9b7 core: drop implicit support of PrivateUsers=off
Follow-up for fa693fdc7e.

The documentation says the option takes a boolean or one of the "self"
and "identity". But the parser uses private_users_from_string() which
also accepts "off". Let's drop the implicit support of "off".
2024-10-09 05:39:54 +09:00
Ryan Wilson
3543456f84 Add ExtraFileDescriptor property to StartTransientUnit dbus API
This adds the ExtraFileDescriptor property to StartTransientUnit dbus API
with format "a(hs)" - array of (file descriptor, name) pairs. The FD
will be passed to the unit via sd_notify like Socket and OpenFile.

systemctl show also shows ExtraFileDescriptorName for these transient
units. We only show the name passed to dbus as the FD numbers will
change once passed over the unix socket and are duplicated, so it's
confusing to display the numbers.

We do not add this functionality for systemd-run or general systemd
service units as it is not useful for general systemd services.
Arguably, it could be useful for systemd-run in bash scripts but we
prefer to be cautious and not expose the API yet.

Fixes: #34396
2024-10-07 09:01:48 -07:00
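On the receiving side, a unit would look up its descriptors by name rather than by number. The sketch below assumes the descriptors arrive through the usual socket-activation environment variables (LISTEN_PID, LISTEN_FDS, LISTEN_FDNAMES); that delivery mechanism is an assumption here, and `named_fds` is a hypothetical helper.

```python
import os

SD_LISTEN_FDS_START = 3  # first passed descriptor in the socket-activation protocol


def named_fds(environ=os.environ, pid=None):
    """Return [(fd, name), ...] for descriptors passed to this process."""
    pid = os.getpid() if pid is None else pid
    if environ.get("LISTEN_PID") != str(pid):
        return []  # variables are unset or meant for another process
    count = int(environ.get("LISTEN_FDS", "0"))
    raw = environ.get("LISTEN_FDNAMES", "")
    names = raw.split(":") if raw else []
    if len(names) != count:
        # Assumption: fall back to a placeholder when names are missing.
        names = ["unknown"] * count
    return [(SD_LISTEN_FDS_START + i, names[i]) for i in range(count)]
```

The result mirrors the "a(hs)" shape of the property: a list of (file descriptor, name) pairs, with numbers that differ from those on the sending side.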
Mike Yuan
3f8999a76e fs-util: rename laccess to access_nofollow
In order to distinguish it from libc function naming.
2024-10-05 01:30:43 +02:00
Daan De Meyer
fa693fdc7e core: Add support for PrivateUsers=identity
This configures an identity mapping similar to
systemd-nspawn --private-users=identity.
2024-09-09 18:31:01 +02:00
Lennart Poettering
41902bacc3 Merge pull request #34256 from YHNdnzj/pid1-followup
core: follow-ups for recent PRs
2024-09-05 17:01:10 +02:00
Mike Yuan
7a9f0125bb core: rename BindJournalSockets= to BindLogSockets=
Addresses https://github.com/systemd/systemd/pull/32487#issuecomment-2328465309
2024-09-04 21:44:25 +02:00
Mike Yuan
7583859ba8 core/exec-invoke: use bind_mount_add() where appropriate
2024-09-04 21:44:24 +02:00
Daan De Meyer
b1cfa93080 copy: Introduce COPY_NOCOW_AFTER and use it when copying images
When dealing with copying COW images, we have to make a tradeoff:

- Either we don't touch the NOCOW bit on the copied file and get
  an instant copy because we're able to reflink, but we might get
  reduced performance if the source file was COW as COW files and lots
  of random writes don't play well together.
- Or we force NOCOW for the copied file, which means we have to do a
  full copy as reflinking from COW files to NOCOW files or vice versa
  is not supported.

In exec-invoke.c, we've opted for the first option. In nspawn.c and
discover-image.c, we've opted for the second option.

In nspawn, this applies to the --ephemeral option to make ephemeral
copies. In discover-image.c, this applies to cloning images into
/var/lib/machines. Both these features might be used to run many
machines of the same original image. We really don't want to force
a full copy onto users in these scenarios when they're expecting
reflink behavior, leading to them running out of disk space. Instead,
degraded performance in their machines is a much less severe issue,
which they will discover on their own if it affects them. They can
then make their original image NOCOW to get both the reflinks and
better performance.

Given the above reasoning, let's switch nspawn.c and discover-image.c
to use COPY_NOCOW_AFTER as well instead of enabling NOCOW upfront and
forcing a copy if the original source image is COW.
2024-09-04 19:23:16 +02:00
Daan De Meyer
519216b71f Revert "tree-wide: Don't explicitly disable copy-on-write when copying images"
Let's still try to disable COW after copying. It won't do much, but
it doesn't hurt either.

See https://github.com/systemd/systemd/pull/33825/files#r1727288871.

This reverts commit 42e9288180.
2024-09-04 18:49:05 +02:00
Mike Yuan
368a3071e9 core: introduce BindJournalSockets=
Closes #32478
2024-09-03 21:04:50 +02:00
Lennart Poettering
9c0aee7cbb exec-invoke: remove redundant empty lines
2024-08-27 16:20:23 +02:00
Luca Boccassi
7d8bbfbe08 service: add 'debug' option to RestartMode=
One of the major pain points of managing fleets of headless nodes is
that when something fails at startup, unless debug level was already
enabled (which it usually isn't, as it's a firehose), one needs to manually
enable it and pray the issue can be reproduced, which often is really
hard and time consuming, just to get extra info. Usually the extra log
messages are enough to triage an issue.

This new option makes it so that when a service fails and is restarted
due to Restart=, log level for that unit is set to debug, so that all
setup code in pid1 and sd-executor logs at debug level, and also a new
DEBUG_INVOCATION=1 env var is passed to the service itself, so that it
knows it should start with a higher log level. Once the unit succeeds
or reaches the rate limit the original level is restored.
2024-08-27 12:24:45 +01:00
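A service can honour the new variable with a few lines at startup. A minimal sketch, where `pick_log_level` is a hypothetical helper and the INFO default is an arbitrary choice:

```python
import logging
import os


def pick_log_level(environ=os.environ, default=logging.INFO):
    """Use debug logging when the service manager sets DEBUG_INVOCATION=1."""
    if environ.get("DEBUG_INVOCATION") == "1":
        return logging.DEBUG
    return default


if __name__ == "__main__":
    logging.basicConfig(level=pick_log_level())
    logging.debug("only visible on a debug invocation")
    logging.info("service starting")
```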
Ivan Shapovalov
b73c86c695 core/exec-invoke: document calling setpriority() after sched_setattr()
Fixes: 711a157738 ("core/exec-invoke: call setpriority() after sched_setattr()")
2024-08-22 04:25:29 +09:00
Lennart Poettering
300b7e7620 tree-wide: use isatty_safe() more
2024-08-20 11:11:53 +02:00
Ivan Shapovalov
711a157738 core/exec-invoke: call setpriority() after sched_setattr()
The nice value is part of struct sched_attr, and consequently invoking
sched_setattr() after setpriority() would clobber the nice value with
the default (as we are not setting it in struct sched_attr).

It would be best to combine both calls, but for now simply invoke
setpriority() after sched_setattr() to make sure Nice= remains effective
when used together with CPUSchedulingPolicy=.
2024-08-10 19:09:14 +02:00
Mike Yuan
3386f66200 cgroup-setup: drop unused cg_migrate_callback for cg_attach_everywhere()
While at it, move the typedef from cgroup-util to -setup.
2024-08-02 14:47:39 +02:00
Łukasz Stelmach
18d51ec876 Revert "execute: Call capability_ambient_set_apply even if ambient set is 0"
With ambient capabilities being dropped at the start of process managers
(both system and user) as well as systemd-executor it isn't necessary
to drop them here. Moreover, at this point also the inheritable set can
be preserved. This makes it possible to assign a user session manager
inheritable capabilities which combined with file capabilities (ei sets)
of service executables enable running user services with capabilities
but only when started by the manager.

This reverts commit 943800f4e7.
2024-07-31 11:09:58 +02:00
Daan De Meyer
42e9288180 tree-wide: Don't explicitly disable copy-on-write when copying images
Since the copy helpers now copy file attributes as well, let's not
explicitly disable copy-on-write anymore when we copy an image. If
the source already has copy-on-write disabled, the copy will have it
disabled as well. Otherwise, the copy will also have copy-on-write
enabled.

This makes sure that reflinks always work as reflink is only supported
if both source and target are copy-on-write or both source and target
are not copy-on-write.
2024-07-25 11:56:07 +02:00
Lennart Poettering
e846854172 execute: add FIXME comment
As requested by @YHNdnzj:

https://github.com/systemd/systemd/pull/33707#discussion_r1684055699
2024-07-19 18:59:01 +02:00
Lennart Poettering
e2d66781ee exec-invoke: use EBADF where appropriate
2024-07-19 11:44:04 +02:00