Let's bump the kernel baseline a bit to 4.3 and thus require ambient
caps.
This allows us to remove support for a variety of special casing, most
importantly the ExecStart=!! hack.
To be able to run systemd in a Type=notify transient unit, the notify
socket can't be bind mounted to /run/systemd/notify as systemd in the
transient unit wants to use that as its own notify socket which
conflicts with systemd on the host.
Instead, for sandboxed units, let's bind mount the notify socket to
/run/host/notify as documented in the container interface. Since we
don't guarantee a stable location for the notify socket and insist users
use $NOTIFY_SOCKET to get its path, this is safe to do.
To be able to run systemd in a Type=notify transient unit, the notify
socket can't be bind mounted to /run/systemd/notify as systemd in the
transient unit wants to use that as its own notify socket which conflicts
with systemd on the host.
Instead, for sandboxed units, let's bind mount the notify socket to
/run/host/notify as documented in the container interface. Since we don't
guarantee a stable location for the notify socket and insist users use
$NOTIFY_SOCKET to get its path, this is safe to do.
Recently, PrivateUsers=identity was added to support mapping the first
65536 UIDs/GIDs from parent to the child namespace and mapping the other
UID/GIDs to the nobody user.
However, there are use cases where users have UIDs/GIDs > 65536 and need
to do a similar identity mapping. Moreover, in some of those cases,
users want a full identity mapping from 0 -> UID_MAX.
To support this, we add PrivateUsers=full that does identity mapping for
all available UID/GIDs.
Note to differentiate ourselves from the init user namespace, we need to
set up the uid_map/gid_map like:
```
0 0 1
1 1 UINT32_MAX - 1
```
as the init user namedspace uses `0 0 UINT32_MAX` and some applications
- like systemd itself - determine if its a non-init user namespace based
on uid_map/gid_map files.
Note systemd will remove this heuristic in running_in_userns() in
version 258 (https://github.com/systemd/systemd/pull/35382) and uses
namespace inode. But some users may be running a container image with
older systemd < 258 so we keep this hack until version 259 for version
N-1 compatibility.
In addition to mapping the whole UID/GID space, we also set
/proc/pid/setgroups to "allow". While we usually set "deny" to avoid
security issues with dropping supplementary groups
(https://lwn.net/Articles/626665/), this ends up breaking dbus-broker
when running /sbin/init in full OS containers.
Fixes: #35168Fixes: #35425
When trying to run dbus-broker in a systemd unit with PrivateUsers=full,
we see dbus-broker fails with EPERM at `util_audit_drop_permissions`.
The root cause is dbus-broker calls the setgroups() system call and this
is disallowed via systemd's implementation of PrivateUsers= by setting
/proc/pid/setgroups = deny. This is done to remediate potential privilege
escalation vulnerabilities in user namespaces where an attacker can remove
supplementary groups and gain access to resources where those groups are
restricted.
However, for OS-like containers, setgroups() is a pretty common API and
disabling it is not feasible. So we allow setgroups() by setting
/proc/pid/setgroups to allow in PrivateUsers=full. Note security conscious
users can still use SystemCallFilter= to disable setgroups() if they want
to specifically prevent this system call.
Fixes: #35425
This allows an option for systemd exec units to enable UTS namespaces
but not restrict changing hostname via seccomp. Thus, units can change
hostname without affecting the host.
Fixes: #30348
Migrating ProtectHostname to enum will set the stage for adding more
properties like ProtectHostname=private in future commits.
In addition, we add PrivateHostnameEx property to dbus API which uses
string instead of boolean.
Recently, PrivateUsers=identity was added to support mapping the first
65536 UIDs/GIDs from parent to the child namespace and mapping the other
UID/GIDs to the nobody user.
However, there are use cases where users have UIDs/GIDs > 65536 and need
to do a similar identity mapping. Moreover, in some of those cases, users
want a full identity mapping from 0 -> UID_MAX.
Note to differentiate ourselves from the init user namespace, we need to
set up the uid_map/gid_map like:
```
0 0 1
1 1 UINT32_MAX - 1
```
as the init user namedspace uses `0 0 UINT32_MAX` and some applications -
like systemd itself - determine if its a non-init user namespace based on
uid_map/gid_map files. Note systemd will remove this heuristic in
running_in_userns() in version 258 and uses namespace inode. But some users
may be running a container image with older systemd < 258 so we keep this
hack until version 259.
To support this, we add PrivateUsers=full that does identity mapping for
all available UID/GIDs.
Fixes: #35168
Currently, get_fixed_user() employs USER_CREDS_SUPPRESS_PLACEHOLDER,
meaning home path is set to NULL if it's empty or root. However,
the path is also used for applying WorkingDirectory=~, and we'd
spuriously use the invoking user's home as fallback even if
User= is changed in that case.
Let's instead delegate such suppression to build_environment(),
so that home is proper initialized for usage at other steps.
shell doesn't actually suffer from such problem, but it's changed
too for consistency.
Alternative to #34789
This new setting allows unsharing the pid namespace in a unit. Because
you have to fork to get a process into a pid namespace, we fork in
systemd-executor to get into the new pid namespace. The parent then
sends the pid of the child process back to the manager and exits while
the child process continues on with the rest of exec_invoke() and then
executes the actual payload.
Communicating the child pid is done via a new pidref socket pair that is
set up on manager startup.
We unshare the PID namespace right before the mount namespace so we
mount procfs correctly. Note PrivatePIDs=yes always implies MountAPIVFS=yes
to mount procfs.
When running unprivileged in a user session, user namespace is set up first
to allow for PID namespace to be unshared. However, when running in
privileged mode, we unshare the user namespace last to ensure the user
namespace does not own the PID namespace and cannot break out of the sandbox.
Note we disallow Type=forking services from using PrivatePIDs=yes since the
init proess inside the PID namespace must not exit for other processes in
the namespace to exist.
Note Daan De Meyer did the original work for this commit with Ryan Wilson
addressing follow-ups.
Co-authored-by: Daan De Meyer <daan.j.demeyer@gmail.com>
When an exec directory is shared between services, this allows one of the
service to be the producer of files, and the other the consumer, without
letting the consumer modify the shared files.
This will be especially useful in conjunction with id-mapped exec directories
so that fully sandboxed services can share directories in one direction, safely.
Let's make ConfigurationDirectory= a bit less "special-casey", by hiding
the fact that it's the only per-service dir we do not do chown()ing for
inside of a new EXEC_DIRECTORY_TYPE_SHALL_CHOWN() helper.
This commit adds two settings private and strict to
the ProtectControlGroups= property. Private will unshare the cgroup
namespace and mount a read-write private cgroup2 filesystem at /sys/fs/cgroup.
Strict does the same except the mount is read-only. Since the unit is
running in a cgroup namespace, the new root of /sys/fs/cgroup is the unit's
own cgroup.
We also add a new dbus property ProtectControlGroupsEx which accepts strings
instead of boolean. This will allow users to use private/strict via dbus
and systemd-run in addition to service files.
Note private and strict fall back to no and yes respectively if the kernel
doesn't support cgroup2 or system is not using unified hierarchy.
Fixes: #34634
This commit refactors ProtectControlGroups= from using a boolean
in the dbus/execute backend to using an enum. There is no functional
change but this will allow adding new non-boolean values (e.g. strict,
private) a la PrivateHome.
A colleague reported when RootDirectory= does not exist, systemd reports an error like:
```
Failed to set up mount namespacing: No such file or directory
```
Unfortunately, with large spec files, it can be hard to diagnose which path systemd is talking
about. Thus, to make the error message more helpful and similar to mount error messages, we add
the root directory/image path into the error message like:
```
Failed to set up mount namespacing: /tmp/thisdoesnotexist: No such file or directory
```
No functional change, but let's print yes/no rather than on/off in systemd-analyze.
Similar to 2e8a581b9c and
edd3f4d9b7.
(Note, the commit messages of those commits are wrong, as
parse_boolean() supports on/off anyway.)
When PAMName= is set this should be enough to go through our entire user
changing story, so that PAM is definitely run, and environment variables
definitely pulled in and so on.
Previously, it would happen that under some circumstances we might no do
this when transitioning from root to root itself even though PAM was
enabled.
Fixes: #34682
Follow-up for fa693fdc7e.
The documentation says the option takes a boolean or one of the "self"
and "identity". But the parser uses private_users_from_string() which
also accepts "off". Let's drop the implicit support of "off".
This adds the ExtraFileDescriptor property to StartTransient dbus API
with format "a(hs)" - array of (file descriptor, name) pairs. The FD
will be passed to the unit via sd_notify like Socket and OpenFile.
systemctl show also shows ExtraFileDescriptorName for these transient
units. We only show the name passed to dbus as the FD numbers will
change once passed over the unix socket and are duplicated, so its
confusing to display the numbers.
We do not add this functionality for systemd-run or general systemd
service units as it is not useful for general systemd services.
Arguably, it could be useful for systemd-run in bash scripts but we
prefer to be cautious and not expose the API yet.
Fixes: #34396
When dealing with copying COW images, we have to make a tradeoff:
- Either we don't touch the NOCOW bit on the copied file COW and get
an instant copy because we're able to reflink, but we might get
reduced performance if the source file was COW as COW files and lots
of random writes don't play well together.
- Or we force NOCOW for the copied file, which means we have to do a
full copy as reflinking from COW files to NOCOW files or vice versa
is not supported.
In exec-invoke.c, we've opted for the first option. In nspawn.c and
discover-image.c, we've opted for the second option.
In nspawn, this applies to the --ephemeral option to make ephemeral
copies. In discover-image.c, this applies to cloning images into
/var/lib/machines. Both these features might be used to run many
machines of the same original image. We really don't want to force
a full copy onto users in these scenarios when they're expecting
reflink behavior, leading to them running out of disk space. Instead,
degraded performance in their machines is a much less severe issue,
which they will discover on their own if it affects them, at which
point they can make their original image NOCOW at which point they'll
get both the reflinks and better performance.
Given the above reasoning, let's switch nspawn.c and discover-image.c
to use COPY_NOCOW_AFTER as well instead of enabling NOCOW upfront and
forcing a copy if the original source image is COW.
One of the major pait points of managing fleets of headless nodes is
that when something fails at startup, unless debug level was already
enabled (which usually isn't, as it's a firehose), one needs to manually
enable it and pray the issue can be reproduced, which often is really
hard and time consuming, just to get extra info. Usually the extra log
messages are enough to triage an issue.
This new option makes it so that when a service fails and is restarted
due to Restart=, log level for that unit is set to debug, so that all
setup code in pid1 and sd-executor logs at debug level, and also a new
DEBUG_INVOCATION=1 env var is passed to the service itself, so that it
knows it should start with a higher log level. Once the unit succeeds
or reaches the rate limit the original level is restored.
The nice value is part of struct sched_attr, and consequently invoking
sched_setattr() after setpriority() would clobber the nice value with
the default (as we are not setting it in struct sched_attr).
It would be best to combine both calls, but for now simply invoke
setpriority() after sched_setattr() to make sure Nice= remains effective
when used together with CPUSchedulingPolicy=.
With ambient capabilities being dropped at the start of process managers
(both system and user) as well as systemd-executor it isn't necessary
to drop them here. Moreover, at this point also the inheritable set can
be preserved. This makes it possible to assign a user session manager
inheritable capabilities which combined with file capabilites (ei sets)
of service executables enable running user services with capabilities
but only when started by the manager.
This reverts commit 943800f4e7.
Since the copy helpers now copy file attributes as well, let's not
explicitly disable copy-on-write anymore when we copy an image. If
the source already has copy-on-write disabled, the copy will have it
disabled as well. Otherwise, the copy will also have copy-on-write
enabled.
This makes sure that reflinks always work as reflink is only supported
if both source and target are copy-on-write or both source and target
are not copy-on-write.