Closes#3829
Alternative to #35417
I don't think the individual "WasOnDependencyCycle" attrs on units
are particularly helpful and comprehensible, as it's really about
the dep relationship between them. And as discussed, the dependency
cycle is not something persistent, rather local to the currently
loaded set of units and shall be reset with daemon-reload (see also
https://github.com/systemd/systemd/issues/35642#issuecomment-2591296586).
Hence, let's report system state as degraded and point users to
the involved transactions when ordering cycles are encountered instead.
Combined with log messages added in 6912eb315f
it should achieve the goal of making ordering cycles more observable,
while avoiding all sorts of subtle bookkeeping in the service manager.
The degraded state can be reset via the existing ResetFailed() manager-wide
method.
When Type=notify-reload got introduced, it wasn't intended to be
mutually exclusive with ExecReload=. However, currently ExecReload=
is immediately forked off after the service main process is signaled,
leaving states in between essentially undefined. Given so broken
it is I doubt any sane user is using this setup, hence I took a stab
to rework everything:
1. Extensions are refreshed (unchanged)
2. ExecReload= is forked off without signaling the process
3a. If RELOADING=1 is sent during the ExecReload= invocation,
we'd refrain from signaling the process again, instead
just transition to SERVICE_RELOAD_NOTIFY directly and
wait for READY=1
3b. If not, signal the process after ExecReload= finishes
(from now on the same as Type=notify-reload w/o ExecReload=)
4. To accomodate the use case of performing post-reload tasks,
ExecReloadPost= is introduced which executes after READY=1
The new model greatly simplifies things, as no control processes
will be around in SERVICE_RELOAD_SIGNAL and SERVICE_RELOAD_NOTIFY
states.
See also: https://github.com/systemd/systemd/issues/37515#issuecomment-2891229652
This allows a service to reuse the user namespace created for an
existing service, similarly to NetworkNamespacePath=. The configuration
is the initial user namespace (e.g. ID mapping) is preserved.
closes#37602, see there for extra motivation and considered
alternatives.
On typical systems, only few services need to create SUID/SGID files.
This often is limited to the user explicitly setting suid/sgid, the
`systemd-tmpfiles*` services, and the package manager. Allowing a
default to globally restrict creation of suid/sgid files makes it easier
to apply this restriction precisely.
## testing done
- built on aarch64-linux and x86_64-linux
- ran a VM test on x86_64-linux, checking for:
- VM system boots successfully
- defaults apply (both `yes`, `no`, and undefined)
- systemd tmpfiles can set suid/sgid on journal log path
- Other services explicitly defining `RestrictSUIDSGID=no` can create
suid files
Add a new option PrivateBPF= to mount a new instance of bpffs within a
namespace.
PrivateBPF= can be set to "no" to use the host bpffs in readonly mode
and "yes" to do a new mount.
The mount is done with the new fsopen()/fsmount() API because in future
we'll hook some commands between the two calls.
Alternative to b50f6dbe57
The commit naively returned early from socket_enter_running(), which however
is quite problematic, as the socket will be woken up over and over again
without doing a thing, until we eventually hit Poll/TriggerLimit*=.
On top of that it requires hacks to hold the start job for initrd-switch-root.service
up. Overall I doubt that is the right approach.
Let's instead hook this into our job engine, and try to activate
the service again when some other units are stopped. If all installed
jobs have been run yet we're still seeing the conflict or the manually
selected timeout is reached, fail the socket as before.
This controls the new SO_PASSRIGHTS socket option in kernel v6.16.
Note that I intentionally choose a different naming scheme than
Pass*=, since all other Pass*= options controls whether some extra
bits are attached to the message, while this one's about denying
file descriptor transfer and it feels more explicit this way.
And diverging from underlying socket option name is precedented
by Timestamping=. But happy to change it to just say PassRights=
if people disagree.
Our baseline is v5.4 and cgroup v2 is enforced now,
which means CPU accounting is cheap everywhere without
requiring any controller, hence just remove the directive.
The method is deprecated since 64f173324e
(v257) and announced that it will be removed in v258.
Let's remove it now.
This effectively reverts 84c01612de.
This delegates one or more namespaces to the service. Concretely,
this setting influences in which order we unshare namespaces. Delegated
namespaces are unshared *after* the user namespace is unshared. Other
namespaces are unshared *before* the user namespace is unshared.
Fixes#35369
09fbff57fc introduced new knob
for such functionality. However, that seems unnecessary.
The mount option string is ubiquitous in that all of fstab,
kernel cmdline, credentials, systemd-mount, ... speak it.
And we already have x-systemd.device-bound= that's parsed
by pid1 instead of fstab-generator. It feels hence more natural
for graceful options to be an extension of that, rather than
its own property.
There's also one nice side effect that the setting itself
is now more graceful for systemd versions not supporting
such feature.
This new setting can be used to specify mount options that shall only be
added to the mount option string if the kernel supports them.
This shall be used for adding "usrquota" to tmp.mount without breaking compat,
but is generally be useful.
When running unprivileged containers, we run into a scenario where an
unpriv owned cgroup has a subcgroup delegated to another user (i.e. the
container's own UIDs). When the owner of that cgroup dies without
cleaning it up then the unpriv service manager might encounter a cgroup
it cannot delete anymore.
Let's address that: let's expose a method call on the service manager
(primarly in PID1) that can be used to delete a subcgroup of a unit one
owns. This would then allow the unpriv service manager to ask the priv
service manager to get rid of such a cgroup.
This commit only adds the method call, the next commit then adds the
code that makes use of this.
Migrating ProtectHostname to enum will set the stage for adding more
properties like ProtectHostname=private in future commits.
In addition, we add PrivateHostnameEx property to dbus API which uses
string instead of boolean.
Some ambiguity (e.g., same-named man pages in multiple volumes)
makes it impossible to fully automate this, but the following
Python snippet (run inside the man/ directory of the systemd repo)
helped to generate the sed command lines (which were subsequently
manually reviewed, run and the false positives reverted):
from pathlib import Path
import lxml
from lxml import etree as ET
man2vol: dict[str, str] = {}
man2citerefs: dict[str, list] = {}
for file in Path(".").glob("*.xml"):
tree = ET.parse(file, lxml.etree.XMLParser(recover=True))
meta = tree.find("refmeta")
if meta is not None:
title = meta.findtext("refentrytitle")
if title is not None:
vol = meta.findtext("manvolnum")
if vol is not None:
man2vol[title] = vol
citerefs = list(tree.iter("citerefentry"))
if citerefs:
man2citerefs[title] = citerefs
for man, refs in man2citerefs.items():
for ref in refs:
title = ref.findtext("refentrytitle")
if title is not None:
has = ref.findtext("manvolnum")
try:
should_have = man2vol[title]
except KeyError: # Non-systemd man page reference? Ignore.
continue
if has != should_have:
print(
f"sed -i '\\|<citerefentry><refentrytitle>{title}"
f"</refentrytitle><manvolnum>{has}</manvolnum>"
f"</citerefentry>|s|<manvolnum>{has}</manvolnum>|"
f"<manvolnum>{should_have}</manvolnum>|' {man}.xml"
)
In the troff output, this doesn't seem to make any difference. But in the
html output, the whitespace is sometimes preserved, creating an additional
gap before the following content. Drop it everywhere to avoid this.
Let's systematically make sure that we link up the D-Bus interfaces from
the daemon man pages once in prose and once in short form at the bottom
("See Also"), for all daemons.
Also, add reverse links at the bottom of the D-Bus API docs.
Fixes: #34996
This new setting allows unsharing the pid namespace in a unit. Because
you have to fork to get a process into a pid namespace, we fork in
systemd-executor to get into the new pid namespace. The parent then
sends the pid of the child process back to the manager and exits while
the child process continues on with the rest of exec_invoke() and then
executes the actual payload.
Communicating the child pid is done via a new pidref socket pair that is
set up on manager startup.
We unshare the PID namespace right before the mount namespace so we
mount procfs correctly. Note PrivatePIDs=yes always implies MountAPIVFS=yes
to mount procfs.
When running unprivileged in a user session, user namespace is set up first
to allow for PID namespace to be unshared. However, when running in
privileged mode, we unshare the user namespace last to ensure the user
namespace does not own the PID namespace and cannot break out of the sandbox.
Note we disallow Type=forking services from using PrivatePIDs=yes since the
init proess inside the PID namespace must not exit for other processes in
the namespace to exist.
Note Daan De Meyer did the original work for this commit with Ryan Wilson
addressing follow-ups.
Co-authored-by: Daan De Meyer <daan.j.demeyer@gmail.com>
When an exec directory is shared between services, this allows one of the
service to be the producer of files, and the other the consumer, without
letting the consumer modify the shared files.
This will be especially useful in conjunction with id-mapped exec directories
so that fully sandboxed services can share directories in one direction, safely.
This commit adds two settings private and strict to
the ProtectControlGroups= property. Private will unshare the cgroup
namespace and mount a read-write private cgroup2 filesystem at /sys/fs/cgroup.
Strict does the same except the mount is read-only. Since the unit is
running in a cgroup namespace, the new root of /sys/fs/cgroup is the unit's
own cgroup.
We also add a new dbus property ProtectControlGroupsEx which accepts strings
instead of boolean. This will allow users to use private/strict via dbus
and systemd-run in addition to service files.
Note private and strict fall back to no and yes respectively if the kernel
doesn't support cgroup2 or system is not using unified hierarchy.
Fixes: #34634
This will allow units (scopes/slices/services) to override the default
systemd-oomd setting DefaultMemoryPressureDurationSec=.
The semantics of ManagedOOMMemoryPressureDurationSec= are:
- If >= 1 second, overrides DefaultMemoryPressureDurationSec= from oomd.conf
- If is empty, uses DefaultMemoryPressureDurationSec= from oomd.conf
- Ignored if ManagedOOMMemoryPressure= is not "kill"
- Disallowed if < 1 second
Note the corresponding dbus property is DefaultMemoryPressureDurationUSec
which is in microseconds. This is consistent with other time-based
dbus properties.
By default, in instances where timers are running on a realtime schedule,
if a service takes longer to run than the interval of a timer, the
service will immediately start again when the previous invocation finishes.
This is caused by the fact that the next elapse is calculated based on
the last trigger time, which, combined with the fact that the interval
is shorter than the runtime of the service, causes that elapse to be in
the past, which in turn means the timer will trigger as soon as the
service finishes running.
This behavior can be changed by enabling the new DeferReactivation setting,
which will cause the next calendar elapse to be calculated based on when
the trigger unit enters inactivity, rather than the last trigger time.
Thus, if a timer is on an realtime interval, the trigger will always
adhere to that specified interval.
E.g. if you have a timer that runs on a minutely interval, the setting
guarantees that triggers will happen at *:*:00 times, whereas by default
this may skew depending on how long the service runs.
Co-authored-by: Matteo Croce <teknoraver@meta.com>
The method was added with migration of resources in mind (e.g. process's
allocated memory will follow it to the new scope), however, such a
resource migration is not in cgroup semantics. The method may thus have
the intended users and others could be guided to StartTransientUnit().
Since this API was advertised in a regular release, start the removal
with a deprecation message to callers.
Eventually, the goal is to remove the method to clean up DBus API and
simplify code (removal of cgroup_context_copy()).
Part of DBus docs is retained to satisfy build checks.
This adds the ExtraFileDescriptor property to StartTransient dbus API
with format "a(hs)" - array of (file descriptor, name) pairs. The FD
will be passed to the unit via sd_notify like Socket and OpenFile.
systemctl show also shows ExtraFileDescriptorName for these transient
units. We only show the name passed to dbus as the FD numbers will
change once passed over the unix socket and are duplicated, so its
confusing to display the numbers.
We do not add this functionality for systemd-run or general systemd
service units as it is not useful for general systemd services.
Arguably, it could be useful for systemd-run in bash scripts but we
prefer to be cautious and not expose the API yet.
Fixes: #34396