248 Commits

Author SHA1 Message Date
Mike Yuan
58686034eb core: expose transactions with ordering cycle
Closes #3829
Alternative to #35417

I don't think the individual "WasOnDependencyCycle" attrs on units
are particularly helpful and comprehensible, as it's really about
the dep relationship between them. And as discussed, the dependency
cycle is not something persistent, rather local to the currently
loaded set of units and shall be reset with daemon-reload (see also
https://github.com/systemd/systemd/issues/35642#issuecomment-2591296586).

Hence, let's report system state as degraded and point users to
the involved transactions when ordering cycles are encountered instead.
Combined with log messages added in 6912eb315f
it should achieve the goal of making ordering cycles more observable,
while avoiding all sorts of subtle bookkeeping in the service manager.
The degraded state can be reset via the existing ResetFailed() manager-wide
method.
2025-11-12 23:47:39 +01:00
Mike Yuan
b03e1b09af core/service: rework ExecReload= + Type=notify-reload interaction, add ExecReloadPost=
When Type=notify-reload got introduced, it wasn't intended to be
mutually exclusive with ExecReload=. However, currently ExecReload=
is immediately forked off after the service main process is signaled,
leaving states in between essentially undefined. Given so broken
it is I doubt any sane user is using this setup, hence I took a stab
to rework everything:

1.  Extensions are refreshed (unchanged)
2.  ExecReload= is forked off without signaling the process
3a. If RELOADING=1 is sent during the ExecReload= invocation,
    we'd refrain from signaling the process again, instead
    just transition to SERVICE_RELOAD_NOTIFY directly and
    wait for READY=1
3b. If not, signal the process after ExecReload= finishes
    (from now on the same as Type=notify-reload w/o ExecReload=)
4.  To accomodate the use case of performing post-reload tasks,
    ExecReloadPost= is introduced which executes after READY=1

The new model greatly simplifies things, as no control processes
will be around in SERVICE_RELOAD_SIGNAL and SERVICE_RELOAD_NOTIFY
states.

See also: https://github.com/systemd/systemd/issues/37515#issuecomment-2891229652
2025-11-04 12:18:33 +01:00
Mike Yuan
48632305c7 man/org.freedesktop.systemd1: fix typo (ExecStop -> -Post) 2025-11-04 12:17:33 +01:00
Quentin Deslandes
79dd24cf14 core: Add UserNamespacePath=
This allows a service to reuse the user namespace created for an
existing service, similarly to NetworkNamespacePath=. The configuration
is the initial user namespace (e.g. ID mapping) is preserved.
2025-11-04 10:55:04 +01:00
Yu Watanabe
63b27d1765 man: mention about UINT64_MAX value for OOMKills and ManagedOOMKills DBus properties
Follow-up for 9cf6ad16dd.
Addresses https://github.com/systemd/systemd/pull/38906#discussion_r2363291116.
2025-10-05 23:09:31 +09:00
Daan De Meyer
9cf6ad16dd core: Expose oom kills and managed oom kills as properties
It can be useful for users to know this information so let's expose
it as properties so it can be queried.
2025-09-19 13:54:54 +02:00
Brett Holman
04abe03189 man: correct the number of active unit states 2025-07-28 20:32:48 +01:00
Lennart Poettering
03b4a607f6 core: followups for the recent subgroup killing commits
This is a follow-up for 0f23564ad4 and
6b02854f50, as suggested here:

https://github.com/systemd/systemd/pull/37855#pullrequestreview-2997596953
2025-07-10 13:32:51 +09:00
Yu Watanabe
1cf5b39d64 core: add 'DefaultRestrictSUIDSGID' config option (#38126)
closes #37602, see there for extra motivation and considered
alternatives.

On typical systems, only few services need to create SUID/SGID files.
This often is limited to the user explicitly setting suid/sgid, the
`systemd-tmpfiles*` services, and the package manager. Allowing a
default to globally restrict creation of suid/sgid files makes it easier
to apply this restriction precisely.

## testing done
- built on aarch64-linux and x86_64-linux
- ran a VM test on x86_64-linux, checking for:
    - VM system boots successfully
    - defaults apply (both `yes`, `no`, and undefined)
    - systemd tmpfiles can set suid/sgid on journal log path
- Other services explicitly defining `RestrictSUIDSGID=no` can create
suid files
2025-07-10 13:30:07 +09:00
Grimmauld
97998d1cbe core/dbus-manager: Support 'DefaultRestrictSUIDSGID' option 2025-07-09 21:45:38 +02:00
Matteo Croce
ea9826eb94 core: add options to delegate BPFFS token creation
Add four new options BPFDelegate{Commands,Maps,Programs,Attachments}=
in order to delegate to a BPFFS instance the permission to create tokens.

The value is a list of options taken from:
https://github.com/torvalds/linux/blob/v6.14/include/uapi/linux/bpf.h#L922-L1121
The special value "any" means to allow every possible values.

More informations about BPF tokens here:
https://lwn.net/Articles/947173/
2025-07-08 22:35:29 +02:00
Matteo Croce
3a47437fc9 core: Introduce PrivateBPF= to mount a private BPFFS
Add a new option PrivateBPF= to mount a new instance of bpffs within a
namespace.
PrivateBPF= can be set to "no" to use the host bpffs in readonly mode
and "yes" to do a new mount.
The mount is done with the new fsopen()/fsmount() API because in future
we'll hook some commands between the two calls.
2025-07-08 22:33:28 +02:00
Lennart Poettering
0f23564ad4 pid1: add ability to kill processes in a subgroup of a unit
This is useful for things like machined, where the system machined wants
to manage a machine owned by the user somewhere down the tree.
2025-07-08 03:14:53 +02:00
Andres Beltran
26c6f3271a core: add quota support for State, Cache, and Log exec directories 2025-07-07 17:28:47 +00:00
Lennart Poettering
d03714e4e4 tree-wide: "human readable" → "human-readable"
Apparently, the spelling with a hyphen is better style in the English
language.

Suggested by: #36165
2025-07-07 11:21:25 +02:00
Mike Yuan
1b4ab5a209 core/socket: introduce DeferTrigger= and DeferTriggerMaxSec=
Alternative to b50f6dbe57

The commit naively returned early from socket_enter_running(), which however
is quite problematic, as the socket will be woken up over and over again
without doing a thing, until we eventually hit Poll/TriggerLimit*=.
On top of that it requires hacks to hold the start job for initrd-switch-root.service
up. Overall I doubt that is the right approach.

Let's instead hook this into our job engine, and try to activate
the service again when some other units are stopped. If all installed
jobs have been run yet we're still seeing the conflict or the manually
selected timeout is reached, fail the socket as before.
2025-06-30 13:10:43 +02:00
Lennart Poettering
99fd08224f man: add proper version info for RandomizedOffsetUSec
Follow-up for: #36437
Fixes: #37870
2025-06-26 17:28:49 +02:00
Lennart Poettering
279962a9e8 core/timer: Introduce RandomOffsetSec= knob (#36437)
This is like RandomDelaySec, but it doesn't reset whenever the manager
restarts.

Fixes https://github.com/systemd/systemd/issues/21166
2025-06-17 16:05:12 +02:00
Mike Yuan
5c12797fc3 core/socket: introduce AcceptFileDescriptors=
This controls the new SO_PASSRIGHTS socket option in kernel v6.16.
Note that I intentionally choose a different naming scheme than
Pass*=, since all other Pass*= options controls whether some extra
bits are attached to the message, while this one's about denying
file descriptor transfer and it feels more explicit this way.
And diverging from underlying socket option name is precedented
by Timestamping=. But happy to change it to just say PassRights=
if people disagree.
2025-06-17 13:16:42 +02:00
Mike Yuan
35462aa14a core/socket: add PassPIDFD= 2025-06-17 13:16:41 +02:00
Mike Yuan
b36ab0d4ce core/socket: don't suggest PassFileDescriptorsToExec= is a socket option
by not interleaving it among socket options.
2025-06-17 13:16:07 +02:00
Mike Yuan
29da53dde3 core: always enable CPU accounting
Our baseline is v5.4 and cgroup v2 is enforced now,
which means CPU accounting is cheap everywhere without
requiring any controller, hence just remove the directive.
2025-05-15 02:19:16 +02:00
Lennart Poettering
9ca16f6f18 pid1: add a concurrency limit to slice units
Fixes: #35862
2025-04-22 18:53:51 +02:00
Yu Watanabe
8c35e8a9d2 core: remove deprecated StartAuxiliaryScope() DBus method
The method is deprecated since 64f173324e
(v257) and announced that it will be removed in v258.
Let's remove it now.

This effectively reverts 84c01612de.
2025-04-22 09:02:45 +09:00
Yu Watanabe
db6986e02c core: deprecate CGroup v1 DBus properties 2025-04-15 22:34:22 +09:00
Luca Boccassi
b065ff03b1 man: fix typo in org.freedesktop.systemd1.xml 2025-03-24 18:25:29 +00:00
Daan De Meyer
8234cd9989 core: Add DelegateNamespaces=
This delegates one or more namespaces to the service. Concretely,
this setting influences in which order we unshare namespaces. Delegated
namespaces are unshared *after* the user namespace is unshared. Other
namespaces are unshared *before* the user namespace is unshared.

Fixes #35369
2025-03-01 13:54:58 +01:00
Adrian Vovk
9a0749c82b core/timer: Introduce RandomOffsetSec= knob
This is like RandomDelaySec, but it doesn't reset whenever the manager
restarts.

Fixes https://github.com/systemd/systemd/issues/21166
2025-02-18 19:16:57 -05:00
Marco Trevisan (Treviño)
bd887a75d4 man/org.freedesktop.systemd1.xml: Clarify the behavior of Subscribe()
It was unclear that it was applied to standard signals too, and this
lead to unexpected behavior.

See: https://github.com/systemd/systemd/pull/36366
2025-02-18 09:56:11 +00:00
Mike Yuan
0d76f1c423 core/mount: rework GracefulOptions= to be just x-systemd.graceful-option=
09fbff57fc introduced new knob
for such functionality. However, that seems unnecessary.

The mount option string is ubiquitous in that all of fstab,
kernel cmdline, credentials, systemd-mount, ... speak it.
And we already have x-systemd.device-bound= that's parsed
by pid1 instead of fstab-generator. It feels hence more natural
for graceful options to be an extension of that, rather than
its own property.

There's also one nice side effect that the setting itself
is now more graceful for systemd versions not supporting
such feature.
2025-02-12 18:16:44 +01:00
Mike Yuan
0fa062f983 core/dbus-mount: add missing ReloadResult and CleanResult properties 2025-02-12 15:34:54 +01:00
Lennart Poettering
09fbff57fc pid1: add GracefulOptions= setting to .mount units
This new setting can be used to specify mount options that shall only be
added to the mount option string if the kernel supports them.

This shall be used for adding "usrquota" to tmp.mount without breaking compat,
but is generally be useful.
2025-01-15 21:05:06 +01:00
Lennart Poettering
390dffb862 man: also fix documentation of start-limit-hit 2025-01-15 10:42:10 +01:00
Lennart Poettering
94634b4b03 pid1: add D-Bus API for removing delegated subcgroups
When running unprivileged containers, we run into a scenario where an
unpriv owned cgroup has a subcgroup delegated to another user (i.e. the
container's own UIDs). When the owner of that cgroup dies without
cleaning it up then the unpriv service manager might encounter a cgroup
it cannot delete anymore.

Let's address that: let's expose a method call on the service manager
(primarly in PID1) that can be used to delete a subcgroup of a unit one
owns. This would then allow the unpriv service manager to ask the priv
service manager to get rid of such a cgroup.

This commit only adds the method call, the next commit then adds the
code that makes use of this.
2025-01-08 15:27:25 +01:00
Jan Engelhardt
44855c77a1 man: expand word contractions
For written text, contractions are not normally used.
2024-12-25 17:00:31 +01:00
Jan Engelhardt
91dc2a52f5 man: grammar fixes: replace "respectively"
Unlike the German "bzw.", "respectively" cannot be used as an infix,
and is not abbreviated either.
2024-12-25 17:00:26 +01:00
Yu Watanabe
e76fcd0e40 core: make ProtectHostname= optionally take a hostname
Closes #35623.
2024-12-16 23:55:44 +09:00
Ryan Wilson
6746f28854 core: Migrate ProtectHostname to use enum vs boolean
Migrating ProtectHostname to enum will set the stage for adding more
properties like ProtectHostname=private in future commits.

In addition, we add PrivateHostnameEx property to dbus API which uses
string instead of boolean.
2024-12-06 13:33:49 -08:00
Štěpán Němec
597c6cc119 man: fix incorrect volume numbers in internal man page references
Some ambiguity (e.g., same-named man pages in multiple volumes)
makes it impossible to fully automate this, but the following
Python snippet (run inside the man/ directory of the systemd repo)
helped to generate the sed command lines (which were subsequently
manually reviewed, run and the false positives reverted):

from pathlib import Path

import lxml
from lxml import etree as ET

man2vol: dict[str, str] = {}
man2citerefs: dict[str, list] = {}

for file in Path(".").glob("*.xml"):
    tree = ET.parse(file, lxml.etree.XMLParser(recover=True))
    meta = tree.find("refmeta")
    if meta is not None:
        title = meta.findtext("refentrytitle")
        if title is not None:
            vol = meta.findtext("manvolnum")
            if vol is not None:
                man2vol[title] = vol
            citerefs = list(tree.iter("citerefentry"))
            if citerefs:
                man2citerefs[title] = citerefs

for man, refs in man2citerefs.items():
    for ref in refs:
        title = ref.findtext("refentrytitle")
        if title is not None:
            has = ref.findtext("manvolnum")
            try:
                should_have = man2vol[title]
            except KeyError:  # Non-systemd man page reference?  Ignore.
                continue
            if has != should_have:
                print(
                    f"sed -i '\\|<citerefentry><refentrytitle>{title}"
                    f"</refentrytitle><manvolnum>{has}</manvolnum>"
                    f"</citerefentry>|s|<manvolnum>{has}</manvolnum>|"
                    f"<manvolnum>{should_have}</manvolnum>|' {man}.xml"
                )
2024-11-11 20:31:08 +01:00
Zbigniew Jędrzejewski-Szmek
fe45f8dc9b man: drop whitespace from final <programlisting> lines
In the troff output, this doesn't seem to make any difference. But in the
html output, the whitespace is sometimes preserved, creating an additional
gap before the following content. Drop it everywhere to avoid this.
2024-11-08 14:14:36 +01:00
Lennart Poettering
607d297487 man: link up D-Bus API docs from daemon man pages
Let's systematically make sure that we link up the D-Bus interfaces from
the daemon man pages once in prose and once in short form at the bottom
("See Also"), for all daemons.

Also, add reverse links at the bottom of the D-Bus API docs.

Fixes: #34996
2024-11-05 22:57:51 +01:00
Daan De Meyer
406f177501 core: Introduce PrivatePIDs=
This new setting allows unsharing the pid namespace in a unit. Because
you have to fork to get a process into a pid namespace, we fork in
systemd-executor to get into the new pid namespace. The parent then
sends the pid of the child process back to the manager and exits while
the child process continues on with the rest of exec_invoke() and then
executes the actual payload.

Communicating the child pid is done via a new pidref socket pair that is
set up on manager startup.

We unshare the PID namespace right before the mount namespace so we
mount procfs correctly. Note PrivatePIDs=yes always implies MountAPIVFS=yes
to mount procfs.

When running unprivileged in a user session, user namespace is set up first
to allow for PID namespace to be unshared. However, when running in
privileged mode, we unshare the user namespace last to ensure the user
namespace does not own the PID namespace and cannot break out of the sandbox.

Note we disallow Type=forking services from using PrivatePIDs=yes since the
init proess inside the PID namespace must not exit for other processes in
the namespace to exist.

Note Daan De Meyer did the original work for this commit with Ryan Wilson
addressing follow-ups.

Co-authored-by: Daan De Meyer <daan.j.demeyer@gmail.com>
2024-11-05 05:32:02 -08:00
Luca Boccassi
890bdd1d77 core: add read-only flag for exec directories
When an exec directory is shared between services, this allows one of the
service to be the producer of files, and the other the consumer, without
letting the consumer modify the shared files.
This will be especially useful in conjunction with id-mapped exec directories
so that fully sandboxed services can share directories in one direction, safely.
2024-11-01 10:46:55 +00:00
Ryan Wilson
cd58b5a135 cgroup: Add support for ProtectControlGroups= private and strict
This commit adds two settings private and strict to
the ProtectControlGroups= property. Private will unshare the cgroup
namespace and mount a read-write private cgroup2 filesystem at /sys/fs/cgroup.
Strict does the same except the mount is read-only. Since the unit is
running in a cgroup namespace, the new root of /sys/fs/cgroup is the unit's
own cgroup.

We also add a new dbus property ProtectControlGroupsEx which accepts strings
instead of boolean. This will allow users to use private/strict via dbus
and systemd-run in addition to service files.

Note private and strict fall back to no and yes respectively if the kernel
doesn't support cgroup2 or system is not using unified hierarchy.

Fixes: #34634
2024-10-28 08:37:36 -07:00
Mike Yuan
7e40b51a2e man/org.freedesktop.systemd1: complete version info for ManagedOOMMemoryPressureDurationUSec
Follow-up for 63d4c4271c

Some unit types were left out.
2024-10-22 19:12:27 +02:00
Ryan Wilson
63d4c4271c cgroup: Add ManagedOOMMemoryPressureDurationSec= override setting for units
This will allow units (scopes/slices/services) to override the default
systemd-oomd setting DefaultMemoryPressureDurationSec=.

The semantics of ManagedOOMMemoryPressureDurationSec= are:
- If >= 1 second, overrides DefaultMemoryPressureDurationSec= from oomd.conf
- If is empty, uses DefaultMemoryPressureDurationSec= from oomd.conf
- Ignored if ManagedOOMMemoryPressure= is not "kill"
- Disallowed if < 1 second

Note the corresponding dbus property is DefaultMemoryPressureDurationUSec
which is in microseconds. This is consistent with other time-based
dbus properties.
2024-10-16 20:12:38 -07:00
Arthur Shau
cc0ab8c810 timer: introduce DeferReactivation setting
By default, in instances where timers are running on a realtime schedule,
if a service takes longer to run than the interval of a timer, the
service will immediately start again when the previous invocation finishes.
This is caused by the fact that the next elapse is calculated based on
the last trigger time, which, combined with the fact that the interval
is shorter than the runtime of the service, causes that elapse to be in
the past, which in turn means the timer will trigger as soon as the
service finishes running.

This behavior can be changed by enabling the new DeferReactivation setting,
which will cause the next calendar elapse to be calculated based on when
the trigger unit enters inactivity, rather than the last trigger time.

Thus, if a timer is on an realtime interval, the trigger will always
adhere to that specified interval.
E.g. if you have a timer that runs on a minutely interval, the setting
guarantees that triggers will happen at *:*:00 times, whereas by default
this may skew depending on how long the service runs.

Co-authored-by: Matteo Croce <teknoraver@meta.com>
2024-10-11 22:54:16 +02:00
Lennart Poettering
0aaacc3a10 Merge pull request #34593 from Werkov/deprecate-aux-scopes
core/manager: Deprecate StartAuxiliaryScope() method
2024-10-09 10:25:30 +02:00
Michal Koutný
64f173324e core/manager: Deprecate StartAuxiliaryScope() method
The method was added with migration of resources in mind (e.g. process's
allocated memory will follow it to the new scope), however, such a
resource migration is not in cgroup semantics. The method may thus have
the intended users and others could be guided to StartTransientUnit().

Since this API was advertised in a regular release, start the removal
with a deprecation message to callers.
Eventually, the goal is to remove the method to clean up DBus API and
simplify code (removal of cgroup_context_copy()).

Part of DBus docs is retained to satisfy build checks.
2024-10-08 17:49:13 +02:00
Ryan Wilson
3543456f84 Add ExtraFileDescriptor property to StartTransientUnit dbus API
This adds the ExtraFileDescriptor property to StartTransient dbus API
with format "a(hs)" - array of (file descriptor, name) pairs. The FD
will be passed to the unit via sd_notify like Socket and OpenFile.

systemctl show also shows ExtraFileDescriptorName for these transient
units. We only show the name passed to dbus as the FD numbers will
change once passed over the unix socket and are duplicated, so its
confusing to display the numbers.

We do not add this functionality for systemd-run or general systemd
service units as it is not useful for general systemd services.
Arguably, it could be useful for systemd-run in bash scripts but we
prefer to be cautious and not expose the API yet.

Fixes: #34396
2024-10-07 09:01:48 -07:00