This allows a service to reuse the user namespace created for an
existing service, similarly to NetworkNamespacePath=. The configuration
is the initial user namespace (e.g. ID mapping) is preserved.
This will allow units (scopes/slices/services) to override the default
systemd-oomd setting DefaultMemoryPressureDurationSec=.
The semantics of ManagedOOMMemoryPressureDurationSec= are:
- If >= 1 second, overrides DefaultMemoryPressureDurationSec= from oomd.conf
- If is empty, uses DefaultMemoryPressureDurationSec= from oomd.conf
- Ignored if ManagedOOMMemoryPressure= is not "kill"
- Disallowed if < 1 second
Note the corresponding dbus property is DefaultMemoryPressureDurationUSec
which is in microseconds. This is consistent with other time-based
dbus properties.
By default, in instances where timers are running on a realtime schedule,
if a service takes longer to run than the interval of a timer, the
service will immediately start again when the previous invocation finishes.
This is caused by the fact that the next elapse is calculated based on
the last trigger time, which, combined with the fact that the interval
is shorter than the runtime of the service, causes that elapse to be in
the past, which in turn means the timer will trigger as soon as the
service finishes running.
This behavior can be changed by enabling the new DeferReactivation setting,
which will cause the next calendar elapse to be calculated based on when
the trigger unit enters inactivity, rather than the last trigger time.
Thus, if a timer is on an realtime interval, the trigger will always
adhere to that specified interval.
E.g. if you have a timer that runs on a minutely interval, the setting
guarantees that triggers will happen at *:*:00 times, whereas by default
this may skew depending on how long the service runs.
Co-authored-by: Matteo Croce <teknoraver@meta.com>
Multipath TCP (MPTCP), standardized in RFC8684 [1], is a TCP extension
that enables a TCP connection to use different paths. It allows a device
to make use of multiple interfaces at once to send and receive TCP
packets over a single MPTCP connection. MPTCP can aggregate the
bandwidth of multiple interfaces or prefer the one with the lowest
latency, it also allows a fail-over if one path is down, and the traffic
is seamlessly re-injected on other paths.
To benefit from MPTCP, both the client and the server have to support
it. Multipath TCP is a backward-compatible TCP extension that is enabled
by default on recent Linux distributions (Debian, Ubuntu, Redhat, ...).
Multipath TCP is included in the Linux kernel since version 5.6 [2]. To
use it on Linux, an application must explicitly enable it when creating
the socket:
int sd = socket(AF_INET(6), SOCK_STREAM, IPPROTO_MPTCP);
No need to change anything else in the application.
This patch allows MPTCP protocol in the Socket unit configuration. So
now, a <unit>.socket can contain this to use MPTCP instead of TCP:
[Socket]
SocketProtocol=mptcp
MPTCP support has been allowed similarly to what has been already done
to allow SCTP: just one line in core/socket.c, a very simple addition
thanks to the flexible architecture already in place.
On top of that, IPPROTO_MPTCP has also been added in the list of allowed
protocols in two other places, and in the doc. It has also been added to
the missing_network.h file, for systems with an old libc -- note that it
was also required to include <netinet/in.h> in this file to avoid
redefinition errors.
Link: https://www.rfc-editor.org/rfc/rfc8684.html [1]
Link: https://www.mptcp.dev [2]
Required for integration tests to power off on PID 1 crashes. We
deprecate systemd.crash_reboot and related options by removing them
from the documentation but still parsing them.
Today listen file descriptors created by socket unit don't get passed to
commands in Exec{Start,Stop}{Pre,Post}= socket options.
This prevents ExecXYZ= commands from accessing the created socket FDs to do
any kind of system setup which involves the socket but is not covered by
existing socket unit options.
One concrete example is to insert a socket FD into a BPF map capable of
holding socket references, such as BPF sockmap/sockhash [1] or
reuseport_sockarray [2]. Or, similarly, send the file descriptor with
SCM_RIGHTS to another process, which has access to a BPF map for storing
sockets.
To unblock this use case, pass ListenXYZ= file descriptors to ExecXYZ=
commands as listen FDs [4]. As an exception, ExecStartPre= command does not
inherit any file descriptors because it gets invoked before the listen FDs
are created.
This new behavior can potentially break existing configurations. Commands
invoked from ExecXYZ= might not expect to inherit file descriptors through
sd_listen_fds protocol.
To prevent breakage, add a new socket unit parameter,
PassFileDescriptorsToExec=, to control whether ExecXYZ= programs inherit
listen FDs.
[1] https://docs.kernel.org/bpf/map_sockmap.html
[2] https://lore.kernel.org/r/20180808075917.3009181-1-kafai@fb.com
[3] https://man.archlinux.org/man/socket.7#SO_INCOMING_CPU
[4] https://www.freedesktop.org/software/systemd/man/latest/sd_listen_fds.html
This commit adds a new way of forwarding journal messages - forwarding
over a socket.
The socket can be any of AF_INET, AF_INET6, AF_UNIUX or AF_VSOCK.
The address to connect to is retrieved from the "journald.forward_address" credential.
It can also be specified in systemd-journald's unit file with ForwardAddress=
This is the equivalent of RequiresMountsFor=, but adds Wants= instead
of Requires=. It will be useful for example for the autogenerated
systemd-cryptsetup units.
Fixes https://github.com/systemd/systemd/issues/11646
We don't care about the ordering, so we may just as well drop the numerical
prefixes that we normally use for sorting. Also rename some other samples
to keep width of output down to reasonable width.
This setting allows services to run in an ephemeral copy of the root
directory or root image. To make sure the ephemeral copies are always
cleaned up, we add a tmpfiles snippet to unconditionally clean up
/var/lib/systemd/ephemeral. To prevent in use ephemeral copies from
being cleaned up by tmpfiles, we use the newly added COPY_LOCK_BSD
and BTRFS_SNAPSHOT_LOCK_BSD flags to take a BSD lock on the ephemeral
copies which instruct tmpfiles to not touch those ephemeral copies as
long as the BSD lock is held.
Define new unit parameter (LogFilterPatterns) to filter logs processed by
journald.
This option is used to store a regular expression which is carried from
PID1 to systemd-journald through a cgroup xattrs:
`user.journald_log_filter_patterns`.
The lists of directives for fuzzer tests are maintained manually in the
repo. There is a tools/check-directives.sh script that runs during test
phase and reports stale directive lists.
Let's rework the script into a generator so that these directive files
are created on-the-flight and needn't be updated whenever a unit file
directives change. The scripts is rewritten in Python to get rid of gawk
dependency and each generated file is a separate meson target so that
incremental builds refresh what is just necessary (and parallelize
(negligible)).
Note: test/fuzz/fuzz-unit-file/directives-all.slice is kept since there
is not automated way to generate it (it is not covered by the check
script neither).
This reverts PR #22587 and its follow-up commit. More specifically,
2299b1cae3 (partially),
e176f855278d5098d3fecc5aa24ba702147d42e0,
ceb46a31a01b3d3d1d6095d857e29ea214a2776b, and
51bb9076ab8c050bebb64db5035852385accda35.
The PR was merged without final approval, and has several issues:
- OSS fuzz reported issues in the conf parser,
- It calls synchrnous netlink call, it should not be especially in PID1,
- The importance of NFTSet for CGroup and DynamicUser may be
questionable, at least, there was no justification PID1 should support
it.
- For networkd, it should be implemented with Request object,
- There is no test for the feature.
Fixes#23711.
Fixes#23717.
Fixes#23719.
Fixes#23720.
Fixes#23721.
Fixes#23759.
New directive `DynamicUserNFTSet=` provides a method for integrating
configuration of dynamic users into firewall rules with NFT sets.
Example:
```
table inet filter {
set u {
typeof meta skuid
}
chain service_output {
meta skuid != @u drop
accept
}
}
```
```
/etc/systemd/system/dunft.service
[Service]
DynamicUser=yes
DynamicUserNFTSet=inet:filter:u
ExecStart=/bin/sleep 1000
[Install]
WantedBy=multi-user.target
```
```
$ sudo nft list set inet filter u
table inet filter {
set u {
typeof meta skuid
elements = { 64864 }
}
}
$ ps -n --format user,group,pid,command -p `pgrep sleep`
USER GROUP PID COMMAND
64864 64864 55158 /bin/sleep 1000
```
All wiki pages that contain a deprecation banner
pointing to systemd.io or manpages are updated to
point to their replacements directly.
Helpful command for identification of available links:
git grep freedesktop.org/wiki | \
sed "s#.*\(https://www.freedesktop.org/wiki[^ $<'\\\")]*\)\(.*\)#\\1#" | \
sort | uniq
Add support for managing and configuring watchdog pretimeout values if
the watchdog hardware supports it. The ping interval is adjusted to
account for a pretimeout so that it will still ping at half the timeout
interval before a pretimeout event would be triggered. By default the
pretimeout defaults to 0s or disabled.
The RuntimeWatchdogPreSec config option is added to allow the pretimeout
to be specified (similar to RuntimeWatchdogSec). The
RuntimeWatchdogPreUSec dbus property is added to override the pretimeout
value at runtime (similar to RuntimeWatchdogUSec). Setting the
pretimeout to 0s will disable the pretimeout.
In cbcdcaaa0e ("Add support for conditions on the machines firmware")
a new Firmware= directive was added for .netdev and .network files.
While it was also documented to work on .link files, in actual fact the
support was missing. Add that one extra line to make it work, and also
update the fuzzer directives.
Add the "Isolated" parameter in the *.network file, e.g.,
[Bridge]
Isolated=true|false
When the Isolated parameter is true, traffic coming out of this port
will only be forward to other ports whose Isolated parameter is false.
When Isolated is not specified, the port uses the kernel default
setting (false).
The "Isolated" parameter was introduced in Linux 4.19.
See man bridge(8) for more details.
But even though the kernel and bridge/iproute2 recognize the "Isolated"
parameter, systemd-networkd did not have a way to set it.
Add a new setting that follows the same principle and implementation
as ExtensionImages, but using directories as sources.
It will be used to implement support for extending portable images
with directories, since portable services can already use a directory
as root.
By default checks PSI on /proc/pressure, and causes a unit to be skipped
if the threshold is above the given configuration for the avg300
measurement.
Also allow to pass a custom timespan, and a particular slice unit to
check under.
Fixes#20139
This introduces `ExitType=main|cgroup` for services.
Similar to how `Type` specifies the launch of a service, `ExitType` is
concerned with how systemd determines that a service exited.
- If set to `main` (the current behavior), the service manager will consider
the unit stopped when the main process exits.
- The `cgroup` exit type is meant for applications whose forking model is not
known ahead of time and which might not have a specific main process.
The service will stay running as long as at least one process in the cgroup
is running. This is intended for transient or automatically generated
services, such as graphical applications inside of a desktop environment.
Motivation for this is #16805. The original PR (#18782) was reverted (#20073)
after realizing that the exit status of "the last process in the cgroup" can't
reliably be known (#19385)
This version instead uses the main process exit status if there is one and just
listens to the cgroup empty event otherwise.
The advantages of a service with `ExitType=cgroup` over scopes are:
- Integrated logging / stdout redirection
- Avoids the race / synchronisation issue between launch and scope creation
- More extensive use of drop-ins and thus distro-level configuration:
by moving from scopes to services we can have drop ins that will affect
properties that can only be set during service creation,
like `OOMPolicy` and security-related properties
- It makes systemd-xdg-autostart-generator usable by fixing [1], as obviously
only services can be used in the generator, not scopes.
[1] https://bugs.kde.org/show_bug.cgi?id=433299
Currently there does not exist a way to specify a path relative to which
all binaries executed by Exec should be found. The only way is to
specify the absolute path.
This change implements the functionality to specify a path relative to which
binaries executed by Exec*= can be found.
Closes#6308
This reverts commit cb0e818f7c.
After this was merged, some design and implementation issues were discovered,
see the discussion in #18782 and #19385. They certainly can be fixed, but so
far nobody has stepped up, and we're nearing a release. Hopefully, this feature
can be merged again after a rework.
Fixes#19345.
This is like a really strong version of Wants=, that keeps starting the
specified unit if it is ever found inactive.
This is an alternative to Restart= inside a unit, acknowledging the fact
that whether to keep restarting the unit is sometimes not a property of
the unit itself but the state of the system.
This implements a part of what #4263 requests. i.e. there's no
distinction between "always" and "opportunistic". We just dumbly
implement "always" and become active whenever we see no job queued for
an inactive unit that is supposed to be upheld.