All other resource control features work as 'best-effort', and failures
in applying them are handled gracefully. However, unlike the other features,
we tested if the BPF programs can be loaded and refuse execution on failure.
Moreover, the previous behavior of testing loading BPF programs had
inconsistency: the test was silently skipped if the cgroup for the unit does
not exist yet, but tested when the cgroup already exists.
Let's not handle failures in loading custom BPF programs as critical, but
gracefully ignore them, like we do for the other resource control features.
Follow-up for fab347489f.
This changes the instances of lexical to lexicographic, thus making it easier
to grep for instances of lexicographic order, since there's only one variant of
the word to consider.
Lexicographic is chosen since there are slightly fewer instances of lexical and
lexicographic seems a better fit than lexical after checking a few
dictionaries.
The words lexical, lexicographic, and lexicographical are synonyms in
computing, meaning an alphabetical order. Both the Oxford dictionary and
Merriam-Webster make no distinction between lexicographic and lexicographical,
with only Wiktionary adding a more precise meaning of
Meeting lexicographical standards or requirements; worthy of being included
in a dictionary. [1]
Since, outside of computing, lexicographic(al) has the more specific meaning
pertaining to lexicography, i.e. the editing or making of dictionaries [2], and
lexical only has this as a secondary meaning after its linguistic meaning [3],
lexicographic fits the meaning of including and ordering entries better.
[1] https://en.wiktionary.org/wiki/lexicographical#English
[2] https://www.merriam-webster.com/dictionary/lexicographic
[3] https://www.oed.com/dictionary/lexical_adj
Our baseline is v5.4 and cgroup v2 is enforced now,
which means CPU accounting is cheap everywhere without
requiring any controller, hence just remove the directive.
In the troff output, this doesn't seem to make any difference. But in the
html output, the whitespace is sometimes preserved, creating an additional
gap before the following content. Drop it everywhere to avoid this.
No functional change, but let's print yes/no rather than on/off in systemd-analyze.
Similar to 2e8a581b9c and
edd3f4d9b7.
(Note, the commit messages of those commits are wrong, as
parse_boolean() supports on/off anyway.)
This will allow units (scopes/slices/services) to override the default
systemd-oomd setting DefaultMemoryPressureDurationSec=.
The semantics of ManagedOOMMemoryPressureDurationSec= are:
- If >= 1 second, overrides DefaultMemoryPressureDurationSec= from oomd.conf
- If is empty, uses DefaultMemoryPressureDurationSec= from oomd.conf
- Ignored if ManagedOOMMemoryPressure= is not "kill"
- Disallowed if < 1 second
Note the corresponding dbus property is DefaultMemoryPressureDurationUSec
which is in microseconds. This is consistent with other time-based
dbus properties.
Certainly on systemd 252 at least a configuration of
```
MemorySwapMax=40%
```
is supported but this was missing from the man page.
Only MemoryMax was documented as supporting a %.
Global resource (whole system or root cg's (e.g. in a container)) is
also a well-defined limit for memory and tasks, take it into account
when calculating effective limits.
Users become perplexed when they run their workload in a unit with no
explicit limits configured (moreover, listing the limit property would
even show it's infinity) but they experience unexpected resource
limitation.
The memory and pid limits come as the most visible, therefore add new
unit read-only properties:
- EffectiveMemoryMax=,
- EffectiveMemoryHigh=,
- EffectiveTasksMax=.
These properties represent the most stringent limit systemd is aware of
for the given unit -- and that is typically(*) the effective value.
Implement the properties by simply traversing all parents in the
leaf-slice tree and picking the minimum value. Note that effective
limits are thus defined even for units that don't enable explicit
accounting (because of the hierarchy).
(*) The evasive case is when systemd runs in a cgroupns and cannot
reason about outer setup. Complete solution would need kernel support.
The benefit of using this setting is that user and group IDs, especially dynamic and random
IDs used by DynamicUser=, can be used in firewall configuration easily.
Example:
```
[Service]
NFTSet=user:inet:filter:serviceuser
```
Corresponding NFT rules:
```
table inet filter {
set serviceuser {
typeof meta skuid
}
chain service_output {
meta skuid @serviceuser accept
drop
}
}
```
```
$ cat /etc/systemd/system/dunft.service
[Service]
DynamicUser=yes
NFTSet=user:inet:filter:serviceuser
ExecStart=/bin/sleep 1000
[Install]
WantedBy=multi-user.target
$ sudo nft list set inet filter serviceuser
table inet filter {
set serviceuser {
typeof meta skuid
elements = { 64864 }
}
}
$ ps -n --format user,group,pid,command -p `systemctl show dunft.service -P MainPID`
USER GROUP PID COMMAND
64864 64864 55158 /bin/sleep 1000
```
New directive `NFTSet=` provides a method for integrating dynamic cgroup IDs
into firewall rules with NFT sets. The benefit of using this setting is to be
able to use control group as a selector in firewall rules easily and this in
turn allows more fine grained filtering. Also, NFT rules for cgroup matching
use numeric cgroup IDs, which change every time a service is restarted, making
them hard to use in systemd environment.
This option expects a whitespace separated list of NFT set definitions. Each
definition consists of a colon-separated tuple of source type (only "cgroup"),
NFT address family (one of "arp", "bridge", "inet", "ip", "ip6", or "netdev"),
table name and set name. The names of tables and sets must conform to lexical
restrictions of NFT table names. The type of the element used in the NFT filter
must be "cgroupsv2". When a control group for a unit is realized, the cgroup ID
will be appended to the NFT sets and it will be be removed when the control
group is removed. systemd only inserts elements to (or removes from) the sets,
so the related NFT rules, tables and sets must be prepared elsewhere in
advance. Failures to manage the sets will be ignored.
If the firewall rules are reinstalled so that the contents of NFT sets are
destroyed, command systemctl daemon-reload can be used to refill the sets.
Example:
```
table inet filter {
...
set timesyncd {
type cgroupsv2
}
chain ntp_output {
socket cgroupv2 != @timesyncd counter drop
accept
}
...
}
```
/etc/systemd/system/systemd-timesyncd.service.d/override.conf
```
[Service]
NFTSet=cgroup:inet:filter:timesyncd
```
```
$ sudo nft list set inet filter timesyncd
table inet filter {
set timesyncd {
type cgroupsv2
elements = { "system.slice/systemd-timesyncd.service" }
}
}
```
As I noticed a lot of missing information when trying to implement checking
for missing info. I reimplemented the version information script to be more
robust, and here is the result.
Follow up to ec07c3c80b
This tries to add information about when each option was added. It goes
back to version 183.
The version info is included from a separate file to allow generating it,
which would allow more control on the formatting of the final output.
Let's visually separate the options associated with cpu, io, memory, …
in subsections
This patch tries to be minimal. It just adds the section titles, and
does minimal reordering to make sure the options on the same kind of
resource are placed close to each other.
Doesn't really matter since the two unicode symbols are supposedly
equivalent, but let's better follow the unicode recommendations to
prefer greek small letter mu, as per:
https://www.unicode.org/reports/tr25
This implements a minimal subset of #24961, but in a lot more
restrictive way: we only allow one level of subcgroup (as that's enough
to address the no-processes in inner cgroups rule), and does not change
anything about threaded cgroup logic or similar, or make any of this new
behaviour mandatory.
All this does is this: all non-control processes we invoke for a unit
we'll invoke in a subgroup by the specified name.
We'll later port all our current services that use cgroup delegation
over to this, i.e. user@.service, systemd-nspawn@.service and
systemd-udevd.service.
This changes the PSI window duration we default to for watching memory
pressure events from 1s to 2s. This is because apparently the kernel
will soon disallow window durations other than 2s for unprivileged
processes.
Hence, we'll bump the threshold from 100m to 200ms, and the window from
1s to 2s.
For any user on a semi-recent kernel, effectively this setting is pointless.
We should deprecate it once not needed anymore for the v1 hierarchy. For
now, adjust the description.
When cpu controller is disabled, thing would often still behave as if
it was. And since the cpu controller can be enabled "magically" e.g. by
starting user@1000, add a note for users to be careful. Autogrouping
is described well in the man page, incl. how to enable or disable it,
so it should be enough to refer to that.
For CPUWeight=: there is an important distinction between our default of
[not set], and the kernel default of "100". Let's not say that our default
is "100" because then 'systemctl show' output is hard to explain.
For task accounting, it's the kernel that does the accounting, not systemd.
For a user, information which cgroup controllers are enabled based on
the unit configuration is rather important. Not only because it determines
what resource control is peformed by the kernel, but also because controllers
have a non-negligible cost, especially for deep nesting, and users may want
to *not* have controllers enabled.
Our documentation did its best to avoid the topic so far. This was partially
caused by the support for cgroup v1, which meant that any discussion of
controllers had to be conditional and messy. But v1 is deprecated on its way
out, so it should be fine to just describe what happens with v2.
The text is extended with a discussion of how controllers are enabled and
disabled, and an example, and for various settings that enable controllers
the relevant controller is now mentioned.
DeviceAllow= and others are applied to the whole cgroup via bpf, so
using '+' on an Exec line will not bypass them. Explain this in the
manpage.
Fixes https://github.com/systemd/systemd/issues/26035