This commit introduces all the logic to load and attach the BPF
programs to restrict network interfaces when a unit specifying it is
loaded.
Signed-off-by: Mauricio Vásquez <mauricio@kinvolk.io>
We recently started making more use of malloc_usable_size() and rely on
it (see the string_erase() story). Given that we don't really support
systems where malloc_usable_size() cannot be trusted beyond statistics
anyway, let's go fully in and rework GREEDY_REALLOC() on top of it:
instead of passing around and maintaining the currently allocated size
everywhere, let's just derive it automatically from
malloc_usable_size().
I am mostly after this for the simplicity this brings. It also brings
minor efficiency improvements I guess, but things become so much nicer
to look at if we can avoid these allocation size variables everywhere.
Note that the malloc_usable_size() man page says relying on it wasn't
"good programming practice", but I think it does this for reasons that
don't apply here: the greedy realloc logic specifically doesn't rely on
the returned extra size, beyond the fact that it is equal or larger than
what was requested.
(This commit was supposed to be a quick patch btw, but apparently we use
the greedy realloc stuff quite a bit across the codebase, so this ends
up touching *a*lot* of code.)
This should make it easier to remove those warnings when the compiler
gets smarter. Not sure if I got them all...
Double space before the comment start to make it easier to separate from the
preceding line.
Wherever we read virtual files we should use
read_full_virtual_file(), to make sure we get a consistent response,
given how weird the kernel's handling of partial reads on such file
systems is.
When the test suite is being run in a foreign environment,
/sys/fs/cgroup might not be set up in a way that we recognize.
Returning ENOMEDIUM causes the tests to be skipped in this case.
Bug: https://bugs.gentoo.org/771819
This fixes two checks where we compare string sizes when validating with
FILENAME_MAX. In both cases the check apparently wants to check if the
name fits in a filename, but that's not actually what FILENAME_MAX can
be used for, as it — in contrast to what the name suggests — actually
encodes the maximum length of a path.
In both cases the stricter change doesn't actually change much, but the
use of FILENAME_MAX is still misleading and typically wrong.
The systemd user instance assumed that the same controllers are available to it
as to PID 1. That is not true in general: in v1 (legacy, hybrid) we don't
delegate any controllers to anyone, and in v2 (unified) we may delegate only a
subset of controllers.
The user instance would fail silently when the controller cgroup cannot
be created or the controller cannot be enabled on the unified hierarchy.
The changes in 7b63961415 ("cgroup: Swap cgroup v1 deletion and
migration") caused some attempts to operate on non-delegated
controllers to be logged.
Make the user instance first check which controllers are available to it
and restrict operations to these controllers. The original checks are
kept in place.
Note that daemon-reexec needs to be invoked in order to update the set
of enabled controllers after a change.
Fixes: #18047
Fixes: #17862
The function controller_is_accessible() doesn't really do much in the case
of the unified hierarchy. Move the common parts into cg_get_path_and_check
and make the controller check v1-specific. This is a refactoring only.
There may be situations where a cgroup should be protected from killing
or deprioritized as a candidate. In FB oomd xattrs are used to bias oomd
away from supervisor cgroups and towards worker cgroups in container
tasks. On desktops this can be used to protect important units with
unpredictable resource consumption.
The patch allows systemd-oomd to understand 2 xattrs:
"user.oomd_avoid" and "user.oomd_omit". If systemd-oomd sees these
xattrs set to 1 on a candidate cgroup (i.e. while attempting to kill something)
AND the cgroup is owned by root, it will either deprioritize the cgroup as
a candidate (avoid) or remove it completely as a candidate (omit).
Usage is restricted to root owned cgroups to prevent situations where an
unprivileged user can set their own cgroups lower in the kill priority than
another user's (and prevent them from omitting their units from
systemd-oomd killing).
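The check described above could look roughly like the following sketch. It
is not the systemd-oomd code: the helper name cgroup_has_oomd_preference()
is hypothetical, and it assumes the Linux getxattr() from <sys/xattr.h>.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/xattr.h>

/* Hypothetical helper: returns true only if the cgroup directory at `path`
 * is owned by root and carries the xattr `name` with the value "1". */
bool cgroup_has_oomd_preference(const char *path, const char *name) {
        struct stat st;
        char value[2];
        ssize_t n;

        assert(path);
        assert(name);

        /* Missing cgroup, or not owned by root: the preference is ignored. */
        if (stat(path, &st) < 0 || st.st_uid != 0)
                return false;

        /* Longer values fail with ERANGE and are treated as "not set". */
        n = getxattr(path, name, value, sizeof(value));
        return n == 1 && value[0] == '1';
}
```

A caller would query "user.oomd_omit" first and drop the candidate if it is
set, then query "user.oomd_avoid" to deprioritize it.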
With cgroup v2 the cgroup freezer is implemented as a cgroup
attribute called cgroup.freeze. A cgroup can be frozen by writing "1"
to the file, and the kernel will send us a notification through
"cgroup.events" after the operation is finished and the processes in the
cgroup have entered a quiescent state, i.e. they are not scheduled to
run. Writing "0" to the attribute file does the inverse and process
execution is resumed.
This commit exposes the above low-level functionality through systemd's DBus
API. Each unit type must provide a specialized implementation for these
methods; otherwise we return an error. So far only the service, scope, and
slice unit types provide the support. It is possible to check whether a
given unit has the support using the CanFreeze() DBus property.
Note that the DBus API behaves synchronously: we dispatch the reply
to freeze/thaw requests only after the kernel has notified us that
the requested operation was completed.
Callers of cg_get_keyed_attribute_full() can now specify via a flag whether
missing keys in the cgroup attribute file are OK or not. Wrappers for both the
strict and the graceful version are also provided.
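To illustrate the strict/graceful distinction, here is a simplified
single-key sketch over the text of a keyed attribute file ("key value" per
line). The function name keyed_attribute_lookup(), its signature and its
return values are assumptions for illustration and do not mirror
cg_get_keyed_attribute_full().

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical lookup: returns 1 and copies the value into `ret` if `key` is
 * found. If the key is missing, returns 0 in graceful mode and -1 in strict
 * mode. Also returns -1 if the value does not fit into `ret`. */
int keyed_attribute_lookup(const char *contents, const char *key, bool graceful,
                           char *ret, size_t ret_size) {
        size_t klen;

        assert(contents);
        assert(key);
        assert(ret);
        assert(ret_size > 0);

        klen = strlen(key);

        for (const char *line = contents; *line; ) {
                size_t llen = strcspn(line, "\n");

                /* Match "key " exactly, so that "froze" does not match "frozen". */
                if (llen > klen && strncmp(line, key, klen) == 0 && line[klen] == ' ') {
                        size_t vlen = llen - klen - 1;

                        if (vlen >= ret_size)
                                return -1;
                        memcpy(ret, line + klen + 1, vlen);
                        ret[vlen] = '\0';
                        return 1;
                }

                line += llen;
                if (*line == '\n')
                        line++;
        }

        ret[0] = '\0';
        return graceful ? 0 : -1;
}
```

The strict and graceful wrappers then reduce to calling this with the flag
fixed to false or true.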
When nothing at all is mounted at /sys/fs/cgroup, the fs.f_type is
SYSFS_MAGIC (0x62656572) which results in the confusing debug log:
"Unknown filesystem type 62656572 mounted on /sys/fs/cgroup."
Instead, if the f_type is SYSFS_MAGIC, a more accurate message is:
"No filesystem is currently mounted on /sys/fs/cgroup."
A common pattern in the codebase is reading a cgroup memory value
and converting it to a uint64_t. Let's make it a helper and refactor a
few places to use it so it's more concise.
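Such a helper boils down to a small parsing step, sketched below. The name
parse_cgroup_memory_value() is hypothetical, and mapping the literal "max"
to UINT64_MAX is an assumption of this sketch, not something stated in the
patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical parser for the text read from a cgroup memory attribute.
 * Assumption: the literal "max" means "no limit" and maps to UINT64_MAX.
 * Returns 0 on success, -1 on malformed input. */
int parse_cgroup_memory_value(const char *s, uint64_t *ret) {
        unsigned long long v;
        size_t len;
        char *end;

        assert(s);
        assert(ret);

        len = strcspn(s, "\n");                 /* ignore the trailing newline */
        if (len == 3 && strncmp(s, "max", 3) == 0) {
                *ret = UINT64_MAX;
                return 0;
        }

        /* strtoull() would accept signs and leading blanks, so reject them here. */
        if (len == 0 || s[0] < '0' || s[0] > '9')
                return -1;

        errno = 0;
        v = strtoull(s, &end, 10);
        if (errno != 0 || end != s + len)
                return -1;                      /* overflow or trailing garbage */

        *ret = v;
        return 0;
}
```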
Introduce support for configuring cpus and mems for processes using
cgroup v2 CPUSET controller. This allows users to limit which cpus
and memory NUMA nodes can be used by processes to better utilize
system resources.
The cgroup v2 interfaces to control it are cpuset.cpus and cpuset.mems
where the requested configuration is written. However, it doesn't mean
that the requested configuration will be actually used as parent cgroup
may limit the cpus or mems as well. In order to reflect the real
configuration cgroup v2 provides read-only files cpuset.cpus.effective
and cpuset.mems.effective which are exported to users as well.
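The values in these files are lists of indexes and ranges such as "0-3,8".
A sketch of parsing that format into a bitmask follows; parse_cpuset_list()
is a hypothetical helper, limited to indexes below 64 for brevity, and is
not the parser used by the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical parser: convert a cpuset list such as "0-3,8" into a bitmask
 * with one bit per CPU or NUMA node. Returns 0 on success, -1 on error. */
int parse_cpuset_list(const char *s, uint64_t *ret) {
        uint64_t mask = 0;

        assert(s);
        assert(ret);

        while (*s && *s != '\n') {
                unsigned long lo, hi;
                char *end;

                if (*s < '0' || *s > '9')
                        return -1;
                lo = hi = strtoul(s, &end, 10);

                if (*end == '-') {              /* a range "lo-hi" */
                        s = end + 1;
                        if (*s < '0' || *s > '9')
                                return -1;
                        hi = strtoul(s, &end, 10);
                }

                if (lo > hi || hi >= 64)
                        return -1;
                for (unsigned long i = lo; i <= hi; i++)
                        mask |= UINT64_C(1) << i;

                if (*end == ',')
                        end++;
                else if (*end != '\0' && *end != '\n')
                        return -1;
                s = end;
        }

        *ret = mask;
        return 0;
}
```

Comparing the mask parsed from cpuset.cpus with the one parsed from
cpuset.cpus.effective shows whether a parent cgroup narrowed the request.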
This way less stuff needs to be in basic. Initially, I wanted to move all the
parts of cgroup-utils.[ch] that depend on efivars.[ch] to shared, because
efivars.[ch] is in shared/. Later on, I decided to split efivars.[ch], so the
move done in this patch is not necessary anymore. Nevertheless, it is still
valid on its own. If at some point we want to expose libbasic, it is better
not to have stuff that belongs in libshared there.
This avoids the use of the global variable.
Also rename cgroup_unified_update() to cgroup_unified_cached() and
cgroup_unified_flush() to cgroup_unified() to better reflect their new roles.