Instead of mounting over, do an atomic swap using mount beneath, if
available. This way assets can be mounted again and again (e.g.:
updates) without leaking mounts.
For a given PID and namespace type, this helper function gives the PID
of the leader of the namespace containing the given PID. Use this in
systemd-coredump instead of using the existing get_mount_namespace_leader.
This helper will be used again in a later commit.
This setting indicates that the given unit wants to receive coredumps
for processes that crash within the cgroup of this unit. This setting
requires that Delegate= is also true, and therefore is only available
where Delegate= is available.
This will be used by systemd-coredump to support forwarding coredumps to
containers.
Currently we spawn services by forking a child process, doing a bunch
of work, and then exec'ing the service executable.
There are some advantages to this approach:
- quick: we immediately have access to all the enourmous amount of
state simply by virtue of sharing the memory with the parent
- easy to refactor and add features
- part of the same binary, will never be out of sync
There are however significant drawbacks:
- doing work after fork and before exec is against glibc's supported
case for several APIs we call
- copy-on-write trap: anytime any memory is touched in either parent
or child, a copy of that page will be triggered
- memory footprint of the child process will be memory footprint of
PID1, but using the cgroup memory limits of the unit
The last issue is especially problematic on resource constrained
systems where hard memory caps are enforced and swap is not allowed.
As soon as PID1 is under load, with no page out due to no swap, and a
service with a low MemoryMax= tries to start, hilarity ensues.
Add a new systemd-executor binary, that is able to receive all the
required state via memfd, deserialize it, prepare the appropriate
data structures and call exec_child.
Use posix_spawn which uses CLONE_VM + CLONE_VFORK, to ensure there is
no copy-on-write (same address space will be used, and parent process
will be frozen, until exec).
The sd-executor binary is pinned by FD on startup, so that we can
guarantee there will be no incompatibilities during upgrades.
This reworks the image discovery logic, and conceptually allows DDIs
that are both confext and sysext to exist. Previously we'd only extract
one type of exension data from a DDI, with this we allow to extract both
if both exist.
This doesn't add support for true "multi-modal" DDIs, that qualify as
various things at once, it just lays some ground work that ensures we at
least can dissect such images.
This reworks 484d26dac1 quite a bit.
This changes systemd-dissect's JSON output, but given the
version with the fields it changes/dops has never been released (as the
above patch was merged post-v254) this shouldn't be an issue.
We have the "tasks.max" cgroup attribute only if we run in a cgroup
namespace, but not on the host. Hence let's handle ENODATA silently
simply to reduce the debug noise generated.
Let's modernize and clean up search_and_fopen a bit: let's add support
for regular open() (instead of fopen()), as well as access() (if caller
just wants to check if a file exists without opening it.
This unifies much of the code involved, which previously was duplicated
in search_and_fopen() and search_and_fopen_nulstr()
mount_option_supported() will call fsopen() which will probe the
kernel filesystem module. This means that we'll suddenly start
probing filesystem modules when running generators as those determine
which mount options to use. To prevent generators from loading kernel
filesystem modules as much as possible, let's always first check the
hardcoded list of filesystem which we know support a feature before
falling back to asking the kernel.
systemd's own cgroup hierarchy is special to us, we use it to actually
manage processes. Because of that many calls tha apply to cgroups are
only ever called with the SYSTEMD_CGROUP_CONTROLLER as controller
argument. Let's hence remove the argument altogether.
This in particular touches the kill and xattr routines.
This changes no behaviour, we just drop an argument that is always set
to the same value anyway.
This is preparation to eventually getting rid of the cgroupvs1, because
on cgroupvs2 the cgroup paths do not change for different controllers,
there's only a single hierarchy there.
note that this slightly changes the semantic of assert when NDEBUG is
defined. if there's an extern function call (without attribute pure or
similar) then the compiler has to assume it has side effects and still
emit the function call.
whereas the old assert guaranteed that nothing will be evaluated on
NDEBUG.
Closes: https://github.com/systemd/systemd/issues/29408
When READ_FULL_FILE_UNBASE64 (or READ_FULL_FILE_UNHEX) is specified,
setting size argument by caller is difficult, as it is hard to estimate
the encoded length.
This makes when size is specified with decoding option, let's read file
more, and check decoded size later with the specified size.
This new helper can be used after reading process info from procfs, to
verify that the data that was just read actually matches the pidfd, and
does not belong to some new process that just reused the numeric PID of
the process we originally pinned.
Usually we want to embed PidRef in other structures, but sometimes it
makes sense to allocate it on the heap in case it should be used
standalone. Add helpers for that.
Primary usecase: use as key in Hashmap objects, that for example map
process to unit objects in PID 1.
This adds pidref_free()/pidref_freep() for freeing such an allocated
struct, as well as pidref_dup() (for duplicating an existing PidRef
on the heap 1:1), and pidref_new_pid() (for allocating a new PidRef from a
PID).
This helper truns a pid_t into a PidRef. It's different from
pidref_set_pid() in being "passive", i.e. it does not attempt to acquire
a pidfd for the pid.
This is useful when using the PidRef as a lookup key that shall also
work after a process is already dead, and hence no conversion to a pidfd
is possible anymore.
'systemctl status /../dev' now looks for 'dev.mount', not '-..-dev.service',
and 'systemctl status /../foo' looks for 'foo.mount', not '-..-foo.service'. I
think this much more useful. I think the escaping is not very useful, so I plan
to submit a later series which changes that behaviour. But I think this first
step here is already useful on its own.
Note that the patch is smaller than it seems: before, is_device_path() would
return true only for absolute paths, so moving of is_device_path() under the
path_is_absolute() conditional doesn't influence the logic.
So, unfortunately oomd uses "io.system." rather than "io.systemd." as
prefix for its sockets. This is a mistake, and doesn't match the
Varlink interface naming or anything else in oomd.
hence, let's fix that.
Given that this is an internal protocol between PID1 and oomd let's
simply change this without retaining compat.