Use the flags MEMFD_EXEC or MEMFD_NOEXEC_SEAL as applicable.
These warnings instruct the kernel wether the memfd is executable or
not.
Without specifying those flags the kernel will emit the following
warning since version 6.3,
commit 105ff5339f49 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC"):
kernel: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=1 'systemd'
For nspawn and services we first turn off two-way propagation of mounts
from host to sandbox via MS_SLAVE, and then set MS_SHARED again, so that
we create a new mount prop peer group again, and that we provide
behaviour similar to what we provide on the host further down the tree.
Let's do the same in detach_mount_namespace(), which we use for the
temporary mounts in the implementation of --image= in various tools.
This doesn't fix any immediate issue, but ensures we expose somewhat
systematic behaviour: whenever we detach mount namespaces we always set
things back to MS_SLAVE in the child.
On some ARM platforms, the dynamic linker could use PROT_BTI memory protection
flag with `mprotect(..., PROT_BTI | PROT_EXEC)` to enable additional memory
protection for executable pages. But `MemoryDenyWriteExecute=yes` blocks this
with seccomp filter denying all `mprotect(..., x | PROT_EXEC)`.
Newly preferred method is to use prctl(PR_SET_MDWE) on supported kernels. Then
in-kernel implementation can allow PROT_BTI as necessary, without weakening
MDWE. In-kernel version may also be extended to more sophisticated protections
in the future.
All daemons use a similar scheme to read their main config files and theirs
drop-ins. The main config files are always stored in /etc/systemd directory and
it's easy enough to construct the name of the drop-in directories based on the
name of the main config file.
Hence the new helper does that internally, which allows to reduce and simplify
the args passed previously to config_parse_many_nulstr().
Besides the overall code simplification it results:
16 files changed, 87 insertions(+), 159 deletions(-)
it allows to identify clearly the locations in the code where configuration
files are parsed.
In various tools and services we have a per-system and per-user concept.
So far we sometimes used a boolean indicating whether we are in system
mode, or a reversed boolean indicating whether we are in user mode, or
the LookupScope enum used by the lookup path logic.
Let's address that, in introduce a common enum for this, we can use all
across the board.
This is mostly just search/replace, no actual code changes.
This changes the PSI window duration we default to for watching memory
pressure events from 1s to 2s. This is because apparently the kernel
will soon disallow window durations other than 2s for unprivileged
processes.
Hence, we'll bump the threshold from 100m to 200ms, and the window from
1s to 2s.
Add macros to manage bits in a bitfield (e.g. uint32_t, uint64_t, etc),
such as setting, clearing, checking bits, and iterating all set bits.
These are similiar to the bitmap operations, but operate on basic types
instead of requiring a Bitmap object.