Opening pidfds for non-thread-group leaders only works from kernel 6.9 onwards, with PIDFD_THREAD.
On older kernels, or without PIDFD_THREAD, pidfd_open() fails with EINVAL. Since we might read
non-thread-group-leader IDs from cgroup.threads, we introduce and set CGROUP_NO_PIDFD to avoid
trying to open pidfds for them, and instead use the PID as is.
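As a rough illustration only (Python rather than the actual C code), the effect of the flag can be sketched like this. CGROUP_NO_PIDFD is the flag named above; the helper name pid_refs() and its injectable opener parameter are hypothetical.

```python
import os

def pid_refs(pids, no_pidfd=False, opener=None):
    """Return (pid, pidfd) pairs; pidfd is None when we use the pid as is.

    no_pidfd plays the role of CGROUP_NO_PIDFD: it should be set when the
    IDs come from cgroup.threads and may not be thread group leaders.
    """
    if opener is None and not no_pidfd:
        opener = os.pidfd_open  # Linux-only, Python 3.9+
    refs = []
    for pid in pids:
        fd = None if no_pidfd else opener(pid)
        refs.append((pid, fd))
    return refs
```

With no_pidfd=True the opener is never called, so thread IDs never reach pidfd_open() and cannot trigger EINVAL.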
As per the documentation, EACCES is only returned when F_SETLK is
used, and only on some platforms, which doesn't seem to include
Linux:
https://github.com/torvalds/linux/blob/master/fs/locks.c
F_OFD_SETLK is documented to only return EAGAIN, and F_SETLKW/F_OFD_SETLKW
are blocking operations so this logic doesn't apply to them in the
first place.
Hence, only automatically convert EACCES into EAGAIN for F_SETLK
operations, and propagate the original error in the other cases.
This is important because in some cases we catch permission errors
and gracefully fall back, which is not possible if the original error
is lost.
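The rule above can be sketched as follows. This is an illustrative Python rendering, not the actual implementation, and the function name normalize_lock_errno() is hypothetical.

```python
import errno
import fcntl

def normalize_lock_errno(cmd, err):
    """Map EACCES to EAGAIN only for F_SETLK; pass everything else through."""
    if cmd == fcntl.F_SETLK and err == errno.EACCES:
        return errno.EAGAIN
    # Other commands keep the original error, so a caller can still
    # recognise a genuine permission problem and fall back.
    return err
```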
This is an issue in practice because, due to a kernel bug present
before v6.2, AppArmor denies locking on file descriptors to LXC
containers. We support all currently maintained LTS kernels,
including v6.1, where despite a lot of effort and attempts over almost
a year, the bugfix still hasn't been backported, as it is complex and
requires large changes to AppArmor.
On affected kernels, all services running with PrivateNetwork=yes
fail and do not recover, instead of the normal behaviour of gracefully
downgrading to PrivateNetwork=no.
The integration tests in the Debian CI fail due to this issue:
https://ci.debian.net/packages/s/systemd/testing/arm64/46828037/
"norecovery" was deprecated for btrfs in
74ef00185e
and removed in
a1912f7121.
Let's drop our assumption that btrfs supports "norecovery" and first query for the
new name of the option followed by querying for the old name.
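A minimal sketch of the "new name first, then old name" probing, in Python for illustration. The helper pick_mount_option() and the is_supported callback are hypothetical, and "rescue=nologreplay" as the new option name is an assumption, since the text above does not spell it out.

```python
def pick_mount_option(candidates, is_supported):
    """Return the first candidate the filesystem accepts, or None."""
    for option in candidates:
        if is_supported(option):
            return option
    return None

# Assumption: the replacement name for btrfs; the old name comes last.
BTRFS_NO_LOG_REPLAY = ("rescue=nologreplay", "norecovery")
```

On an older kernel that only knows the old name, the second candidate is picked; if neither is accepted, the caller gets None and can mount without the option.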
Use 'recommended' priority for the default compression library, to
indicate that it should be prioritized over the other ones, as it
will be used to compress journals/core files.
Also use 'recommended' for kmod, as systems will likely fail to boot
if it's missing from the initrd.
Use 'suggested' for everything else.
There is one dlopen'ed TPM library that has the name generated
at runtime (depending on the driver), so that cannot be added, as it
needs to be known at build time.
Also, when we support multiple ABI versions, list them all, since for the
same reason we cannot know at build time which one will be used.
$ dlopen-notes.py build/libsystemd.so.0.39.0 build/src/shared/libsystemd-shared-256.so
libarchive.so.13 suggested
libbpf.so.0 suggested
libbpf.so.1 suggested
libcryptsetup.so.12 suggested
libdw.so.1 suggested
libelf.so.1 suggested
libfido2.so.1 suggested
libgcrypt.so.20 suggested
libidn2.so.0 suggested
libip4tc.so.2 suggested
libkmod.so.2 recommended
liblz4.so.1 suggested
liblzma.so.5 suggested
libp11-kit.so.0 suggested
libpcre2-8.so.0 suggested
libpwquality.so.1 suggested
libqrencode.so.3 suggested
libqrencode.so.4 suggested
libtss2-esys.so.0 suggested
libtss2-mu.so.0 suggested
libtss2-rc.so.0 suggested
libzstd.so.1 recommended
Co-authored-by: Luca Boccassi <bluca@debian.org>
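To illustrate how a consumer might use the priorities in the listing above, here is a small Python sketch that parses "soname priority" lines and keeps those at or above a threshold. The helper names are hypothetical, and only the two priorities mentioned in this message are modelled.

```python
# Only the priorities used in this commit message are modelled here.
PRIORITY_RANK = {"suggested": 0, "recommended": 1}

def parse_listing(text):
    """Parse 'soname priority' lines into a dict, skipping anything else."""
    deps = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[1] in PRIORITY_RANK:
            deps[fields[0]] = fields[1]
    return deps

def at_least(deps, minimum):
    """Return the sorted sonames whose priority is at least 'minimum'."""
    floor = PRIORITY_RANK[minimum]
    return sorted(s for s, p in deps.items() if PRIORITY_RANK[p] >= floor)
```

An initrd generator could, for example, always include the result of at_least(deps, "recommended") and leave the "suggested" entries optional.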
This allows code to declare "weak" dlopen() style deps via an ELF
section following the just added specification.
The idea is that any user of dlopen() will place ELF_NOTE_DLOPEN(…)
somewhere close which will synthesize the note.
Tools such as rpm/dpkg package builders as well as initrd generators
(such as dracut) can then automatically pick up these weak or
suggested dependencies for their purposes.
Co-authored-by: Luca Boccassi <bluca@debian.org>
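For orientation, the kind of payload such a note might carry can be sketched in Python. This assumes the payload is a JSON array of objects with "soname" (a list, so several ABI versions fit in one entry) and "priority" keys; that layout is my reading of the specification and is not stated in this message. The function name dlopen_note_payload() is hypothetical.

```python
import json

def dlopen_note_payload(sonames, priority, description=None):
    """Build a JSON payload describing one weak dlopen() dependency.

    Assumption: the key names follow the dlopen metadata specification
    as I understand it; check the spec before relying on them.
    """
    entry = {"soname": list(sonames), "priority": priority}
    if description is not None:
        entry["description"] = description
    return json.dumps([entry])
```

For example, dlopen_note_payload(["libbpf.so.1", "libbpf.so.0"], "suggested") describes one dependency with two acceptable ABI versions.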
Follow-up for 34c3d57474
O_RDONLY is dropped when O_DIRECTORY is specified, since
it's unnecessary and even arguably confusing here, as
the dir is modified.
This lets systemd-repart respect the `SOURCE_DATE_EPOCH` environment
variable when creating directories in the local tree through `CopyFiles`
or `MakeDirectories`.
To do this, we pass a timestamp `ts` to `mkdir_p_root`, which it will
use to fix up `mtime` and `atime` of the directory it creates as
well as the `mtime` of the directory it creates the other directory *in*,
as the `mtime` of the latter is modified when creating a directory in it.
For the same reason, it also needs to fix up the `mtime` of the upper
directory when copying a file into it through `CopyFiles`.
If `SOURCE_DATE_EPOCH` is not set, times are left as is (`UTIME_OMIT`).
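The timestamp fix-up described above can be sketched in Python as follows. This is only an illustration of the idea, not `mkdir_p_root` itself; the helper names `source_date_epoch()` and `mkdir_with_timestamp()` are hypothetical, and "no timestamp" is modelled by simply skipping the `utime` calls rather than by `UTIME_OMIT`.

```python
import os

def source_date_epoch(environ=os.environ):
    """Return SOURCE_DATE_EPOCH as integer seconds, or None when unset."""
    value = environ.get("SOURCE_DATE_EPOCH")
    return int(value) if value else None

def mkdir_with_timestamp(path, ts):
    """Create 'path'; if ts is given, pin its times and the parent's mtime."""
    parent = os.path.dirname(os.path.abspath(path))
    os.mkdir(path)  # this also bumps the parent's mtime
    if ts is None:
        return      # leave all times as they are
    ts_ns = ts * 10**9
    os.utime(path, ns=(ts_ns, ts_ns))
    # Keep the parent's atime, only reset the mtime that mkdir changed.
    parent_atime_ns = os.stat(parent).st_atime_ns
    os.utime(parent, ns=(parent_atime_ns, ts_ns))
```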
Previously, _SOURCE_REALTIME_TIMESTAMP was only used for realtime
timestamp, and _SOURCE_MONOTONIC_TIMESTAMP was for monotonic.
This makes these journal fields be used more aggressively. If we need
a realtime timestamp, but an entry has only _SOURCE_MONOTONIC_TIMESTAMP,
then the realtime timestamp is now calculated based on
_SOURCE_MONOTONIC_TIMESTAMP and the header dual timestamp.
Similarly, the monotonic timestamp is obtained from
_SOURCE_REALTIME_TIMESTAMP and the header dual timestamp.
This should not change the shown timestamps much in most cases, but may
improve situations such as #32492.
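One plausible way to do this conversion is sketched below in Python. It assumes both clocks advanced by the same amount between the source timestamp and the header timestamp, so the header's dual timestamp gives the offset between the two clocks; the actual code may handle edge cases differently. The function names are hypothetical and all values are in microseconds.

```python
def derive_realtime(source_monotonic, header_realtime, header_monotonic):
    """Estimate the source realtime from the source monotonic timestamp."""
    return header_realtime + (source_monotonic - header_monotonic)

def derive_monotonic(source_realtime, header_realtime, header_monotonic):
    """Estimate the source monotonic from the source realtime timestamp."""
    return header_monotonic + (source_realtime - header_realtime)
```

For instance, with a header dual timestamp of (realtime 1000000, monotonic 400) and a source monotonic timestamp of 350, the derived realtime is 999950, i.e. 50 microseconds before the header's realtime.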
Required for integration tests to power off on PID 1 crashes. We
deprecate systemd.crash_reboot and related options by removing them
from the documentation but still parsing them.
While stracing PID1's forking off of children I noticed that every
single forked off child reads cap_last_cap from procfs. That value is a
kernel constant, hence we can save a lot of work if we cache it.
Thing is, we actually do cache it, in a thread_local cache field. This
means that the forked off processes (which are considered new threads)
will have to re-query it, even though we already know the result.
Hence, let's get rid of the thread_local stuff (given that the value is
going to be the same for all threads anyway, and we pretty much have a
single thread only anyway). Use a C11 atomic_int instead, which ensures
the value is either initialized or not initialized, so we don't need to
be concerned about partial initialization.
This makes the cap_last_cap reading go away in the children, as strace
shows (since cap_last_cap() is already called by PID 1 before
fork()ing, anyway).
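A loose Python analogue of the process-wide cache is sketched below; it is not the C implementation. A module-level variable stands in for the atomic_int, and the injectable reader parameter is hypothetical, added only so the caching can be observed.

```python
_cap_last_cap = None  # process-wide cache; None means "not read yet"

def _read_from_procfs():
    """Read the kernel constant from procfs."""
    with open("/proc/sys/kernel/cap_last_cap") as f:
        return int(f.read().strip())

def cap_last_cap(reader=None):
    """Return the cached value, reading it at most once per process."""
    global _cap_last_cap
    if _cap_last_cap is None:
        _cap_last_cap = (reader or _read_from_procfs)()
    return _cap_last_cap
```

Once the parent has called cap_last_cap(), a forked child inherits the already-filled variable and does not need to touch procfs again.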
Doing this in reset_terminal_fd() is a bit too invasive, see
https://github.com/systemd/systemd/pull/32406#issuecomment-2070923583.
Let's only do this for /dev/console so that we work around weird firmwares
disabling line-wrapping, but avoid messing too much with other things.
While we're at it, let's handle more than just line wrapping, and do a
more general reset of stuff to get the terminal into a sane state.
The qemu seabios firmware disables serial console line wrapping. Let's
make sure we re-enable it again when we reset a terminal to some sane
defaults.
To avoid potentially blocking on writing to the terminal, we put it
in nonblocking mode and add a timeout of 50ms.
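The nonblocking write with a timeout can be sketched in Python as follows. This is an illustration under stated assumptions, not the actual code: the helper name write_with_timeout() is hypothetical, and the escape sequence shown (ESC [ ? 7 h, which turns on autowrap) is just one example of what a reset might send.

```python
import os
import select
import time

ENABLE_LINE_WRAP = b"\x1b[?7h"  # DECAWM on: re-enable line wrapping

def write_with_timeout(fd, data, timeout=0.05):
    """Write all of 'data' to fd without blocking longer than 'timeout'.

    Returns True if everything was written, False on timeout.
    """
    was_blocking = os.get_blocking(fd)
    os.set_blocking(fd, False)
    try:
        deadline = time.monotonic() + timeout
        view = memoryview(data)
        while view:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return False
            _, writable, _ = select.select([], [fd], [], remaining)
            if not writable:
                return False
            try:
                written = os.write(fd, view)
            except BlockingIOError:
                continue  # not ready after all; wait again
            view = view[written:]
        return True
    finally:
        os.set_blocking(fd, was_blocking)  # restore the original mode
```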
We shouldn't try to use any ANSI escape sequences if TERM=dumb.
Also, the "\r\n" we output can get interpreted as a double newline
(for example by GitHub Actions), so let's output just "\n" when
TERM=dumb to clean up the CI logs.
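In Python terms, the behaviour amounts to something like the sketch below. The function name render_status_line() is hypothetical, and the particular ANSI sequences are only examples, not the ones actually emitted.

```python
def render_status_line(text, term):
    """Format one status line for the given TERM value."""
    if term == "dumb":
        # No escape sequences, and a plain newline to avoid double spacing.
        return text + "\n"
    # Example highlighting only; the real sequences may differ.
    return "\x1b[0;1m" + text + "\x1b[0m" + "\r\n"
```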
I'm working on the transition to merged sbin in Fedora. While the transition is
happening (and probably for a while after), we need to compile systemd with
split-bin=true to support systems upgraded from previous versions. But when the
system has been upgraded and already has /usr/sbin that is a symlink, be nice
and give $PATH without sbin.
We check for both /usr/sbin and /usr/local/sbin. If either exists and is not a
symlink to ./bin, we retain previous behaviour. This means that if both are
converted, we get the same behaviour as split-bin=false, and otherwise we
get the same behaviour as before.
sd-path uses the same logic. This is not a hot path, so I got rid of the nulstr
macros that duplicated the logic.
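The check can be sketched in Python like this. It is an illustration only: the function names, the root parameter, and the two example $PATH strings are hypothetical and not copied from the actual code, and a symlink counts as merged here when its target normalizes to "bin".

```python
import os

# Example values for illustration; the real defaults may differ.
PATH_SPLIT = "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin"
PATH_MERGED = "/usr/local/bin:/usr/bin"

def sbin_is_split(path):
    """True if 'path' exists and is not a symlink to ./bin."""
    if not os.path.lexists(path):
        return False
    if not os.path.islink(path):
        return True
    return os.path.normpath(os.readlink(path)) != "bin"

def default_path(root="/"):
    """Pick $PATH without sbin only if both sbin directories are merged."""
    candidates = [os.path.join(root, "usr/sbin"),
                  os.path.join(root, "usr/local/sbin")]
    if any(sbin_is_split(p) for p in candidates):
        return PATH_SPLIT
    return PATH_MERGED
```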
If we're already running in a unit with delegation turned on, let's
skip allocation of a scope unit and cgroup subroot. This allows journald
to correctly attribute the logs of all subprocesses spawned by tests such
as test-execute to the test-execute service when the test is running in a service.