Import magic.h from Linux 6.9 to get the definition of
BCACHEFS_SUPER_MAGIC. Update filesystems-gperf.gperf to add knowledge of
bcachefs.
This fixes the following error building against a bleeding edge kernel.
```
src/basic/meson.build:234:8: ERROR: Problem encountered: Unknown filesystems defined in kernel headers:
Filesystem found in kernel header but not in filesystems-gperf.gperf: BCACHEFS_SUPER_MAGIC
```
fd_cloexec_many promised to report if work was done, but that code was
not effective, because it always reported true if any fds were open.
But no callers care about the return value, so let's just drop this.
This makes output a bit shorter and nicer. For us, shorter output is generally
better.
Also, drop unnecessary UINT64_C macros. The left operand is always uint64_t,
and C upcasting rules mean that it doesn't matter if the right operand is
narrower or signed, the operation is always done on the wider unsigned type.
Opening pidfds for non thread group leaders only works from 6.9 onwards with PIDFD_THREAD. On
older kernels or without PIDFD_THREAD pidfd_open() fails with EINVAL. Since we might read non
thread group leader IDs from cgroup.threads, we introduce and set CGROUP_NO_PIDFD to avoid
trying open pidfd's for them and instead use the pid as is.
As per the documentation, EACCES is only returned when F_SETLK is
used, and only on some platforms, which doesn't seem to include
Linux:
https://github.com/torvalds/linux/blob/master/fs/locks.c
F_OFD_SETLK is documented to only return EAGAIN, and F_SETLKW/F_OFD_SETLKW
are blocking operations so this logic doesn't apply to them in the
first place.
Hence, only automatically convert EACCES into EAGAIN for F_SETLK
operations, and propagate the original error in the other cases.
This is important because in some cases we catch permission errors
and gracefully fallback, which is not possible if the original error
is lost.
This is an issue in practice because, due to a kernel bug present
before v6.2, AppArmor denies locking on file descriptors to LXC
containers. We support all currently maintained LTS kernels,
including v6.1, where despite a lot of effort and attempts over almost
a year, the bugfix still hasn't been backported, as it is complex and
requires large changes to AppArmor.
On affected kernels, all services running with PrivateNetwork=yes
fail and do not recover, instead of the normal behaviour of gracefully
downgrading to PrivateNetwork=no.
The integration tests in the Debian CI fail due to this issue:
https://ci.debian.net/packages/s/systemd/testing/arm64/46828037/
Currently, if proc_mounted() != 0, some functions
propagate -ENOENT while others return -EBADF.
Let's make things consistent, by introducing
a static inline helper responsible for finding out
the appropriate errno.
"norecovery" was deprecated for btrfs in
74ef00185e
and removed in
a1912f7121.
Let's drop our assumption that btrfs supports "norecovery" and first query for the
new name of the option followed by querying for the old name.
Use 'recommended' priority for the default compression library, to
indicate that it should be prioritized over the other ones, as it
will be used to compress journals/core files.
Also use 'recommended' for kmod, as systems will likely fail to boot
if it's missing from the initrd.
Use 'suggested' for everything else.
There is one dlopen'ed TPM library that has the name generated
at runtime (depending on the driver), so that cannot be added, as it
needs to be known at build time.
Also when we support multiple ABI versions list them all, as for the
same reason we cannot know which one will be used at build time.
$ dlopen-notes.py build/libsystemd.so.0.39.0 build/src/shared/libsystemd-shared-256.so
libarchive.so.13 suggested
libbpf.so.0 suggested
libbpf.so.1 suggested
libcryptsetup.so.12 suggested
libdw.so.1 suggested
libelf.so.1 suggested
libfido2.so.1 suggested
libgcrypt.so.20 suggested
libidn2.so.0 suggested
libip4tc.so.2 suggested
libkmod.so.2 recommended
liblz4.so.1 suggested
liblzma.so.5 suggested
libp11-kit.so.0 suggested
libpcre2-8.so.0 suggested
libpwquality.so.1 suggested
libqrencode.so.3 suggested
libqrencode.so.4 suggested
libtss2-esys.so.0 suggested
libtss2-mu.so.0 suggested
libtss2-rc.so.0 suggested
libzstd.so.1 recommended
Co-authored-by: Luca Boccassi <bluca@debian.org>
This allows code to declare "weak" dlopen() style deps via an ELF
section following the just added specification.
The idea is that any user of dlopen() will place ELF_NOTE_DLOPEN(…)
somewhere close which will synthesize the note.
Tools such as rpm/dpkg package builders as well as initrd generators
(such as dracut) can then automatically pick up these weak deps of
suggested dependencies for their purposes.
Co-authored-by: Luca Boccassi <bluca@debian.org>
Follow-up for 34c3d57474
O_RDONLY is dropped when O_DIRECTORY is specified, since
it's unnecessary and even arguably confusing here, as
the dir is modified.
This let's systemd-repart respect the `SOURCE_DATE_EPOCH` environment
variable when creating directories in the local tree through `CopyFiles`
or `MakeDirectories`.
To do this, we pass a timestamp `ts` to `mkdir_p_root`, which it will
use to fix up `mtime` and `atime` of the directory it creates as
well as the `mtime` of the directory it creates the other directory *in*,
as the `mtime` of the latter is modified when creating a directory in it.
For the same reason, it also needs to fixup the `mtime` of the upper
directory when copying a file into it through `CopyFiles`.
If `SOURCE_DATE_EPOCH`, times are left as is. (`UTIME_OMIT`)
Previously, _SOURCE_REALTIME_TIMESTAMP was only used for realtime
timestamp, and _SOURCE_MONOTONIC_TIMESTAMP was for monotonic.
This make these journal field used more aggressively. If we need
realtime timestamp, but an entry has only _SOURCE_MONOTONIC_TIMESTAMP,
then now realtime timestamp is calculated based on
_SOURCE_MONOTONIC_TIMESTAMP and the header dual timestamp.
Similary, monotonic timestamp is obtained from
_SOURCE_REALTIME_TIMESTAMP and the header dual timestamp.
This should change shown timestamps not so much in most cases, but may
be improve the situation such as #32492.
Required for integration tests to power off on PID 1 crashes. We
deprecate systemd.crash_reboot and related options by removing them
from the documentation but still parsing them.
While stracing PID1's forking off of children I noticed that every
single forked off child reads cap_last_cap from procfs. That value is a
kernel constant, hence we can save a lot of work if we'd cache it.
Thing is, we actually do cache it, in a thread_local cache field. This
means that the forked off processes (which are considered new threads)
will have to re-query it, even though we already know the result.
Hence, let's get rid of the thread_local stuff (given that the value is
going to be the same for all threads anyway, and we pretty much have a
single thread only anyway). Use an C11 atomic_int instead, which ensures
the value is either initialized or not initialized, but we don't need to
be concerned of partial initialization.
This makes the cap_last_cap reading go away in the children, as strace
shows (since cap_last_cap() is already called by PID 1 before
fork()ing, anyway).
Doing this in reset_terminal_fd() is a bit too invasive, see
https://github.com/systemd/systemd/pull/32406#issuecomment-2070923583.
Let's only do this for /dev/console so that we work around weird firmwares
disabling line-wrapping, but avoid messing too much with other things.
While we're at it, let's handle more than just line wrapping, and do a
more general reset of stuff to get the terminal into a sane state.