Multipath TCP (MPTCP), standardized in RFC8684 [1], is a TCP extension
that enables a TCP connection to use different paths. It allows a device
to make use of multiple interfaces at once to send and receive TCP
packets over a single MPTCP connection. MPTCP can aggregate the
bandwidth of multiple interfaces or prefer the one with the lowest
latency, it also allows a fail-over if one path is down, and the traffic
is seamlessly re-injected on other paths.
To benefit from MPTCP, both the client and the server have to support
it. Multipath TCP is a backward-compatible TCP extension that is enabled
by default on recent Linux distributions (Debian, Ubuntu, Redhat, ...).
Multipath TCP is included in the Linux kernel since version 5.6 [2]. To
use it on Linux, an application must explicitly enable it when creating
the socket:
int sd = socket(AF_INET(6), SOCK_STREAM, IPPROTO_MPTCP);
No need to change anything else in the application.
This patch allows MPTCP protocol in the Socket unit configuration. So
now, a <unit>.socket can contain this to use MPTCP instead of TCP:
[Socket]
SocketProtocol=mptcp
MPTCP support has been allowed similarly to what has been already done
to allow SCTP: just one line in core/socket.c, a very simple addition
thanks to the flexible architecture already in place.
On top of that, IPPROTO_MPTCP has also been added in the list of allowed
protocols in two other places, and in the doc. It has also been added to
the missing_network.h file, for systems with an old libc -- note that it
was also required to include <netinet/in.h> in this file to avoid
redefinition errors.
Link: https://www.rfc-editor.org/rfc/rfc8684.html [1]
Link: https://www.mptcp.dev [2]
Currently the check also succeeds if the input path starts with a dot, whereas
we only want it to succeed for "." and "./". Tighten the check and add a test.
Import magic.h from Linux 6.9 to get the definition of
BCACHEFS_SUPER_MAGIC. Update filesystems-gperf.gperf to add knowledge of
bcachefs.
This fixes the following error building against a bleeding edge kernel.
```
src/basic/meson.build:234:8: ERROR: Problem encountered: Unknown filesystems defined in kernel headers:
Filesystem found in kernel header but not in filesystems-gperf.gperf: BCACHEFS_SUPER_MAGIC
```
fd_cloexec_many promised to report if work was done, but that code was
not effective, because it always reported true if any fds were open.
But no callers care about the return value, so let's just drop this.
This makes output a bit shorter and nicer. For us, shorter output is generally
better.
Also, drop unnecessary UINT64_C macros. The left operand is always uint64_t,
and C upcasting rules mean that it doesn't matter if the right operand is
narrower or signed, the operation is always done on the wider unsigned type.
Opening pidfds for non thread group leaders only works from 6.9 onwards with PIDFD_THREAD. On
older kernels or without PIDFD_THREAD pidfd_open() fails with EINVAL. Since we might read non
thread group leader IDs from cgroup.threads, we introduce and set CGROUP_NO_PIDFD to avoid
trying open pidfd's for them and instead use the pid as is.
As per the documentation, EACCES is only returned when F_SETLK is
used, and only on some platforms, which doesn't seem to include
Linux:
https://github.com/torvalds/linux/blob/master/fs/locks.c
F_OFD_SETLK is documented to only return EAGAIN, and F_SETLKW/F_OFD_SETLKW
are blocking operations so this logic doesn't apply to them in the
first place.
Hence, only automatically convert EACCES into EAGAIN for F_SETLK
operations, and propagate the original error in the other cases.
This is important because in some cases we catch permission errors
and gracefully fallback, which is not possible if the original error
is lost.
This is an issue in practice because, due to a kernel bug present
before v6.2, AppArmor denies locking on file descriptors to LXC
containers. We support all currently maintained LTS kernels,
including v6.1, where despite a lot of effort and attempts over almost
a year, the bugfix still hasn't been backported, as it is complex and
requires large changes to AppArmor.
On affected kernels, all services running with PrivateNetwork=yes
fail and do not recover, instead of the normal behaviour of gracefully
downgrading to PrivateNetwork=no.
The integration tests in the Debian CI fail due to this issue:
https://ci.debian.net/packages/s/systemd/testing/arm64/46828037/
Currently, if proc_mounted() != 0, some functions
propagate -ENOENT while others return -EBADF.
Let's make things consistent, by introducing
a static inline helper responsible for finding out
the appropriate errno.
"norecovery" was deprecated for btrfs in
74ef00185e
and removed in
a1912f7121.
Let's drop our assumption that btrfs supports "norecovery" and first query for the
new name of the option followed by querying for the old name.
Use 'recommended' priority for the default compression library, to
indicate that it should be prioritized over the other ones, as it
will be used to compress journals/core files.
Also use 'recommended' for kmod, as systems will likely fail to boot
if it's missing from the initrd.
Use 'suggested' for everything else.
There is one dlopen'ed TPM library that has the name generated
at runtime (depending on the driver), so that cannot be added, as it
needs to be known at build time.
Also when we support multiple ABI versions list them all, as for the
same reason we cannot know which one will be used at build time.
$ dlopen-notes.py build/libsystemd.so.0.39.0 build/src/shared/libsystemd-shared-256.so
libarchive.so.13 suggested
libbpf.so.0 suggested
libbpf.so.1 suggested
libcryptsetup.so.12 suggested
libdw.so.1 suggested
libelf.so.1 suggested
libfido2.so.1 suggested
libgcrypt.so.20 suggested
libidn2.so.0 suggested
libip4tc.so.2 suggested
libkmod.so.2 recommended
liblz4.so.1 suggested
liblzma.so.5 suggested
libp11-kit.so.0 suggested
libpcre2-8.so.0 suggested
libpwquality.so.1 suggested
libqrencode.so.3 suggested
libqrencode.so.4 suggested
libtss2-esys.so.0 suggested
libtss2-mu.so.0 suggested
libtss2-rc.so.0 suggested
libzstd.so.1 recommended
Co-authored-by: Luca Boccassi <bluca@debian.org>
This allows code to declare "weak" dlopen() style deps via an ELF
section following the just added specification.
The idea is that any user of dlopen() will place ELF_NOTE_DLOPEN(…)
somewhere close which will synthesize the note.
Tools such as rpm/dpkg package builders as well as initrd generators
(such as dracut) can then automatically pick up these weak deps of
suggested dependencies for their purposes.
Co-authored-by: Luca Boccassi <bluca@debian.org>
Follow-up for 34c3d57474
O_RDONLY is dropped when O_DIRECTORY is specified, since
it's unnecessary and even arguably confusing here, as
the dir is modified.
This let's systemd-repart respect the `SOURCE_DATE_EPOCH` environment
variable when creating directories in the local tree through `CopyFiles`
or `MakeDirectories`.
To do this, we pass a timestamp `ts` to `mkdir_p_root`, which it will
use to fix up `mtime` and `atime` of the directory it creates as
well as the `mtime` of the directory it creates the other directory *in*,
as the `mtime` of the latter is modified when creating a directory in it.
For the same reason, it also needs to fixup the `mtime` of the upper
directory when copying a file into it through `CopyFiles`.
If `SOURCE_DATE_EPOCH`, times are left as is. (`UTIME_OMIT`)
Previously, _SOURCE_REALTIME_TIMESTAMP was only used for realtime
timestamp, and _SOURCE_MONOTONIC_TIMESTAMP was for monotonic.
This make these journal field used more aggressively. If we need
realtime timestamp, but an entry has only _SOURCE_MONOTONIC_TIMESTAMP,
then now realtime timestamp is calculated based on
_SOURCE_MONOTONIC_TIMESTAMP and the header dual timestamp.
Similary, monotonic timestamp is obtained from
_SOURCE_REALTIME_TIMESTAMP and the header dual timestamp.
This should change shown timestamps not so much in most cases, but may
be improve the situation such as #32492.
Required for integration tests to power off on PID 1 crashes. We
deprecate systemd.crash_reboot and related options by removing them
from the documentation but still parsing them.