An inotify FD may take several milliseconds to close. We measured
daemon-reload:
default: (0.427 ± 0.05) s
async: (0.323 ± 0.02) s
with 5 path units out of 422 units. I.e. ~1% of units cause ~25% of the
delay, hence this fix seems like low-hanging fruit on the daemon-reload
critical path.
Particular inotify slowness pointed out by @fbuihuu.
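For reference, the percentages quoted above can be re-derived from the measured means. This is only a quick arithmetic check; the variable names are made up for illustration and the error bars are ignored.

```python
# Figures copied from the measurement above.
default_s = 0.427      # mean daemon-reload time, default close (seconds)
async_s = 0.323        # mean daemon-reload time, async close (seconds)
path_units = 5
total_units = 422

saved_s = default_s - async_s              # absolute time saved per reload
saved_fraction = saved_s / default_s       # share of the default reload time
unit_fraction = path_units / total_units   # share of units that are path units

print(f"saved: {saved_s:.3f} s ({saved_fraction:.1%} of the default run)")
print(f"path units: {unit_fraction:.1%} of all units")
```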
To resolve the conflict of sys/mount.h with linux/mount.h or linux/fs.h.
The conflict between sys/mount.h and linux/mount.h is resolved in
glibc-2.37 (774058d72942249f71d74e7f2b639f77184160a6), but our baseline
is still glibc-2.31. Also, even with that version or newer, sys/mount.h
still conflicts with linux/fs.h, which is included by
linux/btrfs.h.
This introduces our own implementation of sys/mount.h, which can be
included simultaneously with linux/mount.h and linux/fs.h. This also
imports linux/fs.h, linux/mount.h, and several other dependent headers.
The introduced sys/mount.h header itself may not be simple enough, but
by using it we can drop most of the workarounds in other source files.
This delegates one or more namespaces to the service. Concretely,
this setting influences the order in which we unshare namespaces. Delegated
namespaces are unshared *after* the user namespace is unshared. Other
namespaces are unshared *before* the user namespace is unshared.
Fixes #35369
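The ordering rule described above can be sketched roughly as follows. This is a toy model of the ordering only, not the actual implementation; the function name and the namespace labels are hypothetical.

```python
def unshare_order(requested, delegated):
    """Return the order in which namespaces would be unshared.

    Non-delegated namespaces come first, then the user namespace,
    then the delegated ones.
    """
    before = [ns for ns in requested if ns not in delegated]
    after = [ns for ns in requested if ns in delegated]
    return before + ["user"] + after


# Example: the network namespace is delegated, mount and IPC are not.
order = unshare_order(["mount", "net", "ipc"], delegated={"net"})
print(order)
```

Here "net" ends up after "user", so it is unshared from within the new user namespace, while "mount" and "ipc" are unshared before it.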
(Note: we also change TEST-13-NSPAWN.machined.sh minimally here, because
it checks for byte precise output of a pty allocated for a service
invocation - which it's not going to get if it claims that the pty is an
all-powerful one. After all this PR ensures that we'll generate the new
OSC sequence on non-dumb terminals associated with services. Hence, set
TERM=dumb explicitly to ensure no ANSI sequences are generated, ever.
Which is a nice test btw that TERM=dumb really does its thing here.)
So far we conditioned the logic that issues ANSI sequences for resetting
the TTY on whether something is a pty or not (under the assumption that
we need no reset on ptys, since they are short-lived).
This is simply wrong though. The pty that a container getty is invoked
on is generally long-lived: it exists as long as the container is up, and
it is reused between getty instances/sessions all the time. In such a
case we really should reset properly.
Let's instead make the logic dependent on whether TERM is set to
anything other than "dumb". The previous commit made sure we always set
TERM in a sensible way in systemd-run, hence this
*explicit* logic sounds like a much better choice now, as it mea
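A minimal sketch of the TERM-based check, assuming that an unset or empty TERM is treated the same as "dumb". The function name is hypothetical and this is not the actual code.

```python
def should_reset_terminal(env):
    """Return True if ANSI reset sequences should be issued.

    Assumption for this sketch: an unset or empty TERM is treated like
    "dumb", i.e. no sequences are emitted.
    """
    term = env.get("TERM", "")
    return term not in ("", "dumb")


print(should_reset_terminal({"TERM": "xterm-256color"}))
print(should_reset_terminal({"TERM": "dumb"}))
```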
Let's convert the 2nd argument from a boolean to a proper flags
parameter. Doesn't change behaviour in any way, but is more readable, and
prepares ground for adding more flags soon.
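To illustrate the general pattern (not the function touched by this commit), here is a sketch of replacing a boolean parameter with a flags parameter. All names are hypothetical.

```python
from enum import Flag, auto


class DescribeFlags(Flag):
    """Hypothetical flag set replacing a single boolean parameter."""
    NONE = 0
    VERBOSE = auto()   # the behaviour the old boolean used to toggle
    # Further flags can be added here without touching existing callers.


def describe_old(name, verbose):
    # Before: call sites read describe_old("x", True), which says little.
    return f"{name} (verbose)" if verbose else name


def describe_new(name, flags=DescribeFlags.NONE):
    # After: call sites read describe_new("x", DescribeFlags.VERBOSE).
    return f"{name} (verbose)" if DescribeFlags.VERBOSE in flags else name


print(describe_new("x", DescribeFlags.VERBOSE))
```

Behaviour is identical for both variants; only the call sites become self-describing.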
Follow-up for 3bd28bf721
The SERVICE_RELOAD_SIGNAL state can only be reached via explicit reload jobs,
and we have a clear distinction between that and plain RELOADING=1
notifications, the latter of which are issued by clients doing a reload
outside of our job engine. I.e. upon SERVICE_RELOAD_SIGNAL + RELOADING=1
we don't propagate reload jobs again, since that's already done during the
transaction construction stage. The handling of combined RELOADING=1 + READY=1
has been bogus so far however, as it tries to propagate duplicate reload jobs.
Amend this by following the logic for standalone RELOADING=1.
This has been deprecated since v254 (2023). Let's kill it for
good now, two years has been long enough. No one has shown up who wants
to keep it. And given that it doesn't work in the SB world anyway, and is
not measured, it is quite problematic security-wise.
In Arch Linux we currently have half-baked AppArmor support,
in particular we cannot link systemd to libapparmor for service
context integration, since that would pull AppArmor into the base system.
Hence, let's turn this into a dlopen dep.
Ref: https://gitlab.archlinux.org/archlinux/packaging/packages/systemd/-/issues/22
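The idea of an optional runtime-loaded dependency can be sketched like this, using ctypes as a stand-in for dlopen(). The soname "libapparmor.so.1" and the helper name are assumptions for illustration; the real change is in C.

```python
import ctypes


def try_load(soname):
    """Return a library handle, or None if the library is not installed."""
    try:
        return ctypes.CDLL(soname)
    except OSError:
        return None


# "libapparmor.so.1" is an assumed soname for illustration.
apparmor = try_load("libapparmor.so.1")
if apparmor is None:
    print("AppArmor integration disabled: library not available")
else:
    print("AppArmor library loaded at runtime")
```

With this pattern the library is only needed on systems that actually use the feature, rather than being a hard link-time dependency.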
This ensures the child process is immediately re-parented to the
manager process which avoids a "Supervising process xxx which is
not our child. We'll most likely not notice when it exits." warning
which can currently happen if the systemd-executor parent
process sends the pid namespace child process pidref to the manager
process and the manager process dispatches the child process pidref
before the systemd-executor parent process exits, since at that point
the pid namespace child process's parent will still be the
systemd-executor parent process and not the manager process.
This also ports over things to use chase() to create/pin the underlying
to mount, and in particular checks that the path does not contain any
symlinks. That's crucial since we cannot allow mounts to be established
on paths containing symlinks, since it would mean we couldn't recognize
the entries in /proc/self/mountinfo anymore.
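For illustration, a "no symlinks in the path" check could look roughly like the sketch below. Note this is a naive, racy userspace walk and not equivalent to what chase() does, which pins the path while resolving it. The function name is hypothetical.

```python
import os
import tempfile


def has_symlink_component(path):
    """Return True if any existing component of the path is a symlink."""
    current = os.sep
    for part in os.path.abspath(path).split(os.sep):
        if not part:
            continue
        current = os.path.join(current, part)
        if os.path.islink(current):
            return True
    return False


# Demo in a scratch directory (realpath() so the base itself is symlink-free).
base = os.path.realpath(tempfile.mkdtemp())
real_dir = os.path.join(base, "real")
link_dir = os.path.join(base, "link")
os.mkdir(real_dir)
os.symlink(real_dir, link_dir)

print(has_symlink_component(real_dir))
print(has_symlink_component(os.path.join(link_dir, "sub")))
```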
09fbff57fc introduced a new knob
for such functionality. However, that seems unnecessary.
The mount option string is ubiquitous in that all of fstab,
kernel cmdline, credentials, systemd-mount, ... speak it.
And we already have x-systemd.device-bound= that's parsed
by pid1 instead of fstab-generator. It feels hence more natural
for graceful options to be an extension of that, rather than
its own property.
There's also one nice side effect: the setting itself
is now more graceful for systemd versions not supporting
such a feature.
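To show why the option string is a convenient carrier, here is a rough sketch of separating "x-systemd.*" options from the rest of a mount option string. The helper is hypothetical and ignores quoting and escaping, which a real parser has to handle.

```python
def split_mount_options(options):
    """Split a comma-separated mount option string.

    Returns (x_systemd, other): a dict of "x-systemd.*" options consumed by
    userspace, and the list of remaining options passed through unchanged.
    """
    x_systemd, other = {}, []
    for opt in filter(None, options.split(",")):
        if opt.startswith("x-systemd."):
            key, _, value = opt.partition("=")
            x_systemd[key] = value
        else:
            other.append(opt)
    return x_systemd, other


meta, rest = split_mount_options("rw,noatime,x-systemd.device-bound=yes")
print(meta)
print(rest)
```

Any consumer that does not know a given "x-systemd.*" key can skip it, which is where the graceful behaviour on older versions comes from.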
Follow-up for a1d315730f
and 6ac62d61db
With the aforementioned commits, unit_release_resources()
is dispatched in a dedicated queue, and Service.n_keep_fd_store
has been dropped, hence the comment is outdated. Moreover,
the unit is added to the GC queue in unit_notify() already.
No other unit types do this in their corresponding _enter_dead()
functions, nor does Service need it anymore.
I.e. don't perform any action if we can't spawn a mount task anyway.
Later the same check will be added to mount_can_start/reload(),
so this makes things more coherent too.
There are cases where templated socket unit files are used for network services
with the interface name used as the instance name. This patch allows using %i
in the BindToDevice= setting to limit the scope automatically.
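As a rough illustration of what %i resolves to for a templated unit, the sketch below extracts the instance part of a unit name and substitutes it. The unit name "myservice@eth0.socket" and both helpers are made up; real specifier expansion covers many more specifiers and handles escaping.

```python
def unit_instance(unit_name):
    """Return the instance part of a templated unit name, or "" if none."""
    _prefix, at, rest = unit_name.partition("@")
    if not at:
        return ""
    instance, _, _suffix = rest.rpartition(".")
    return instance


def expand_instance(value, unit_name):
    # Only %i is handled here; real specifier expansion covers many more.
    return value.replace("%i", unit_instance(unit_name))


# A setting value of "%i" in the template resolves to the interface name.
print(expand_instance("%i", "myservice@eth0.socket"))
```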
Some bind mounts, e.g. the /tmp bind mount when PrivateTmp=disconnected,
must be explicitly relabeled because otherwise they carry an incorrect
SELinux label. /tmp is expected to have the well-known SELinux label tmp_t,
but currently it has the label inherited from the source directory of the
bind mount.
This follows the same recent change in util-linux:
https://github.com/util-linux/util-linux/pull/3354
I.e. we generally want PAM modules to be able to override $HOME, and the
override to be honoured for the CWD after login.
(This renames the 'home' variable we maintained so far to 'pwent_home',
to clarify that it's the home directory listed in the struct passwd
entry, and thus not necessarily the one actually used.)
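The intended precedence can be sketched as below, assuming the PAM-provided environment is available as a plain mapping. The function name is hypothetical; only 'pwent_home' mirrors the variable name mentioned above.

```python
def login_cwd(pwent_home, pam_env):
    """Pick the directory to change into after login.

    A non-empty $HOME set by a PAM module wins over the home directory
    from the struct passwd entry.
    """
    return pam_env.get("HOME") or pwent_home


print(login_cwd("/home/alice", {"HOME": "/run/user-homes/alice"}))
print(login_cwd("/home/alice", {}))
```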