It's not clear why we rescheduled a service auto restart while a stop job for
the unit was pending. The comment claims that the unit shouldn't be restarted
but the code did reschedule an auto restart meanwhile.
In practice that was rarely an issue because the service waited for the next
auto restart to be rescheduled, letting the queued stop job to be proceed and
service_stop() to be called preventing the next restart to complete.
However when RestartSec=0, the timer expired right away making PID1 to
reschedule the unit again, making the timer expired right away... and so
on. This busy loop prevented PID1 to handle any queued jobs (and hence giving
no chance to the start rate limiting to trigger), which made the busy loop last
forever.
This patch breaks this loop by skipping the reschedule of the unit auto restart
and hence not depending on the value of u->restart_usec anymore.
Fixes: #13667
There's currently a deadlock between PID 1 and dbus-daemon: in some
cases dbus-daemon will do NSS lookups (which are blocking) at the same
time PID 1 synchronously blocks on some call to dbus-daemon. Let's break
that by setting SYSTEMD_NSS_DYNAMIC_BYPASS=1 env var for dbus-daemon,
which will disable synchronously blocking varlink calls from nss-systemd
to PID 1.
In the long run we should fix this differently: remove all synchronous
calls to dbus-daemon from PID 1. This is not trivial however: so far we
had the rule that synchronous calls from PID 1 to the dbus broker are OK
as long as they only go to interfaces implemented by the broke itself
rather than services reachable through it. Given that the relationship
between PID 1 and dbus is kinda special anyway, this was considered
acceptable for the sake of simplicity, since we quite often need
metadata about bus peers from the broker, and the asynchronous logic
would substantially complicate even the simplest method handlers.
This mostly reworks the existing code that sets SYSTEMD_NSS_BYPASS_BUS=
(which is a similar hack to deal with deadlocks between nss-systemd and
dbus-daemon itself) to set SYSTEMD_NSS_DYNAMIC_BYPASS=1 instead. No code
was checking SYSTEMD_NSS_BYPASS_BUS= anymore anyway, and it used to
solve a similar problem, hence it's an obvious piece of code to rework
like this.
Issue originally tracked down by Lukas Märdian. This patch is inspired
and closely based on his patch:
https://github.com/systemd/systemd/pull/22038Fixes: #15316
Co-authored-by: Lukas Märdian <slyon@ubuntu.com>
A first step of removing blocking calls to the D-Bus broker from PID 1.
There's a lot more to got (i.e. grep src/core/ for sd_bus_creds
basically), but it's a start.
Removing blocking calls to D-Bus broker deals systematicallly with
deadlocks caused by dbus-daemon blocking on synchronous IPC calls back
to PID1 (e.g. Varlink calls through nss-systemd). Bugs such as #15316.
Also-see: https://github.com/systemd/systemd/pull/22038#issuecomment-1042958390
Previously we'd only watch configured service bus names if Type=dbus was
set. Let's also watch it for other types. This is useful to pick up the
main PID of such a service. In fact the code to pick it up was already
in place, alas it didn't do anything given the signal was never received
for it. Fix that.
(It's also useful for debugging)
We would print "Setting NUMA policy to bind, with nodes .".
This is not very clear, change it to "… with nodes {}.".
Also use range formatting for masks to make output shorter.
For https://bugzilla.redhat.com/show_bug.cgi?id=1986176:
if we are trying to reexecute, and this fails for any reason, we shouldn't
try to execute /sbin/init or /bin/sh. It is better to just freeze.
If we freeze it is easier to diagnose what happened, but if we execute
one of the fallbacks, we don't really know what will happen. In particular
the new init might just return, causing the machine to shut down. Or we
may successfully spawn /bin/sh, which could leave the machine open.
If manager_loop() fails, we would print an error message, but then actually
ignore the error in main(), and potentially execute the shutdown binary.
I'm not sure how likely this is to happen in practice, but it seems sloppy.
So let's do the cleanup, but actually freeze() if manager_loop() returned
an error.
invoke_main_loop() is refactored to return the manager objective. This way
we don't need to pass a separate parameter to specify whether we are
reexecuting. Subsequent patch will make further use of the returned objective.
The cgroupid feature was not available in old cgroupvs2 kernels, hence
try to get it but if we can't because it's not supported, then only
debug log about it and proceed.
(We only needs this for cgroup bpf stuff, but that isn't available on
such old kernels anyway)
Fixes: #22483
Even though ISO C11 doesn't mandate in which order the type specifiers
should appear, having `unsigned` at the beginning of each type
declaration feels more natural and, more importantly, it unbreaks
Coccinelle, which has a hard time parsing `long unsigned` and others:
```
init_defs_builtins: /usr/lib64/coccinelle/standard.h
init_defs: /home/mrc0mmand/repos/systemd/coccinelle/macros.h
HANDLING: src/shared/mount-util.c
: 1: strange type1, maybe because of weird order: long unsigned
```
Most of the codebase already "complies", so let's fix the remaining
"offenders".
systemd[1016]: Failed to mount /tmp/app1 (type n/a) on /run/systemd/unit-extensions/1 (MS_BIND ): No such file or directory
systemd[1016]: Failed to create destination mount point node '/run/systemd/unit-extensions/1': File exists
A bind mount is added directly from private on the host to the actual
destination directory, no need for the symlinks (which cannot be created
as the bind mount happens first and creates the target as an actual directory)
Fixes https://github.com/systemd/systemd/issues/22264
When a Condition*= fails, and a service has Restart=always,
the service is not restarted.
Follow the same behaviour for ExecCondition= to avoid inconsistencies.
Fixes#22257
Rename the normalize_mounts() helper to drop_unused_mounts. All the
helpers called in there get rid of mounts that are unused for a variety
of reasons. And whereas the helpers are aptly prefixed with "drop" the
overall helper isn't and instead uses "normalize".
Make it more obvious what the helper actually does by renaming it from
normalize_mounts() to drop_unused_mounts(). Readers of code calling this
helper will immediately see that it will get rid of unused mounts.
Link: https://github.com/systemd/systemd/issues/22206
If a service requests both ProtectSubset=pid and ProtectHostname=true
then it will currently fail to start. The ProcSubset=pid option
instructs systemd to mount procfs for the service with subset=pid which
hides all entries other than /proc/<pid>. Consequently trying to
interact with the two files /proc/sys/kernel/{hostname,domainname}
covered by ProtectHostname=true will fail.
Fix this by only performing this check when ProtectSubset=pid is not
requested. Essentially ProtectSubset=pid implies/provides
ProtectHostname=true.
bpf-firewall and bpf-devices do not have names. This complicates
debugging with bpftool(8).
Assign names starting with 'sd_' prefix:
* firewall program names are 'sd_fw_ingress' for ingress attach
point and 'sd_fw_egress' for egress.
* 'sd_devices' for devices prog
'sd_' prefix is already used in source-compiled programs, e.g.
sd_restrictif_i, sd_restrictif_e, sd_bind6.
The name must not be longer than 15 characters or BPF_OBJ_NAME_LEN - 1.
Assign names only to programs loaded to kernel by systemd since
programs pinned to bpffs are already loaded.
Add a new setting that follows the same principle and implementation
as ExtensionImages, but using directories as sources.
It will be used to implement support for extending portable images
with directories, since portable services can already use a directory
as root.
Fixes#6308: people want to be able to link a unit file via 'systemctl enable'
from a git checkout or such and refer to other files in the same repo.
The new specifiers make that easy.
%y/%Y is used because other more obvious choices like %d/%D or %p/%P are
not available because at least on of the two letters is already used.
The new specifiers are only available in units. Technically it would be
trivial to add then in [Install] too, but I don't see how they could be
useful, so I didn't do that.
I added both %y and %Y because both were requested in the issue, and because I
think both could be useful, depending on the case. %Y to refer to other files
in the same repo, and %y in the case where a single repo has multiple unit files,
and e.g. each unit has some corresponding asset named after the unit file.