The usual behaviour when a timeout expires is to terminate/kill the
service. This is what user usually want in production systems. To debug
services that fail to start/stop (especially sporadic failures) it
might be necessary to trigger the watchdog machinery and write core
dumps, though. Likewise, it is usually just a waste of time to
gracefully stop a stuck service. Instead it might save time to go
directly into kill mode.
This commit adds two new options to services: TimeoutStartFailureMode=
and TimeoutStopFailureMode=. Both take the same values and tweak the
behavior of systemd when a start/stop timeout expires:
* 'terminate': is the default behaviour as it has always been,
* 'abort': triggers the watchdog machinery and will send SIGABRT
(unless WatchdogSignal was changed) and
* 'kill' will directly send SIGKILL.
To handle the stop failure mode in stop-post state too a new
final-watchdog state needs to be introduced.
Let's allow "-0" as alternative to "+0" and "0" when parsing integers,
unless the new SAFE_ATO_REFUSE_PLUS_MINUS flag is specified.
In cases where allowing the +/- syntax shall not be allowed
SAFE_ATO_REFUSE_PLUS_MINUS is the right flag to use, but this also means
that -0 as only negative integer that fits into an unsigned value should
be acceptable if the flag is not specified.
Quoting https://github.com/systemd/systemd/issues/14828#issuecomment-635212615:
> [kernel uses] msleep_interruptible() and that means when the process receives
> any kind of signal masked or not this will abort with EINTR. systemd-logind
> gets signals from the TTY layer all the time though.
> Here's what might be happening: while logind reads the EFI stuff it gets a
> series of signals from the TTY layer, which causes the read() to be aborted
> with EINTR, which means logind will wait 50ms and retry. Which will be
> aborted again, and so on, until quite some time passed. If we'd not wait for
> the 50ms otoh we wouldn't wait so long, as then on each signal we'd
> immediately retry again.
We would parse numbers with base prefixes as user identifiers. For example,
"0x2b3bfa0" would be interpreted as UID==45334432 and "01750" would be
interpreted as UID==1000. This parsing was used also in cases where either a
user/group name or number may be specified. This means that names like
0x2b3bfa0 would be ambiguous: they are a valid user name according to our
documented relaxed rules, but they would also be parsed as numeric uids.
This behaviour is definitely not expected by users, since tools generally only
accept decimal numbers (e.g. id, getent passwd), while other tools only accept
user names and thus will interpret such strings as user names without even
attempting to convert them to numbers (su, ssh). So let's follow suit and only
accept numbers in decimal notation. Effectively this means that we will reject
such strings as a username/uid/groupname/gid where strict mode is used, and try
to look up a user/group with such a name in relaxed mode.
Since the function changed is fairly low-level and fairly widely used, this
affects multiple tools: loginctl show-user/enable-linger/disable-linger foo',
the third argument in sysusers.d, fourth and fifth arguments in tmpfiles.d,
etc.
Fixes#15985.
"internal" is a lot of characters. Let's take a leaf out of the Python's book
and simply use _ to mean private. Much less verbose, but the meaning is just as
clear, or even more.
This is a safey net anyway, let's make it fully safe: if the data ends
on an uneven byte, then we need to complete the UTF-16 codepoint first,
before adding the final NUL byte pair. Hence let's suffix with three
NULs, instead of just two.
Tracking down #15931 confused the hell out of me, since running homed in
gdb from the command line worked fine, but doing so as a service failed.
Let's make this more debuggable and check if we live in the host netns
when allocating a new udev monitor.
This is just debug stuff, so that if things don't work, a quick debug
run will reveal what is going on.
That said, while we are at it, also fix unexpected closing of passed in
fd when failing.
Possibly fixes#15220. (There might be another leak. I'm still investigating.)
The leak would occur when the path cache was rebuilt. So in normal circumstances
it wouldn't be too bad, since usually the path cache is not rebuilt too often. But
the case in #15220, where new unit files are created in a loop and started, the leak
occurs once for each unit file:
$ for i in {1..300}; do cp ~/.config/systemd/user/test0001.service ~/.config/systemd/user/test$(printf %04d $i).service; systemctl --user start test$(printf %04d $i).service;done