RootDirectory= but via a open_tree() file descriptor. This allows
setting up the execution environment for a service by the client in
a mount namespace and then starting a transient unit in that execution
environment using the new property.
We also add --root-directory= and --same-root-dir= to systemd-run to
have it run services within the given root directory. As systemd-run
might be invoked from a different mount namespace than what systemd is
running in, systemd-run opens the given path with open_tree() and then
sends it to systemd using the new RootDirectoryFileDescriptor= property.
Currently there are various circular dependencies between headers
in core/. Let's get rid of these by making judicious use of forward
declarations and moving includes into implementation files instead of
having them in header files.
Getting rid of circular header includes simplifies the code and makes
various clang based tooling such as iwyu work much better on our code.
The most important change is getting rid of the manager.h include in
unit.h which is possible thanks to the previous commits. We also move
the OOMPolicy and StatusType enums to unit.h to remove the need for
other unit headers to include manager.h to get access to these enums.
Follow-up for 3543456f84
I don't think list is particularly useful here. The passed fds are
constant for the lifetime of service, and with this commit we track
the number of extra fds in a dedicated var anyway.
This adds the ExtraFileDescriptor property to StartTransient dbus API
with format "a(hs)" - array of (file descriptor, name) pairs. The FD
will be passed to the unit via sd_notify like Socket and OpenFile.
systemctl show also shows ExtraFileDescriptorName for these transient
units. We only show the name passed to dbus as the FD numbers will
change once passed over the unix socket and are duplicated, so its
confusing to display the numbers.
We do not add this functionality for systemd-run or general systemd
service units as it is not useful for general systemd services.
Arguably, it could be useful for systemd-run in bash scripts but we
prefer to be cautious and not expose the API yet.
Fixes: #34396
These operations might require slow I/O, and thus might block PID1's main
loop for an undeterminated amount of time. Instead of performing them
inline, fork a worker process and stash away the D-Bus message, and reply
once we get a SIGCHILD indicating they have completed. That way we don't
break compatibility and callers can continue to rely on the fact that when
they get the method reply the operation either succeeded or failed.
To keep backward compatibility, unlike reload control processes, these
are ran inside init.scope and not the target cgroup. Unlike ExecReload,
this is under our control and is not defined by the unit. This is necessary
because previously the operation also wasn't ran from the target cgroup,
so suddenly forking a copy-on-write copy of pid1 into the target cgroup
will make memory usage spike, and if there is a MemoryMax= or MemoryHigh=
set and the cgroup is already close to the limit, it will cause an OOM
kill, where previously it would have worked fine.
One of the major pait points of managing fleets of headless nodes is
that when something fails at startup, unless debug level was already
enabled (which usually isn't, as it's a firehose), one needs to manually
enable it and pray the issue can be reproduced, which often is really
hard and time consuming, just to get extra info. Usually the extra log
messages are enough to triage an issue.
This new option makes it so that when a service fails and is restarted
due to Restart=, log level for that unit is set to debug, so that all
setup code in pid1 and sd-executor logs at debug level, and also a new
DEBUG_INVOCATION=1 env var is passed to the service itself, so that it
knows it should start with a higher log level. Once the unit succeeds
or reaches the rate limit the original level is restored.
Now that we track auto-restarts with a dedicated state,
there's no need for a separate variable for this.
I also took the chance to reorder some struct members.
Follow-up for fc67a943d9
After the mentioned comment, we no longer need to record
the owner to restore the previous bus owner state.
Therefore, bus_name_owner is effectively unused. Kill it.
This refactors the Unit structure a bit: all cgroup-related state fields
are moved to a new structure CGroupRuntime, which is only allocated as
we realize a cgroup.
This is both a nice cleanup and should make unit structures considerably
smaller that have no cgroup associated, because never realized or
because they belong to a unit type that doesn#t have cgroups anyway.
This makes things nicely symmetric:
ExecContext → static user configuration about execution
ExecRuntime → dynamic user state of execution
CGroupContext → static user configuration about cgroups
CGroupRuntime → dynamic user state of cgroups
And each time the XyzContext is part of the unit type structures such as
Service or Slice that need it, but the runtime object is only allocated
when a unit is started.
We cannot determine the SELinux label ahead of time if RootImage= is
used, since we'd have to mount the image then, hence don't, and handle
this cleanly, and gracefully.
While we are at it, stop "reaching over" so much from the socket code to
the service code, and instead provide function that most of the hard
work in service.c that socket.c just calls.
While we are at it, add debug logging and stuff.
I noticed the issue when also noticing #30560, but that one is harder to
fix, hence I avoided it for now.
The first conversion to PidRef. It's mostly an excercise of
search/replace, but with some special care taken for life-cycle (i.e. we
need to destroy the structure properly once done, to release the pidfd).
It also uses pidfd based killing for some of the killing but leaves most
as it is to make the conversion minimal.
When this option is set to direct, the service restarts without entering a failed
state. Dependent units are not notified of transitory failure.
This is useful for the following use case:
We have a target with Requires=my-service, After=my-service.
my-service.service is a oneshot service and has Restart=on-failure in
its definition.
my-service.service can get stuck for various reasons and time out, in
which case it is restarted. Currently, when it fails the first time, the
target fails, even though my-service is restarted.
The behavior we're looking for is that until my-service is not restarted
anymore, the target stays pending waiting for my-service.service to
start successfully or fail without being restarted anymore.
The announced new behavior for OnFailure= never worked properly,
and we've fixed the document instead in #27675.
Therefore, let's get rid of the unused logic completely. More at #27594.
The to-be-added RestartMode= option should cover the use case hopefully.
Closes#27594
Just to match service_release_stdio_fd() and service_release_fd_store()
in the name, since they do similar things.
This follows the concept that we "release" resources, and this is all
generically wrapped in "service_release_resources()".
Oftentimes it is useful to allow the per-service fd store to survive
longer than for a restart. This is useful in various scenarios:
1. An fd to some security relevant object needs to be stashed somewhere,
that should not be cleaned automatically, because the security
enforcement would be dropped then.
2. A user namespace fd should be allocated on first invocation and be
kept around until the user logs out (i.e. systemd --user ends), á la
#16328 (This does not implement what #16318 asks for, but should
solve the use-case discussed there.)
3. There's interest in allow a concept of "userspace reboots" where the
kernel stays running, and userspace is swapped out (i.e. all services
exit, and the rootfs transitioned into a new version of it) while
keeping some select resources pinned, very similar to how we
implement a switch root. Thus it is useful to allow services to exit,
while leaving their fds around till the very end.
This is exposed through a new FileDescriptorStorePreserve= setting that
is closely modelled after RuntimeDirectoryPreserve= (in fact it reused
the same internal type), since we want similar behaviour in the end, and
quite often they probably want to be used together.
Follow-up for #26902 and #26971
Let's always calculate the next restart interval
since that's more useful.
For that, we add 1 to s->n_restarts unconditionally,
and change RestartUSecCurrent property to RestartUSecNext.
When a service deactivates and is then automatically restarted via
Restart= we currently quickly transition through
SERVICE_DEAD/SERVICE_FAILED. Which is weird given it's not the
normal ("permanent") dead/failed state, but a transitory one we
immediately leave from again. We do this so that software that looks for
failures/successes can take notice, even if we restart as a consequence
of the deactivation.
Let's clean this up a bit: let's introduce two new states:
SERVICE_DEAD_BEFORE_AUTO_RESTART and SERVICE_FAILED_BEFORE_AUTO_RESTART
that are used for the transitory states. Both the SERVICE_DEAD and
SERVICE_DEAD_BEFORE_AUTO_RESTART will map to the high-level
UNIT_INACTIVE state though. (and similar for the respective failed
states). This means the high-level state machine won't change by this,
only the low-level one.
This clearly seperates the substates, which makes the state engine
cleaner, and allows clients to follow precisely whether we are in a
transitory dead/failed state, or a permanent one, by looking at the
service substate. Moreover it allows us to remove the 'n_keep_fd_store'
which so far we used to ensure the fdstore was not released during this
transitory dead/failed state but only during the permanent one. Since we
can now distinguish these states properly we can just use that.
This has been bugging me for a while. Let's clean this up.
Note that the unit restart logic is already nicely covered in the
testsiute, hence this adds no new tests for that.
And yes, this could be considered a compat break, but sofar we took the
liberty to make changes to the low-level state machine (i.e. SERVICE_xyz
states, sometimes called "substates") without considering this a bad
breakage – the high-level state machine (i.e. UNIT_xyz states) should
be considered API that cannot be changed.
Currently, exec runtimes can be shared between units (using
JoinsNamespaceOf=). Let's introduce a concept of a private exec
runtime that isn't shared with JoinsNamespaceOf=. The existing
ExecRuntime struct is renamed to ExecRuntimeShared and becomes a
private member of the new private ExecRuntime.
interval between restarts
RestartSteps= accepts a positive integer as the number of steps
to take to increase the interval between auto-restarts from
RestartSec= to RestartSecMax=, or 0 to disable it.
Closes#6129
To notify user of kill events from systemd-oomd we now use
`SERVICE_FAILURE_OOM_KILL` as the failure result.
`unit_check_oomd_kill` now calls `notify_cgroup_oom` to
update the service result to `oom-kill`.
We add a new xattr `user.oomd_ooms` to keep track of the OOM kills
initiated by systemd-oomd, this helps us resolve a race between sending
SIGKILL to processes and checking for OOM kill status from the xattr.
Related to: #20649
A first step of removing blocking calls to the D-Bus broker from PID 1.
There's a lot more to got (i.e. grep src/core/ for sd_bus_creds
basically), but it's a start.
Removing blocking calls to D-Bus broker deals systematicallly with
deadlocks caused by dbus-daemon blocking on synchronous IPC calls back
to PID1 (e.g. Varlink calls through nss-systemd). Bugs such as #15316.
Also-see: https://github.com/systemd/systemd/pull/22038#issuecomment-1042958390
Per-connection socket instances we currently maintain three fields
related to the socket: a reference to the Socket unit, the connection fd,
and a reference to the SocketPeer object that counts socket peers.
Let's synchronize their lifetime, i.e. always set them all three
together or unset them together, so that their reference counters stay
synchronous.
THis will in particuar ensure that we'll drop the SocketPeer reference
whenever we leave an active state of the service unit, i.e. at the same
time we close the fd for it.
Fixes: #20685
This introduces `ExitType=main|cgroup` for services.
Similar to how `Type` specifies the launch of a service, `ExitType` is
concerned with how systemd determines that a service exited.
- If set to `main` (the current behavior), the service manager will consider
the unit stopped when the main process exits.
- The `cgroup` exit type is meant for applications whose forking model is not
known ahead of time and which might not have a specific main process.
The service will stay running as long as at least one process in the cgroup
is running. This is intended for transient or automatically generated
services, such as graphical applications inside of a desktop environment.
Motivation for this is #16805. The original PR (#18782) was reverted (#20073)
after realizing that the exit status of "the last process in the cgroup" can't
reliably be known (#19385)
This version instead uses the main process exit status if there is one and just
listens to the cgroup empty event otherwise.
The advantages of a service with `ExitType=cgroup` over scopes are:
- Integrated logging / stdout redirection
- Avoids the race / synchronisation issue between launch and scope creation
- More extensive use of drop-ins and thus distro-level configuration:
by moving from scopes to services we can have drop ins that will affect
properties that can only be set during service creation,
like `OOMPolicy` and security-related properties
- It makes systemd-xdg-autostart-generator usable by fixing [1], as obviously
only services can be used in the generator, not scopes.
[1] https://bugs.kde.org/show_bug.cgi?id=433299
This reverts commit cb0e818f7c.
After this was merged, some design and implementation issues were discovered,
see the discussion in #18782 and #19385. They certainly can be fixed, but so
far nobody has stepped up, and we're nearing a release. Hopefully, this feature
can be merged again after a rework.
Fixes#19345.
We should test both serialization and deserialization works properly.
But the serialization/deserialization code is deeply entwined with the
manager state, and I think quite a bit of refactoring will be required before
this is possible. But let's at least add this simple test for now.
As suggested in https://github.com/systemd/systemd/pull/11484#issuecomment-775288617.
This does not touch anything exposed in src/systemd. Changing the defines there
would be a compatibility break.
Note that tests are broken after this commit. They will be fixed in the next one.
The usual behaviour when a timeout expires is to terminate/kill the
service. This is what user usually want in production systems. To debug
services that fail to start/stop (especially sporadic failures) it
might be necessary to trigger the watchdog machinery and write core
dumps, though. Likewise, it is usually just a waste of time to
gracefully stop a stuck service. Instead it might save time to go
directly into kill mode.
This commit adds two new options to services: TimeoutStartFailureMode=
and TimeoutStopFailureMode=. Both take the same values and tweak the
behavior of systemd when a start/stop timeout expires:
* 'terminate': is the default behaviour as it has always been,
* 'abort': triggers the watchdog machinery and will send SIGABRT
(unless WatchdogSignal was changed) and
* 'kill' will directly send SIGKILL.
To handle the stop failure mode in stop-post state too a new
final-watchdog state needs to be introduced.
Suppose a service has WatchdogSec set to 2 seconds in its unit file. I
then start the service and WatchdogUSec is set correctly:
% systemctl --user show psi-notify -p WatchdogUSec
WatchdogUSec=2s
Now I call `sd_notify(0, "WATCHDOG_USEC=10000000")`. The new timer seems
to have taken effect, since I only send `WATCHDOG=1` every 4 seconds,
and systemd isn't triggering the watchdog handler. However, `systemctl
show` still shows WatchdogUSec as 2s:
% systemctl --user show psi-notify -p WatchdogUSec
WatchdogUSec=2s
This seems surprising, since this "original" watchdog timer isn't the
one taking effect any more. This patch makes it so that we instead
display the new watchdog timer after sd_notify(WATCHDOG_USEC):
% systemctl --user show psi-notify -p WatchdogUSec
WatchdogUSec=10s
Fixes#15726.
The implementation is pretty straight-foward: when we get a request to
clean some type of resources we fork off a process doing that, and while
it is running we are in the "cleaning" state.