Before calling io.systemd.MachineImage.List.
The systemd-nspawn process takes a lock in the run() function in
nspawn.c and holds it for the entire runtime of that function. If we
call `machinectl terminate` the machine gets unregistered _before_ we
release the lock, so the original `machinectl status` check would return
early, allowing for a race where we call io.systemd.MachineImage.List
over Varlink when systemd-nspawn still holds the lock because the
process is still running:
```
[ 41.691826] TEST-13-NSPAWN.sh[1102]: + machinectl terminate long-running
[ 41.695009] systemd-nspawn[2171]: Trying to halt container by sending TERM to container PID 1. Send SIGTERM again to trigger immediate termination.
[ 41.698235] systemd-machined[1192]: Machine long-running terminated.
[ 41.709520] TEST-13-NSPAWN.sh[1102]: + systemctl kill --signal=KILL systemd-nspawn@long-running.service
[ 41.709169] systemd-nspawn[2171]: Failed to unregister machine: No machine 'long-running' known
[ 41.720869] TEST-13-NSPAWN.sh[2346]: + varlinkctl --more call /run/systemd/machine/io.systemd.MachineImage io.systemd.MachineImage.List '{}'
[ 41.723359] TEST-13-NSPAWN.sh[2347]: + grep long-running
...
[ 41.735453] TEST-13-NSPAWN.sh[2352]: + varlinkctl call /run/systemd/machine/io.systemd.MachineImage io.systemd.MachineImage.List '{"name":"long-running", "acquireMetadata": "yes"}'
[ 41.736222] TEST-13-NSPAWN.sh[2353]: + grep OSRelease
[ 41.739500] TEST-13-NSPAWN.sh[2352]: Method call io.systemd.MachineImage.List() failed: Device or resource busy
[ 41.740641] systemd[1]: Received SIGCHLD.
[ 41.740670] systemd[1]: Child 2171 (systemd-nspawn) died (code=killed, status=9/KILL)
[ 41.740725] systemd[1]: systemd-nspawn@long-running.service: Child 2171 belongs to systemd-nspawn@long-running.service.
[ 41.740748] systemd[1]: systemd-nspawn@long-running.service: Main process exited, code=killed, status=9/KILL
[ 41.740755] systemd[1]: systemd-nspawn@long-running.service: Will spawn child (service_enter_stop_post): systemd-nspawn
[ 41.740872] systemd[1]: systemd-nspawn@long-running.service: About to execute: systemd-nspawn --cleanup --machine=long-running
...
```
Let's mitigate this by waiting until the corresponding
systemd-nspawn@.service instance enters the 'inactive' state where the
lock should be properly released.
Resolves: https://github.com/systemd/systemd/issues/39547
It should exit on its own anyway and this will work even if the job has
already finished* (unlike kill).
[*] assuming job control is off, as is the case when running the
test suite
Resolves: #39543
Commit 38748596f0 ("core: Make DelegateNamespaces= work for user
managers with CAP_SYS_ADMIN") refactored the logic for when an
unprivileged process should create a new user namespace for sandboxing.
This refactor inadvertently removed a check (`params->runtime_scope !=
RUNTIME_SCOPE_USER`) that differentiated between system services and user
services.
This causes a regression in rootless containers where systemd runs
unprivileged. When starting a system service (like `dbus-broker`) that
uses sandboxing features (e.g. with `PrivateTmp=yes`), systemd now
incorrectly creates a new, minimal `PRIVATE_USERS_SELF` namespace.
This new namespace only maps UID/GID 0. When dbus-broker attempts to
drop privileges to the `dbus` user (GID 81), the `setresgid(81, 81, 81)`
call fails because GID 81 is not mapped.
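To illustrate why the call fails, here is a minimal sketch that checks
whether an ID is covered by a map in the "inside outside count" format
of /proc/<pid>/gid_map. The helper name `id_is_mapped()` and the sample
map contents are hypothetical and not part of systemd:

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns true if 'id' (as seen inside the user
 * namespace) falls into one of the ranges of a uid_map/gid_map style
 * text, where every line reads "inside outside count". */
static bool id_is_mapped(const char *map, unsigned long id) {
        const char *p = map;

        while (*p) {
                unsigned long inside, outside, count;

                if (sscanf(p, "%lu %lu %lu", &inside, &outside, &count) == 3 &&
                    id >= inside && id - inside < count)
                        return true;

                /* Advance to the next line, if there is one. */
                const char *nl = strchr(p, '\n');
                if (!nl)
                        break;
                p = nl + 1;
        }

        return false;
}
```

With a single-entry map such as "0 1000 1" (the outside ID 1000 is an
example value), ID 0 is covered but ID 81 is not, so an attempt to switch
to GID 81 inside that namespace cannot succeed.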
Restore the check to ensure that the special unprivileged sandboxing
logic is only applied to user services, as was the original intent.
System services in a rootless context will now correctly run in the
container's main user namespace, where all necessary UIDs/GIDs are
mapped.
Fixes: https://github.com/systemd/systemd/issues/39563
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2391343
There might be a delay between an umount and the disappearance of a
refcounted device, so the test can be flaky:
[ 36.107128] TEST-50-DISSECT.sh[1662]: ++ dmsetup ls
[ 36.108314] TEST-50-DISSECT.sh[1663]: ++ grep loop
[ 36.109283] TEST-50-DISSECT.sh[1664]: ++ grep -c verity
[ 36.110284] TEST-50-DISSECT.sh[1360]: + test 1 -eq 1
[ 36.111555] TEST-50-DISSECT.sh[1360]: + umount -R /tmp/TEST-50-IMAGES.hxm/mount
[ 36.112237] TEST-50-DISSECT.sh[1668]: ++ dmsetup ls
[ 36.113039] TEST-50-DISSECT.sh[1669]: ++ grep loop
[ 36.113833] TEST-50-DISSECT.sh[1670]: ++ grep -c verity
[ 36.114517] TEST-50-DISSECT.sh[1360]: + test 0 -eq 1
[ 36.116734] TEST-50-DISSECT.sh[1000]: + echo 'Subtest /usr/lib/systemd/tests/testdata/units/TEST-50-DISSECT.dissect.sh failed'
https://github.com/systemd/systemd/actions/runs/19062162467/job/54444112653?pr=39540#logs
Switch to searching for the dm entry and check for it specifically,
and wait for it to disappear before checking that it is no longer
in the dm table.
Follow-up for 10fc43e504
Mixing the `--unit` and `--user-unit` options will result in error messages.
During the parsing phase, only the `arg_show_unit` value set by the last
occurrence of either option is kept, while all names are placed in the same
`arg_names` list, thus mixing the two types of units in the query.
For example, `-u foo --user-unit bar` will also treat `foo` as a user unit and
query it in the user service.
When parsing an absolute time specification like `hh:mm` for the
`shutdown` command, the code interprets a time in the past as "tomorrow
at this time". It currently implements this by adding a fixed 24-hour
duration (`USEC_PER_DAY`) to the timestamp.
This assumption breaks across DST transitions, as the day might not be
24 hours long. This can cause the shutdown to be scheduled at the wrong
time (typically off by one hour in either direction).
Change the logic to perform calendar arithmetic instead of timestamp
arithmetic. If the calculated time is in the past, we increment
`tm.tm_mday` and call `mktime_or_timegm_usec()` a second time.
This delegates all date normalization logic to `mktime()`, which
correctly handles all edge cases, including DST transitions, month-end
rollovers, and leap years.
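The calendar arithmetic can be sketched as follows. This is a minimal
illustration using plain `mktime()` and a hypothetical helper
`next_occurrence()`, not the actual systemd code, which goes through
`mktime_or_timegm_usec()`:

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Hypothetical helper: returns the next local-time occurrence of
 * hour:minute that is not before 'now', or (time_t) -1 on failure. */
static time_t next_occurrence(time_t now, int hour, int minute) {
        struct tm tm;
        time_t t;

        if (!localtime_r(&now, &tm))
                return (time_t) -1;

        tm.tm_hour = hour;
        tm.tm_min = minute;
        tm.tm_sec = 0;
        tm.tm_isdst = -1; /* let mktime() decide whether DST applies */

        t = mktime(&tm);
        if (t == (time_t) -1 || t >= now)
                return t;

        /* The time has already passed today: move to the next calendar
         * day instead of adding 24 hours. mktime() normalizes an
         * out-of-range tm_mday (e.g. January 32nd) by itself. */
        tm.tm_mday++;
        tm.tm_hour = hour; /* reset, in case mktime() adjusted the fields */
        tm.tm_min = minute;
        tm.tm_sec = 0;
        tm.tm_isdst = -1;

        return mktime(&tm);
}
```

Under central European rules, asking at 23:00 on the evening before the
spring transition for 22:30 yields a point 22.5 hours away, whereas adding
a fixed 24 hours to today's 22:30 would land 23.5 hours away, i.e. at 23:30
local time.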
Fixes: https://github.com/systemd/systemd/issues/39232
TEST-07-PID.user-namespace-path.sh is flaky as Type=simple is used
(implicitly). Explicitly use Type=exec instead to ensure the namespaces
are created before starting another service reusing the same namespaces.
Fixes #39546.
The extractor already deals with sparse files properly (because
archive_read_data_into_fd() does).
Let's make sure the archiver does this too, and attaches the
necessary sparse file metadata to each file.
Both sysext and confext used the host's /etc/initrd-release file even
when --root=/somewhere was specified. The SYSTEMD_IN_INITRD= env var
works around this, but without knowing about it the behavior was quite
confusing. Aside from users validating their extensions, this matters
mainly when the extensions are set up from the initrd: the
initrd-release file is present at that point, but the extensions are
being prepared for the final system and should thus be matched against
the scope of that system.
Make systemd-sysext check for /etc/initrd-release inside the given
--root= tree. An alternative would be to always ignore the
initrd-release check when --root= is passed but this way it is more
consistent. The image policy logic for EFI-loader-passed extensions
won't take effect when --root= is used, though.
The last sysext test leaked state into tests added after it, as
uncovered by the leftover check of any such new test.
Remove the mutable folder state through a trap as done in other tests.
In many cases we want to expose enums for which we have the usual
xyz_to_string()/xyz_from_string() via Varlink as enums. Let's add some
infra to test the tables against each other, to automatically detect
when they deviate.
In order to implement this properly, let's export/introduce clean
json_underscorefy()/json_dashify(), for dealing with the fact that our
enums usually use dash-separated names, but Varlink doesn't allow that.
(This does not add the test cases for all enum types we expose right
now, but only adds the general infra).
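A rough sketch of the idea, with hypothetical stand-ins
(`sketch_underscorefy()`, `sketch_dashify()`, `enum_tables_match()`)
rather than the actual helpers and test macros, whose signatures are not
spelled out here:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in: replace '-' with '_' in place. */
static char *sketch_underscorefy(char *s) {
        for (char *p = s; *p; p++)
                if (*p == '-')
                        *p = '_';
        return s;
}

/* Hypothetical stand-in: replace '_' with '-' in place. */
static char *sketch_dashify(char *s) {
        for (char *p = s; *p; p++)
                if (*p == '_')
                        *p = '-';
        return s;
}

static bool contains(const char *const *names, size_t n, const char *needle) {
        for (size_t i = 0; i < n; i++)
                if (strcmp(names[i], needle) == 0)
                        return true;
        return false;
}

/* Returns true if both tables have the same number of entries and every
 * string-table name, once underscorefied, shows up among the Varlink
 * symbols. Assumes the names within each table are unique. */
static bool enum_tables_match(const char *const *enum_names, size_t n_enum,
                              const char *const *varlink_names, size_t n_varlink) {
        if (n_enum != n_varlink)
                return false;

        for (size_t i = 0; i < n_enum; i++) {
                char buf[64];
                size_t len = strlen(enum_names[i]);

                if (len >= sizeof(buf))
                        return false; /* too long for this sketch */

                memcpy(buf, enum_names[i], len + 1);
                if (!contains(varlink_names, n_varlink, sketch_underscorefy(buf)))
                        return false;
        }

        return true;
}
```

A test would feed the names behind xyz_to_string() and the symbols of the
matching Varlink enum into such a check and fail as soon as one side gains
or renames a value without the other.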
When Type=notify-reload got introduced, it wasn't intended to be
mutually exclusive with ExecReload=. However, currently ExecReload=
is immediately forked off after the service main process is signaled,
leaving states in between essentially undefined. Given how broken
this is, I doubt any sane user relies on this setup, hence I took a stab
at reworking everything:
1. Extensions are refreshed (unchanged)
2. ExecReload= is forked off without signaling the process
3a. If RELOADING=1 is sent during the ExecReload= invocation,
we'd refrain from signaling the process again, instead
just transition to SERVICE_RELOAD_NOTIFY directly and
wait for READY=1
3b. If not, signal the process after ExecReload= finishes
(from now on the same as Type=notify-reload w/o ExecReload=)
4. To accommodate the use case of performing post-reload tasks,
ExecReloadPost= is introduced which executes after READY=1
The new model greatly simplifies things, as no control processes
will be around in SERVICE_RELOAD_SIGNAL and SERVICE_RELOAD_NOTIFY
states.
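The steps above can be modeled as a small state machine. This is only a
sketch: the `STEP_*` and `EV_*` names are hypothetical, and it assumes
that in case 3a the switch to the notify state happens once ExecReload=
has finished, which is inferred from the statement that no control
process is around in those states:

```c
#include <stdbool.h>

typedef enum {
        STEP_EXEC_RELOAD,   /* ExecReload= commands are running (step 2) */
        STEP_RELOAD_SIGNAL, /* process signaled, waiting for RELOADING=1 (3b) */
        STEP_RELOAD_NOTIFY, /* RELOADING=1 seen, waiting for READY=1 */
        STEP_RELOAD_POST,   /* ExecReloadPost= commands are running (step 4) */
        STEP_RUNNING,       /* reload finished */
} ReloadStep;

typedef enum {
        EV_RELOADING,             /* RELOADING=1 notification */
        EV_EXEC_RELOAD_DONE,      /* ExecReload= finished */
        EV_READY,                 /* READY=1 notification */
        EV_EXEC_RELOAD_POST_DONE, /* ExecReloadPost= finished */
} ReloadEvent;

typedef struct {
        ReloadStep step;
        bool reloading_seen; /* RELOADING=1 arrived during ExecReload= */
} Reload;

static void reload_dispatch(Reload *r, ReloadEvent ev) {
        switch (r->step) {
        case STEP_EXEC_RELOAD:
                if (ev == EV_RELOADING)
                        r->reloading_seen = true;
                else if (ev == EV_EXEC_RELOAD_DONE)
                        /* 3a: skip the signal if RELOADING=1 was already sent */
                        r->step = r->reloading_seen ? STEP_RELOAD_NOTIFY
                                                    : STEP_RELOAD_SIGNAL;
                break;
        case STEP_RELOAD_SIGNAL:
                if (ev == EV_RELOADING)
                        r->step = STEP_RELOAD_NOTIFY;
                break;
        case STEP_RELOAD_NOTIFY:
                if (ev == EV_READY)
                        r->step = STEP_RELOAD_POST;
                break;
        case STEP_RELOAD_POST:
                if (ev == EV_EXEC_RELOAD_POST_DONE)
                        r->step = STEP_RUNNING;
                break;
        case STEP_RUNNING:
                break;
        }
}
```

In this model at most one of ExecReload= and ExecReloadPost= is running at
any time, and neither overlaps with the signal and notify steps.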
See also: https://github.com/systemd/systemd/issues/37515#issuecomment-2891229652