Commit Graph

9638 Commits

Author SHA1 Message Date
Yu Watanabe
3941032c6c journald-audit: do not control kernel auditing by non-default namespace instances by default
The kernel (thus system-wide) auditing should not be controlled by
non-default namespace instances, unless explicitly requested.
2025-07-18 15:27:03 +09:00
ZIHCO
ad6e02e7b4 systemd-analyze: added the verb unit-gdb to spawn and attach gdb 2025-07-17 15:09:58 +01:00
Luca Boccassi
6235121abf netdev-util: allow setting local address based on dhcp-pd addresses as well (#38211)
This extends the functionality introduced in #21648 to allow using
addresses chosen from a delegated prefix as well as the existing
SLAAC/LL/DHCP functionality.
2025-07-17 14:14:49 +01:00
Linus Heckemann
94e5d8b0e0 netdev-util: allow finding addresses from dhcp-pd 2025-07-16 16:17:19 +02:00
Yu Watanabe
dba4fe9a60 quotacheck: add quotacheck.mode credential support 2025-07-16 05:47:38 +09:00
Yu Watanabe
59a6ae4e16 man: fix service names 2025-07-16 05:47:38 +09:00
Yu Watanabe
fff4dcc6de man: fix reference to systemd-quotacheck@.service
Also this makes the man page mentions systemd-quotacheck-root.service.
2025-07-16 05:47:38 +09:00
Yu Watanabe
059afcadfd fsck: add fsck.mode and fsck.repair credentials support
Maybe useful when kernel command line is hard to change, e.g. when UKI
is used.
2025-07-16 05:47:38 +09:00
Luca Boccassi
7ebbe57ece Kill several SysV compat functionalities (v258) (#38178) 2025-07-15 01:21:13 +01:00
Luca Boccassi
6eab4cd44c boot: add LoaderTpm2ActivePcrBanks runtime variable
It turns out checking sysfs is not 100% reliable to figure out whether
the firmware had TPM2 support enabled or not. For example with EDK2 arm64, the
default upstream build config bundles TPM2 support with SecureBoot support,
so if the latter is disabled, TPM2 is also unavailable. But still, the ACPI
TPM2 table is created just as if it was enabled. So /sys/firmware/acpi/tables/TPM2
exists and looks correct, but there are no measurements, neither the firmware
nor the loader/stub can do them, and /sys/kernel/security/tpm0/binary_bios_measurements
does not exist.

The loader can use the apposite UEFI protocol to check, which is a more
definitive answer. Given userspace can also make use of this information, export
the bitmask with the list of active banks as-is. If it's not 0, then we can be
sure a working TPM2 was available in EFI mode.

Partially fixes https://github.com/systemd/systemd/issues/38071
2025-07-14 20:56:22 +01:00
DaanDeMeyer
852de7ed70 nspawn: Prepare --bind-user= logic for reuse in systemd-vmspawn
Aside from the usual boilerplate of moving the shared logic to shared/,
we also rework the implementation of --bind-user= to be similar to what
we'll do in systemd-vmspawn. Instead of messing with the nspawn container
user namespace, we use idmapped mounts to map the user's home directory on
the host to the mapped uid in the container.

Ideally we'd also use the "userdb.transient" credentials to provision the
user records, but this would only work for booted containers, whereas the
current logic works for non-booted containers as well.

Aside from being similar to how we'll implement --bind-user= in vmspawn,
using idmapped mounts also allows supporting --bind-user= without having to
use --private-users=.
2025-07-14 16:25:22 +02:00
Yu Watanabe
e58ba80a40 units: drop runlevel[0-6].target 2025-07-13 05:49:09 +09:00
Yu Watanabe
dc1505555b utmp: drop setting runlevel entry in utmp
This removes systemd-update-utmp-runlevel.service and related command.
2025-07-13 05:49:00 +09:00
Yu Watanabe
8ba48d4bf8 core,initctl,systemctl: kill /dev/initctl support
This also kills support for controlling system state through
/sbin/init, initctl, and telinit.
2025-07-13 05:38:14 +09:00
Yu Watanabe
af925f7eb3 systemctl: kill SysV compat 'runlevel' command 2025-07-13 05:38:13 +09:00
Lennart Poettering
b2f23bd2b1 Support global sysext/confext in systemd-stub/systemd-sysext (#38113)
Systemd-stub supports loading addons, credentials, system and
configuration
extensions from ESP and while addons and credentials can be both global
and
per-UKI, sysext/confext are only per-UKI. 

Add support for global sysext/confext to systemd-stub/systemd-sysext.

Fixes #37993
2025-07-11 21:10:51 +02:00
Lennart Poettering
aac7e892e4 machined: make registration of unpriv user's VMs/containers work (#37855)
This adds missing glue to reasonably allow unpriv users VMs/containers
to register with the system machined.

This primarily adds two things:

1. machined can now properly track VMs/containers residing in subcgroups
of units, because that's effectively what happens for per-user
VMs/containers: they are placed below the system unit `user@….service`
in some user unit.

2. machines registered with machined now have an owning UID: users can
operate on their own machines withour re-authentication, but not on
others.

Note that this is only a first step regarding machined's hookup of
nspawn/vmspawn in the long run for unpriv operation.

I think eventually we should make it so that there's both a per-user and
a per-system machined instance (so far, and even with this PR there's
still one per-system instance), and per-user containers/VMs would
registering with *both*. Having two instances makes sense I think,
because it would mean we can make machined reasonably manage the
per-user image discovery, and also do the per-system network/hostname
handling.
2025-07-11 21:10:08 +02:00
Lennart Poettering
f820b27565 vmspawn: introduce --notify-ready= switch
This mimics the switch of the same name from nspawn: it controls whether
we expect a READY=1 message from the payload or not. Previously we'd
always expect that. This makes it configurable, just like it is in
nspawn.

There's one fundamental difference in behaviour though: in nspawn it
defaults to off, in vmspawn it defaults to on. (for historical reasons,
ideally we'd default to on in both cases, but changing is quite a compat
break both directly and indirectly: since timeouts might get triggered).
2025-07-11 18:17:04 +02:00
Lennart Poettering
0fc45c8d20 vmspawn: substantially beef up cgroup logic, to match more closely what nspawn does
This beefs up the cgroup logic, adding --slice=, --property= to vmspawn
the same way it already exists in nspawn.

There are a bunch of differences though: we don't delegate the cgroup
access in the allocated unit (since qemu wouldn't need that), and we do
registration via varlink not dbus. Hence, while this follows a similar
logic now, it differs in a lot of details.

This makes in particular one change: when invoked on the command line
we'll only add the qemu instance to the allocated scope, not the vmspawn
process itself (this follows more closely how nspawn does this where
only the container payload has its scope, not nspawn itself). This is
quite tricky to implement: unlike in nspawn we have auxiliary services
to start, with depencies to the scope. This means we need to start the
scope early, so that we know the scope's name. But the command line to
invoke is only assembled from the data we learn about the auxiliary
services, hence much later. To addres we'll now fork off the child that
eventually will become early, then move it to a scope, prepare the
cmdline and then very late send the cmdline (and the fds we want to
pass) to the prepared child, which then execs it.
2025-07-11 18:17:04 +02:00
Lennart Poettering
97754cd14d machined: also track 'supervisor' process of a machine
So far, machined strictly tracked the "leader" process of a machine,
i.e. the topmost process that is actually the payload of the machine.
Its runtime also defines the runtime of the machine, and we can directly
interact with it if we need to, for example for containers to join the
namespaces, or kill it.

Let's optionally also track the "supervisor" process of a machine, i.e.
the host process that manages the payload if there is one. This is
generally useful info, but in particular is useful because we might need
to communicate with it to shutdown a machine without cooperation of the
payload. Traditionally we did this by simply stopping the unit of the
machine, but this is not doable now that the host machined can be used
to track per-user machines.

In the long run we probably want a more bespoke protocol between
machined and supervisors (so that we can execute other commands too,
such as request cooperative reboots/shutdowns), but that's for later.

Some environments call the concept "monitor" rather than "supervisor" or
use some other term. I stuck to "supervisor" because nspawn uses this,
and ultimately one name is as good as another.

And of course, in other implementations of VM managers of containers
there might not be a single process tracking each VM/container. Because
of this, the concept of a supervisor is optional.
2025-07-11 18:15:12 +02:00
Lennart Poettering
276d200186 machined: track UID owner of machines
Now that unpriv clients can register machines, let's register their UID
too. This allows us to do two things:

1. make sure the scope delegation is assigned to the right UID (so that
   the unpriv user can actually create cgroups below the delegated
   scope)

2. permit certain types of access (i.e. killing, or pty access) to the
   client without auth if it owns the machine.
2025-07-11 18:15:12 +02:00
Lennart Poettering
d5feeb373c machined: optionally track machines in cgroup subgroups 2025-07-11 18:15:12 +02:00
Yu Watanabe
fabcb1eb06 man: fix version info tag
Follow-up for 63770fa1d3.
2025-07-11 14:33:25 +02:00
Vitaly Kuznetsov
9f7e3820e9 stub: Support global sysext/confext
Systemd-stub support loading addons, credentials, system and configuration
extensions from ESP and while addons and credentials can be both global and
per-UKI, sysext/confext are only per-UKI.

Add support for loading ESP/loader/credentials/*.{sysext,confext}.raw to
systemd-stub.

Note: for backwards compatibility reasons, per-UKI sysexts can also be
*.raw (not only *.sysext.raw) but as global extensions are new, there's
no need to bring this legacy there.
2025-07-11 13:08:15 +02:00
Yu Watanabe
cc01ee7871 kernel-install: several follow-ups for --entry-type= (#38160)
Follow-ups for b6d4997683 (#37897).
2025-07-11 20:07:19 +09:00
Zbigniew Jędrzejewski-Szmek
63770fa1d3 systemd-run: add --no-pager, use pager for --help 2025-07-11 19:01:42 +09:00
Yu Watanabe
a87b6c2c5a man/kernel-install: mention --entry-type= option in the man page
Follow-up for b6d4997683.
2025-07-11 17:32:04 +09:00
Yu Watanabe
4d7851380a Cleanups for missing_xyz.h headers (#37904)
Continuation of #37960.

The same concern as expalined in #37960 exists also in
missing_syscall.h. If we use enough new glibc, a function we want to use
may be already provided by glibc, but our baseline glibc may not. And it
is hard to detect in our daily development.

This moves all prototypes of syscalls to relevant headers, and missing
syscall functions are defined in relevant .c files of libc wrapper. This
way, we can use usual header as is, e.g. when we want to write code with
`move_mount()`, we can simply use sys/mount.h without checking if it is
supported by our baseline glibc.
2025-07-11 15:20:10 +09:00
Yu Watanabe
369f311686 man: fix typo
Follow-up for 7aefb194e7.
2025-07-11 14:11:04 +09:00
Yu Watanabe
2b912d2066 tree-wide: several cleanups for generating symbol lists and gperf files
- pass our system include directories to make generators use our libc
  wrappers and latest kernel headers,
- include relevant headers in generated gperf file,
- use files() rather than find_program(), as the result of
  find_program() cannot be passed to 'input' of custom_target(),
- move generate-bpf-delegate-configs.py to src/core/, as it is only used
  by libcore.
2025-07-11 13:05:42 +09:00
Yu Watanabe
1a60b97524 include: move libc header wrappers to src/include/override/, and kernel headers to src/include/uapi/
Preparation for later changes.
2025-07-11 12:44:26 +09:00
Matteo Croce
7aefb194e7 man/systemd.exec: explain how BPF token works
Add a small paragraph explaining how BPF token works, how it's being
created and its relationship between the BPF filesystem.
Move all the relevant documentation in the PrivateBPF= section and let
point all the BPFDelegate* options to that one.
2025-07-10 21:40:07 +02:00
Ubuntu
df5b3426f6 journald: support reloading configuration at runtime 2025-07-10 21:38:36 +02:00
DaanDeMeyer
cc43510a13 userdb: Add userdb.transient credentials
To implement --bind-user in systemd-vmspawn, we need a transient
version of these credentials. These are useful when the home directory
of the user is mounted into the container/vm and every trace of the user
will be (mostly) gone again when the container/vm is shut down.
2025-07-10 21:36:09 +02:00
Christian Hesse
8dfe176adc man: clean up list of literals 2025-07-10 15:23:56 +09:00
Yu Watanabe
f436c64e61 man: fix typo
Follow-up for 7baf403430.
2025-07-10 14:02:00 +09:00
Lennart Poettering
03b4a607f6 core: followups for the recent subgroup killing commits
This is a follow-up for 0f23564ad4 and
6b02854f50, as suggested here:

https://github.com/systemd/systemd/pull/37855#pullrequestreview-2997596953
2025-07-10 13:32:51 +09:00
Yu Watanabe
1cf5b39d64 core: add 'DefaultRestrictSUIDSGID' config option (#38126)
closes #37602, see there for extra motivation and considered
alternatives.

On typical systems, only few services need to create SUID/SGID files.
This often is limited to the user explicitly setting suid/sgid, the
`systemd-tmpfiles*` services, and the package manager. Allowing a
default to globally restrict creation of suid/sgid files makes it easier
to apply this restriction precisely.

## testing done
- built on aarch64-linux and x86_64-linux
- ran a VM test on x86_64-linux, checking for:
    - VM system boots successfully
    - defaults apply (both `yes`, `no`, and undefined)
    - systemd tmpfiles can set suid/sgid on journal log path
- Other services explicitly defining `RestrictSUIDSGID=no` can create
suid files
2025-07-10 13:30:07 +09:00
Matteo Croce
7baf403430 man/systemd.exec: update documentation for PrivateBPF=
Add a short description about what PrivateBPF=yes does
and how it can be useful.
2025-07-10 01:57:14 +02:00
Grimmauld
0316fb8219 core: document 'DefaultRestrictSUIDSGID' 2025-07-09 21:45:46 +02:00
Grimmauld
97998d1cbe core/dbus-manager: Support 'DefaultRestrictSUIDSGID' option 2025-07-09 21:45:38 +02:00
Matteo Croce
ea9826eb94 core: add options to delegate BPFFS token creation
Add four new options BPFDelegate{Commands,Maps,Programs,Attachments}=
in order to delegate to a BPFFS instance the permission to create tokens.

The value is a list of options taken from:
https://github.com/torvalds/linux/blob/v6.14/include/uapi/linux/bpf.h#L922-L1121
The special value "any" means to allow every possible values.

More informations about BPF tokens here:
https://lwn.net/Articles/947173/
2025-07-08 22:35:29 +02:00
Matteo Croce
3a47437fc9 core: Introduce PrivateBPF= to mount a private BPFFS
Add a new option PrivateBPF= to mount a new instance of bpffs within a
namespace.
PrivateBPF= can be set to "no" to use the host bpffs in readonly mode
and "yes" to do a new mount.
The mount is done with the new fsopen()/fsmount() API because in future
we'll hook some commands between the two calls.
2025-07-08 22:33:28 +02:00
Yu Watanabe
293cc8866d man: mention relative PIDFile= in user service is prefixed with $XDG_RUNTIME_DIR 2025-07-08 18:02:38 +09:00
Lennart Poettering
6b02854f50 systemctl: add --kill-subgroup= switch for killing subcgroup 2025-07-08 03:14:53 +02:00
Lennart Poettering
0f23564ad4 pid1: add ability to kill processes in a subgroup of a unit
This is useful for things like machined, where the system machined wants
to manage a machine owned by the user somewhere down the tree.
2025-07-08 03:14:53 +02:00
Yu Watanabe
3ef791876b core: add quota support for State, Cache, and Log exec directories (#35892)
Based on https://github.com/systemd/systemd/issues/7820, this adds support for
quota enforcement to State, Cache, and Log exec directories.
* Add new directives, StateDirectoryQuota=, CacheDirectoryQuota=, and
  LogDirectoryQuota=, to define quotas as percentages (hard limits for
  blocks and inodes) or absolute values (hard limits for blocks only).
* Add new directives, StateDirectoryQuotaAccounting=,
  CacheDirectoryQuotaAccounting= and LogDirectoryQuotaAccounting= to keep
  track of storage quotas but not enforce them (effectively just assigning
  a project ID to defined exec directories).

Example:
```
StateDirectory=quotadir
StateDirectoryQuota=1%

Jan 06 22:55:46 abeltran: Storage quotas set for /var/lib/private/quotadir. Block limit = 2639404, inode limit = 671088

root@abeltran:/var/lib/private# lsattr -pR
3153000189 --------------e----P-- ./quotadir

root@abeltran:/var/lib/private# repquota  -P /datadrive
*** Report for project quotas on device /dev/sdc1
Block grace time: 7days; Inode grace time: 7days
                        Block limits                File limits
Project         used    soft    hard  grace    used  soft  hard  grace
----------------------------------------------------------------------
#0        --  213200       0       0           4086     0     0         
#3153000189 -- 2639404       0 2639404              2     0 671088   
```
2025-07-08 09:18:20 +09:00
Lennart Poettering
bb176bdb51 man: also use title case in systemd.service(5)
Follow-up for: 172dd81e92
2025-07-08 09:05:58 +09:00
Andres Beltran
26c6f3271a core: add quota support for State, Cache, and Log exec directories 2025-07-07 17:28:47 +00:00
Mike Yuan
24e67cea45 man/supported-controllers: refresh list 2025-07-07 17:54:38 +02:00