Commit Graph

186 Commits

Author SHA1 Message Date
Lennart Poettering
de90700f36 pid1: set SYSTEMD_NSS_DYNAMIC_BYPASS=1 env var for dbus-daemon
There's currently a deadlock between PID 1 and dbus-daemon: in some
cases dbus-daemon will do NSS lookups (which are blocking) at the same
time PID 1 synchronously blocks on some call to dbus-daemon. Let's break
that by setting SYSTEMD_NSS_DYNAMIC_BYPASS=1 env var for dbus-daemon,
which will disable synchronously blocking varlink calls from nss-systemd
to PID 1.

In the long run we should fix this differently: remove all synchronous
calls to dbus-daemon from PID 1. This is not trivial however: so far we
had the rule that synchronous calls from PID 1 to the dbus broker are OK
as long as they only go to interfaces implemented by the broke itself
rather than services reachable through it. Given that the relationship
between PID 1 and dbus is kinda special anyway, this was considered
acceptable for the sake of simplicity, since we quite often need
metadata about bus peers from the broker, and the asynchronous logic
would substantially complicate even the simplest method handlers.

This mostly reworks the existing code that sets SYSTEMD_NSS_BYPASS_BUS=
(which is a similar hack to deal with deadlocks between nss-systemd and
dbus-daemon itself) to set SYSTEMD_NSS_DYNAMIC_BYPASS=1 instead. No code
was checking SYSTEMD_NSS_BYPASS_BUS= anymore anyway, and it used to
solve a similar problem, hence it's an obvious piece of code to rework
like this.

Issue originally tracked down by Lukas Märdian. This patch is inspired
and closely based on his patch:

       https://github.com/systemd/systemd/pull/22038

Fixes: #15316
Co-authored-by: Lukas Märdian <slyon@ubuntu.com>
2022-02-18 10:49:36 +01:00
Luca Boccassi
a07b992606 core: add ExtensionDirectories= setting
Add a new setting that follows the same principle and implementation
as ExtensionImages, but using directories as sources.
It will be used to implement support for extending portable images
with directories, since portable services can already use a directory
as root.
2022-01-21 22:53:12 +09:00
Peter Morrow
cdebedb4d4 service: pass service exit status to spawned On{Failure,Success}= dependency
When a service exits and triggers either an OnFailure= or OnSuccess=
dependency we now set a new environment variable for the ExecStart= and
ExecStartPre= process. This variable $MONITOR_METADATA exposes the
metadata relating to the service which triggered the dependency.
MONITOR_METADATA takes the following form:

MONITOR_METADATA="SERVICE_RESULT=<result-string0>,EXIT_CODE=<exit-code0>,EXIT_STATUS=<exit-status0>,INVOCATION_ID=<id>,UNIT=<triggering-unit0.service>;SERVICE_RESULT=<result-stringN>,EXIT_CODE=<exit-codeN>,=EXIT_STATUS=<exit-statusN>,INVOCATION_ID=<id>,UNIT=<triggering-unitN.service>"

MONITOR_METADATA is space separated set of metadata relating to the
service(s) which triggered the dependency. This is a list since if we
have 2 services which trigger the same dependency then the dependency
start job may be merged. In this case we need to pass both service
metadata to the triggered service. If there is no job merging then
MONITOR_METADATA will be a single entry.

For example, in the case we had a service "failer.service" which
triggers "failer-handler.service", the following variable is exported to
the ExecStart= and ExecStartPre= processes in failer-handler.service:

MONITOR_METADATA="SERVICE_RESULT=exit-code,EXIT_CODE=exited,EXIT_STATUS=1,INVOCATION_ID=67c657ed7b34466ea369abdf994c6393,UNIT=failer.service"

In another example where we have failer.service and failer2.service
which both also trigger failer-handler.service then the start job for
failer-handler.service may be merged and we might get the following:

MONITOR_METADATA="SERVICE_RESULT=exit-code,EXIT_CODE=exited,EXIT_STATUS=1,INVOCATION_ID=16a93ad196c94109990fb8b9aa5eef5f,UNIT=failer.service;SERVICE_RESULT=exit-code,EXIT_CODE=exited,EXIT_STATUS=1,INVOCATION_ID=ff70131e4cc145e994fb621de25a3e8f,UNIT=failer2.service"
2021-12-13 11:25:57 +00:00
Daan De Meyer
51462135fb exec: Add TTYRows and TTYColumns properties to set TTY dimensions 2021-11-05 21:32:14 +00:00
Luca Boccassi
211a3d87fb core: add [State|Runtime|Cache|Logs]Directory symlink as second parameter
When combined with a tmpfs on /run or /var/lib, allows to create
arbitrary and ephemeral symlinks for StateDirectory or RuntimeDirectory.
This is especially useful when sharing these directories between
different services, to make the same state/runtime directory 'backend'
appear as different names to each service, so that they can be added/removed
to a sharing agreement transparently, without code changes.

An example (simplified, but real) use case:

foo.service:
StateDirectory=foo

bar.service:
StateDirectory=bar

foo.service.d/shared.conf:
StateDirectory=
StateDirectory=shared:foo

bar.service.d/shared.conf:
StateDirectory=
StateDirectory=shared:bar

foo and bar use respectively /var/lib/foo and /var/lib/bar. Then
the orchestration layer decides to stop this sharing, the drop-in
can be removed. The services won't need any update and will keep
working and being able to store state, transparently.

To keep backward compatibility, new DBUS messages are added.
2021-10-28 10:47:46 +01:00
Iago Lopez Galeiras
b1994387d3 core: use LSM BPF functions to implement RestrictFileSystems=
It attaches the LSM BPF program when the system manager starts up.

It populates the hash of maps BPF map when services that have
RestrictFileSystems= set start.

It cleans up the hash of maps when the unit cgroup is pruned.

To pass the file descriptor of the BPF map we add it to the keep_fds
array.
2021-10-06 10:52:14 +02:00
alexlzhu
8c35c10d20 core: Add ExecSearchPath parameter to specify the directory relative to which binaries executed by Exec*= should be found
Currently there does not exist a way to specify a path relative to which
all binaries executed by Exec should be found. The only way is to
specify the absolute path.

This change implements the functionality to specify a path relative to which
binaries executed by Exec*= can be found.

Closes #6308
2021-09-28 14:52:27 +01:00
Lennart Poettering
43144be4a1 pid1: add support for encrypted credentials 2021-07-08 09:30:56 +02:00
Xℹ Ruoyao
a70581ffb5 New directives PrivateIPC and IPCNamespacePath 2021-03-04 00:04:36 +08:00
Luca Boccassi
93f597013a Add ExtensionImages directive to form overlays
Add support for overlaying images for services on top of their
root fs, using a read-only overlay.
2021-02-23 15:34:46 +00:00
Zbigniew Jędrzejewski-Szmek
2d93c20e5f tree-wide: use -EINVAL for enum invalid values
As suggested in https://github.com/systemd/systemd/pull/11484#issuecomment-775288617.

This does not touch anything exposed in src/systemd. Changing the defines there
would be a compatibility break.

Note that tests are broken after this commit. They will be fixed in the next one.
2021-02-10 14:46:59 +01:00
Topi Miettinen
ddc155b2fd New directives NoExecPaths= ExecPaths=
Implement directives `NoExecPaths=` and `ExecPaths=` to control `MS_NOEXEC`
mount flag for the file system tree. This can be used to implement file system
W^X policies, and for example with allow-listing mode (NoExecPaths=/) a
compromised service would not be able to execute a shell, if that was not
explicitly allowed.

Example:
[Service]
NoExecPaths=/
ExecPaths=/usr/bin/daemon /usr/lib64 /usr/lib

Closes: #17942.
2021-01-29 12:40:52 +00:00
Lennart Poettering
3bdc25a4cf core: make NotifyAccess= in combination with RootDirectory=/RootImage= work
Previously if people enabled RootDirectory=/RootImage= and NotifyAccess=
together, things wouldn't work, they'd have to explicitly add
BindReadOnlyPaths=/run/systemd/notify too.

Let's make this implicit. Since both options are opt-in, if people use
them together it would be pointless not also defining the
BindReadOnlyPaths= entry, in which case we can just do it automatically.

See: #18051
2021-01-20 22:39:07 +01:00
Luca Boccassi
5e8deb94c6 core: add DBUS method to bind mount new nodes without service restart
Allow to setup new bind mounts for a service at runtime (via either
DBUS or a new 'systemctl bind' verb) with a new helper that forks into
the unit's mount namespace.
Add a new integration test to cover this.

Useful for zero-downtime addition to services that are running inside
mount namespaces, especially when using RootImage/RootDirectory.

If a service runs with a read-only root, a tmpfs is added on /run
to ensure we can create the airlock directory for incoming mounts
under /run/host/incoming.
2021-01-18 17:24:05 +00:00
Lucas Werkmeister
8d7dab1fda Add truncate: to StandardOutput= etc.
This adds the ability to specify truncate:PATH for StandardOutput= and
StandardError=, similar to the existing append:PATH. The code is mostly
copied from the related append: code. Fixes #8983.
2021-01-15 09:54:50 +01:00
Yu Watanabe
db9ecf0501 license: LGPL-2.1+ -> LGPL-2.1-or-later 2020-11-09 13:23:58 +09:00
Zbigniew Jędrzejewski-Szmek
fcb7138ca7 test-path: do not fail the test if we fail to start a service because of cgroup setup
The test was failing because it couldn't start the service:

path-modified.service: state = failed; result = exit-code
path-modified.path: state = waiting; result = success
path-modified.service: state = failed; result = exit-code
path-modified.path: state = waiting; result = success
path-modified.service: state = failed; result = exit-code
path-modified.path: state = waiting; result = success
path-modified.service: state = failed; result = exit-code
path-modified.path: state = waiting; result = success
path-modified.service: state = failed; result = exit-code
path-modified.path: state = waiting; result = success
path-modified.service: state = failed; result = exit-code
Failed to connect to system bus: No such file or directory
-.slice: Failed to enable/disable controllers on cgroup /system.slice/kojid.service, ignoring: Permission denied
path-modified.service: Failed to create cgroup /system.slice/kojid.service/path-modified.service: Permission denied
path-modified.service: Failed to attach to cgroup /system.slice/kojid.service/path-modified.service: No such file or directory
path-modified.service: Failed at step CGROUP spawning /bin/true: No such file or directory
path-modified.service: Main process exited, code=exited, status=219/CGROUP
path-modified.service: Failed with result 'exit-code'.
Test timeout when testing path-modified.path

In fact any of the services that we try to start may fail, especially
considering that we're doing some rogue cgroup operations. See
https://github.com/systemd/systemd/pull/16603#issuecomment-679133641.
2020-10-22 11:05:17 +02:00
Lennart Poettering
74e1252072 execute: add helper for checking if root_directory/root_image are set in ExecContext 2020-10-01 11:02:11 +02:00
Zbigniew Jędrzejewski-Szmek
5e98086d16 core: remember when we set ExecContext.mount_apivfs
No functional change intended so far.
2020-09-24 10:03:18 +02:00
Topi Miettinen
9df2cdd8ec exec: SystemCallLog= directive
With new directive SystemCallLog= it's possible to list system calls to be
logged. This can be used for auditing or temporarily when constructing system
call filters.

---
v5: drop intermediary, update HASHMAP_FOREACH_KEY() use
v4: skip useless debug messages, actually parse directive
v3: don't declare unused variables with old libseccomp
v2: fix build without seccomp or old libseccomp
2020-09-15 12:54:17 +03:00
Lennart Poettering
bb0c0d6f29 core: add credentials logic
Fixes: #15778 #16060
2020-08-25 19:45:35 +02:00
Lennart Poettering
4e39995371 core: introduce ProtectProc= and ProcSubset= to expose hidepid= and subset= procfs mount options
Kernel 5.8 gained a hidepid= implementation that is truly per procfs,
which allows us to mount a distinct once into every unit, with
individual hidepid= settings. Let's expose this via two new settings:
ProtectProc= (wrapping hidpid=) and ProcSubset= (wrapping subset=).

Replaces: #11670
2020-08-24 20:11:02 +02:00
Luca Boccassi
b3d133148e core: new feature MountImages
Follows the same pattern and features as RootImage, but allows an
arbitrary mount point under / to be specified by the user, and
multiple values - like BindPaths.

Original implementation by @topimiettinen at:
https://github.com/systemd/systemd/pull/14451
Reworked to use dissect's logic instead of bare libmount() calls
and other review comments.
Thanks Topi for the initial work to come up with and implement
this useful feature.
2020-08-05 21:34:55 +01:00
Luca Boccassi
18d7370587 service: add new RootImageOptions feature
Allows to specify mount options for RootImage.
In case of multi-partition images, the partition number can be prefixed
followed by colon. Eg:

RootImageOptions=1:ro,dev 2:nosuid nodev

In absence of a partition number, 0 is assumed.
2020-07-29 17:17:32 +01:00
Zbigniew Jędrzejewski-Szmek
56a13a495c pid1: create ro private tmp dirs when /tmp or /var/tmp is read-only
Read-only /var/tmp is more likely, because it's backed by a real device. /tmp
is (by default) backed by tmpfs, but it doesn't have to be. In both cases the
same consideration applies.

If we boot with read-only /var/tmp, any unit with PrivateTmp=yes would fail
because we cannot create the subdir under /var/tmp to mount the private directory.
But many services actually don't require /var/tmp (either because they only use
it occasionally, or because they only use /tmp, or even because they don't use the
temporary directories at all, and PrivateTmp=yes is used to isolate them from
the rest of the system).

To handle both cases let's create a read-only directory under /run/systemd and
mount it as the private /tmp or /var/tmp. (Read-only to not fool the service into
dumping too much data in /run.)

$ sudo systemd-run -t -p PrivateTmp=yes bash
Running as unit: run-u14.service
Press ^] three times within 1s to disconnect TTY.
[root@workstation /]# ls -l /tmp/
total 0
[root@workstation /]# ls -l /var/tmp/
total 0
[root@workstation /]# touch /tmp/f
[root@workstation /]# touch /var/tmp/f
touch: cannot touch '/var/tmp/f': Read-only file system

This commit has more changes than I like to put in one commit, but it's touching all
the same paths so it's hard to split.
exec_runtime_make() was using the wrong cleanup function, so the directory would be
left behind on error.
2020-07-14 19:47:15 +02:00
Luca Boccassi
d4d55b0d13 core: add RootHashSignature service parameter
Allow to explicitly pass root hash signature as a unit option. Takes precedence
over implicit checks.
2020-06-25 08:45:21 +01:00
Lennart Poettering
6b000af4f2 tree-wide: avoid some loaded terms
https://tools.ietf.org/html/draft-knodel-terminology-02
https://lwn.net/Articles/823224/

This gets rid of most but not occasions of these loaded terms:

1. scsi_id and friends are something that is supposed to be removed from
   our tree (see #7594)

2. The test suite defines an API used by the ubuntu CI. We can remove
   this too later, but this needs to be done in sync with the ubuntu CI.

3. In some cases the terms are part of APIs we call or where we expose
   concepts the kernel names the way it names them. (In particular all
   remaining uses of the word "slave" in our codebase are like this,
   it's used by the POSIX PTY layer, by the network subsystem, the mount
   API and the block device subsystem). Getting rid of the term in these
   contexts would mean doing some major fixes of the kernel ABI first.

Regarding the replacements: when whitelist/blacklist is used as noun we
replace with with allow list/deny list, and when used as verb with
allow-list/deny-list.
2020-06-25 09:00:19 +02:00
Luca Boccassi
0389f4fa81 core: add RootHash and RootVerity service parameters
Allow to explicitly pass root hash (explicitly or as a file) and verity
device/file as unit options. Take precedence over implicit checks.
2020-06-23 10:50:09 +02:00
Lennart Poettering
f3dc6af20f core: automatically update StandardOuput=syslog to =journal (and similar for StandardError=)
Let's go one step further and upgrade implicitly. Usually =syslog
assignments are historic artifacts only. Let's upgrade the lines
automatically, and politely suggest people update their unit
files/configuration (and drop the lines altogether, without
replacement).

Fixes: #15807
2020-05-15 00:05:46 +02:00
Zbigniew Jędrzejewski-Szmek
ad21e542b2 manager: add CoredumpFilter= setting
Fixes #6685.
2020-04-09 14:08:48 +02:00
Michal Sekletár
e2b2fb7f56 core: add support for setting CPUAffinity= to special "numa" value
systemd will automatically derive CPU affinity mask from NUMA node
mask.

Fixes #13248
2020-03-16 08:57:28 +01:00
Michal Sekletár
1808f76870 shared: split out NUMA code from cpu-set-util.c to numa-util.c 2020-03-16 08:23:18 +01:00
Lennart Poettering
91dd5f7cbe core: add new LogNamespace= execution setting 2020-01-31 15:01:43 +01:00
Kevin Kuehler
fc64760dda core: shared: Add ProtectClock= to systemd.exec 2020-01-26 12:23:33 -08:00
Kevin Kuehler
8470304018 core: Add ProtectKernelLogs
If seccomp is enabled, load the SYSCALL_FILTER_SET_SYSLOG into the
seccomp filter set. Drop the CAP_SYSLOG capability.
2019-11-11 12:12:02 -08:00
Zbigniew Jędrzejewski-Szmek
5ac1530eca tree-wide: say "ratelimit" not "rate_limit"
"ratelimit" is a real word, so we don't need to use the other form anywhere.
We had both forms in various places, let's standarize on the shorter and more
correct one.
2019-09-20 16:05:53 +02:00
Yu Watanabe
12213aed12 core: move timeout_clean_usec from Service to ExecContext 2019-08-28 23:09:54 +09:00
Lennart Poettering
6b7b2ed96b core: add type of resource string table 2019-07-11 12:18:51 +02:00
Lennart Poettering
4c2f584230 core: hook up service unit type with the new clean operation
The implementation is pretty straight-foward: when we get a request to
clean some type of resources we fork off a process doing that, and while
it is running we are in the "cleaning" state.
2019-07-11 12:18:51 +02:00
Lennart Poettering
380dc8b0a2 core: add generic "clean" operation to units
This adds basic infrastructure to implement a "clean" operation for unit
types. This "clean" operation is supposed to remove on-disk resources of
units, and is supposed to be used in a later commit to clean our
RuntimeDirectory=, StateDirectory= and so on of service units.

Later commits will open this up to the bus, and hook up service units
with this.

This also adds a new generic ActiveState called UNIT_MAINTENANCE. It's
supposed to cover all kinds of "maintainance" state of units.
Specifically, this is supposed to cover the "cleaning" operations later
added for service units which might take a bit of time. This high-level,
generic, abstract state is called UNIT_MAINTENANCE instead of the
more specific "UNIT_CLEANING", since I think this should be kept open
for different operations possibly later on that could be nicely subsumed
under this (for example, maybe a recursive chown()ing operation could be
covered by this, and similar).
2019-07-11 12:18:51 +02:00
Michal Sekletar
b070c7c0e1 core: introduce NUMAPolicy and NUMAMask options
Make possible to set NUMA allocation policy for manager. Manager's
policy is by default inherited to all forked off processes. However, it
is possible to override the policy on per-service basis. Currently we
support, these policies: default, prefer, bind, interleave, local.
See man 2 set_mempolicy for details on each policy.

Overall NUMA policy actually consists of two parts. Policy itself and
bitmask representing NUMA nodes where is policy effective. Node mask can
be specified using related option, NUMAMask. Default mask can be
overwritten on per-service level.
2019-06-24 16:58:54 +02:00
Anita Zhang
b3d593673c core: add ExecStartXYZEx= with dbus support for executable prefixes
Closes #11654
2019-05-30 20:41:42 -07:00
Zbigniew Jędrzejewski-Szmek
0985c7c4e2 Rework cpu affinity parsing
The CPU_SET_S api is pretty bad. In particular, it has a parameter for the size
of the array, but operations which take two (CPU_EQUAL_S) or even three arrays
(CPU_{AND,OR,XOR}_S) still take just one size. This means that all arrays must
be of the same size, or buffer overruns will occur. This is exactly what our
code would do, if it received an array of unexpected size over the network.
("Unexpected" here means anything different from what cpu_set_malloc() detects
as the "right" size.)

Let's rework this, and store the size in bytes of the allocated storage area.

The code will now parse any number up to 8191, independently of what the current
kernel supports. This matches the kernel maximum setting for any architecture,
to make things more portable.

Fixes #12605.
2019-05-29 10:20:42 +02:00
Lennart Poettering
f69567cbe2 core: expose SUID/SGID restriction as new unit setting RestrictSUIDSGID= 2019-04-02 16:56:48 +02:00
Lennart Poettering
0a6991e0bb tree-wide: reorder various structures to make them smaller and use fewer cache lines
Some "pahole" spelunking.
2019-03-27 18:11:11 +01:00
Zbigniew Jędrzejewski-Szmek
ca78ad1de9 headers: remove unneeded includes from util.h
This means we need to include many more headers in various files that simply
included util.h before, but it seems cleaner to do it this way.
2019-03-27 11:53:12 +01:00
Lennart Poettering
6f765baf23 core: rework how we reset the TTY after use by a service
This makes two changes:

1. Instead of resetting the configured service TTY each time after a
   process exited, let's do so only when the service goes back to "dead"
   state. This should be preferable in case the started processes leave
   background child processes around that still reference the TTY.

2. chmod() and chown() the TTY at the same time. This should make it
   safe to run "systemd-run -p DynamicUser=1 -p StandardInput=tty -p
   TTYPath=/dev/tty8 /bin/bash" without leaving a TTY owned by a dynamic
   user around.
2019-03-20 21:28:02 +01:00
Lennart Poettering
a8d08f39d1 core: add new setting NetworkNamespacePath= for configuring a netns by path for a service
Fixes: #2741
2019-03-07 16:55:23 +01:00
Anita Zhang
7ca69792e5 core: add ':' prefix to ExecXYZ= skip env var substitution 2019-02-20 17:58:14 +01:00
Topi Miettinen
aecd5ac621 core: ProtectHostname= feature
Let services use a private UTS namespace. In addition, a seccomp filter is
installed on set{host,domain}name and a ro bind mounts on
/proc/sys/kernel/{host,domain}name.
2019-02-20 10:50:44 +02:00