Currently DelegateNamespaces= only works for services spawned by the
system manager. User managers will always unshare the user namespace
first even if they're running with CAP_SYS_ADMIN.
Let's add support for DelegateNamespaces= for user managers if they're
running with CAP_SYS_ADMIN. By default, we'll still delegate all
namespaces
for user managers, but this can now be overridden by explicitly passing
DelegateNamespaces=.
If a user manager is running without CAP_SYS_ADMIN, the user manager is
still always unshared first just like before.
capability_ambient_set_apply() can be called with capability sets
containing unknown capabilities. Let's not crash when this is the
case but instead ignore the unknown capabilities.
This fixes a crash when running the following command:
"systemd-run -p "AmbientCapabilities=~" --wait --pipe id"
Fixes d5e12dc75e
Let's return the size in a return parameter instead of the return value.
And if NULL is specified this tells us the caller doesn't care about the
size and expects a NUL terminated string. In that case look for an
embedded NUL byte, and refuse in that case.
This should lock things down a bit, as we'll systematically refuse
embedded NUL strings now when we expect strings.
This is a simple helper for creating a userns that just maps the
callers user to UID 0 in the namespace. This can be acquired unpriv,
which makes it useful for various purposes, for example for the logic in
is_idmapping_supported(), hence port it over.
(is_idmapping_supported() used a different mapping before, with the
nobody users, but there's no real reason for that, and we'll use
userns_acquire_self_root() elsewhere soon, where the root mapping is
important).
Unprivileged namespaces are only allowed if the "setgroups" file is set
to "deny" for processes. And we need to write it before writing the
gidmap. Hence add a parameter for that.
Then, also patch all current users to actually enable this. The usecase
generally don't need it (because they don't care about unprivileged
userns), but it doesn't hurt to enable the concept anyway in all current
users (none of them actually runs complex userspace in them, but they
mostly use userns_acquire() for idmapped mounts and similar).
Let's anyway make this option explicit in the function call, to indicate
that the concept exists and is applied.
To support C23, this introduces UTF8() macro to define UTF-8 literals,
as C23 changed char8_t from char to unsigned char.
This also makes pointer signedness warning critical, and updates C
standards table for tests.
C23 changed char8_t from char to unsigned char, hence assigning a u8 literal
to const char* emits pointer sign warning, e.g.
========
../src/shared/qrcode-util.c: In function ‘print_border’:
../src/shared/qrcode-util.c:16:34: warning: pointer targets in passing argument 1 of ‘fputs’ differ in signedness [-Wpointer-sign]
16 | #define UNICODE_FULL_BLOCK u8"█"
| ^~~~~
| |
| const unsigned char *
../src/shared/qrcode-util.c:65:39: note: in expansion of macro ‘UNICODE_FULL_BLOCK’
65 | fputs(UNICODE_FULL_BLOCK, output);
| ^~~~~~~~~~~~~~~~~~
========
This introduces UTF8() macro, which define u8 literal and casts to consth char*,
then rewrites all u8 literal definitions with the macro.
With this change, we can build systemd with C23.
Admittedly, some of our glyphs _are_ special, e.g. "O=" for SPECIAL_GLYPH_TOUCH ;)
But we don't need this in the name. The very long names make some invocations
very wordy, e.g. special_glyph(SPECIAL_GLYPH_SLIGHTLY_UNHAPPY_SMILEY).
Also, I want to add GLYPH_SPACE, which is not special at all.
Kernel API file systems typically use either "raw" or "seq_file" to
implement their various interface files. The former are really simple
(to point I'd call them broken), in that they have no understanding of
file offsets, and return their contents again and again on every read(),
and thus EOF is indicated by a short read, not by a zero read. The
latter otoh works like a typical file: you read until you get a
zero-sized read back.
We have read_virtual_file() to read the "raw" files, and can use regular
read_full_file() to read the "seq_file" ones.
Apparently all files in the top-level /proc/ directory use 'seq_file'.
but we accidentally used read_virtual_file() for them. Fix that.
Also clarify in a comment what the rules are.
Fixes: #36131
In one of the next commits we'd like to introduce a concept of
optionally hashing the hostname from the machine ID. For that we we need
to optionally back gethostname_full() by code involving sd-id128, hence
let's move it from src/basic/ to src/shared/, since only there we are
allowed to use our public APIs.
The capsule instances are related to user instances, so treat them
equally to user@.service when handling cgroup paths. This also saves us
from polluting public libsystemd API with variant for capsules too.
Fix: https://github.com/systemd/systemd/issues/36098
The capsule instances are related to user instances, so treat them
equally to user@.service when handling cgroup paths. This also saves us
from polluting public libsystemd API with variant for capsules too.
Fix: #36098
So apparently "linux,dummy-virt" is a devicetree in popular use by
various hypervisors, including crosvm:
e5d7a64d37/aarch64/src/fdt.rs (L692)
and qemu:
98c7362b1e/hw/arm/virt.c (L283)
and that's because the kernel ships support for that natively:
https://www.kernel.org/doc/Documentation/devicetree/bindings/arm/linux%2Cdummy-virt.yaml
It's explicitly for using in virtualization. Hence it's suitable for
detecting it as generic fallback.
This hence adds the check, similar to how we already look for one other
qemu-specific devicetree.
I ran into this while playing around with the new Pixel "Linux Terminal"
app from google which runs a Debian in a crosvm apparently. So far
systemd didn't recognize execution in it at all. Let's at least
recognize it as VM at all, even if this doesn't recognize it as
crosvm.
tianocore does some weird shit with its terminal emulation and regular
fills half the terminal with grey background and then invokes us with
this not cleared up. Hence let us clear this up for it: as part of the
ansi sequence based reset let's position the cursor explicitly at the
beginning of the current line, and erase everything till the end of the
screen. This makes boot output in tianocore vms much much cleaner.
Note that this does *not* erase any terminal output *before* the cursor
position where we take over, because that typically contains valuable
information still we should not erase.
This used to be relevant since in old versions of glibc an internal
cache is maintained, while we might sidestep their invalidation
with raw_clone(). After glibc 2.25 getpid() is a trivial wrapper
for the syscall, and hence there's no need to have a separate
raw_getpid().
We always validate that the target value is below _LOG_TARGET_SINGLE_MAX
before acessing it, but we don't actually size the array like that.
let's fix that.
This doesn#t effectively change anything, but it makes things more
explicit what the limit here is.
glibc exports getdents64 syscall as is, but musl exports it as
posix_getdents(). Let's introduce a simple wrapper of posix_getdents().
Note, our baseline for glibc is 2.31. Hence, we can assume getdents64()
always defined when building with glibc.
To resolve conflict with sys/mount.h and linux/mount.h or linux/fs.h.
The conflict between sys/mount.h and linux/mount.h is resolved in
glibc-2.37 (774058d72942249f71d74e7f2b639f77184160a6), but our baseline
is still glibc-2.31. Also, even with the version or newer, still
sys/mount.h conflicts with linux/fs.h, which is included by
linux/btrfs.h.
This introduces our own implementation of sys/mount.h, that can be
simultaneously included with linux/mount.h and linux/fs.h. This also
imports linux/fs.h, linux/mount.h, and several other dependent headers.
The introduced sys/mount.h header itself may not be enough simple, but
by using the header, we can drop most of workarounds in other source files.