This gets the resource limits of a specified process, and is very
similar to prlimit() with a NULL new_rlimit argument. In fact, it tries
that first. However, it then falls back to using /proc/$PID/limits. Why?
Simply because Linux prohibits access to prlimit() for processes with a
different UID, but /proc/$PID/limits still works.
This is preparation to allow nspawn to run unprivileged.
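A minimal sketch of the "prlimit() first, /proc/$PID/limits second"
approach described above. All function names are hypothetical and this
is not the actual implementation; it assumes that EPERM is the error
that triggers the fallback, and that the caller passes the row label
used in /proc/$PID/limits (e.g. "Max open files") alongside the
RLIMIT_* constant.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <unistd.h>

/* Parses one value column of /proc/$PID/limits: a number or "unlimited". */
int parse_limit_value(const char *s, rlim_t *ret) {
        char *end = NULL;
        unsigned long long v;

        assert(s);
        assert(ret);

        if (strcmp(s, "unlimited") == 0) {
                *ret = RLIM_INFINITY;
                return 0;
        }

        errno = 0;
        v = strtoull(s, &end, 10);
        if (errno != 0 || end == s || *end != '\0')
                return -EINVAL;

        *ret = (rlim_t) v;
        return 0;
}

/* Returns 1 if 'line' is the row labelled 'name' and was parsed, 0 if it
 * is some other row, and a negative errno-style value on malformed input. */
int parse_limits_line(const char *line, const char *name, struct rlimit *ret) {
        char soft[32], hard[32];
        size_t n;

        assert(line);
        assert(name);
        assert(ret);

        n = strlen(name);
        if (strncmp(line, name, n) != 0 || line[n] != ' ')
                return 0;

        if (sscanf(line + n, " %31s %31s", soft, hard) != 2)
                return -EINVAL;

        if (parse_limit_value(soft, &ret->rlim_cur) < 0 ||
            parse_limit_value(hard, &ret->rlim_max) < 0)
                return -EINVAL;

        return 1;
}

/* Hypothetical helper: try prlimit() first, and only if that is refused
 * with EPERM (assumption) read the same information from procfs. */
int pid_getrlimit_sketch(pid_t pid, int resource, const char *proc_name, struct rlimit *ret) {
        char path[64], line[256];
        int r = -ENODATA;
        FILE *f;

        if (prlimit(pid, resource, NULL, ret) >= 0)
                return 0;
        if (errno != EPERM)
                return -errno;

        if (pid == 0)
                pid = getpid();

        snprintf(path, sizeof(path), "/proc/%ld/limits", (long) pid);
        f = fopen(path, "r");
        if (!f)
                return -errno;

        while (fgets(line, sizeof(line), f)) {
                int k = parse_limits_line(line, proc_name, ret);
                if (k != 0) {
                        r = k < 0 ? k : 0;
                        break;
                }
        }

        fclose(f);
        return r;
}
```

The line parser is kept separate from the file access so that it can be
exercised without depending on what is mounted at /proc.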
This brings the list of attributes to delegate to managers of subcgroups
to the state of kernel 6.6.
We probably should unify this list, and maybe generate it automatically
from /sys/kernel/cgroup/delegate, but let's do that another time.
Currently the check doesn't take any settings from nspawn settings
files into account, so let's delay the check until after we've
loaded any settings file.
This reverts commit 30462563b1.
fchmodat2(), while accepting AT_SYMLINK_NOFOLLOW as a valid flag,
always returns EOPNOTSUPP when operating on a symlink. The Linux kernel
simply doesn't support changing the mode of a symlink.
Fixes #30157
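The behaviour described above can be observed with a small probe. This
is a sketch only: it goes through the libc fchmodat() wrapper with
AT_SYMLINK_NOFOLLOW (which, depending on the libc version, may use
fchmodat2() or emulate the flag itself), the function names are
hypothetical, and it assumes /tmp is writable.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if the mode was changed, 0 if the call was refused with
 * EOPNOTSUPP (the result reported above for symlinks), and a negative
 * errno-style value on any other failure. */
int chmod_nofollow_sketch(int dir_fd, const char *name, mode_t mode) {
        assert(name);

        if (fchmodat(dir_fd, name, mode, AT_SYMLINK_NOFOLLOW) >= 0)
                return 1;
        if (errno == EOPNOTSUPP)
                return 0;
        return -errno;
}

/* Creates a dangling symlink in a fresh temporary directory and reports
 * what chmod_nofollow_sketch() returns for it. */
int probe_symlink_chmod_sketch(void) {
        char dir[] = "/tmp/chmod-sketch-XXXXXX";
        int dir_fd, r;

        if (!mkdtemp(dir))
                return -errno;

        dir_fd = open(dir, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
        if (dir_fd < 0) {
                r = -errno;
                rmdir(dir);
                return r;
        }

        if (symlinkat("does-not-exist", dir_fd, "link") < 0)
                r = -errno;
        else {
                r = chmod_nofollow_sketch(dir_fd, "link", 0600);
                unlinkat(dir_fd, "link", 0);
        }

        close(dir_fd);
        rmdir(dir);
        return r;
}
```

A return value of 0 from probe_symlink_chmod_sketch() means the symlink
mode change was refused with EOPNOTSUPP.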
Introduce a new env variable $SYSTEMD_NSPAWN_CHECK_OS_RELEASE, that can
be used to disable the os-release check for bootable OS trees. Useful
when trying to boot a container with empty /etc/ and bind-mounted /usr/.
Resolves: #29185
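A sketch of how such a boolean environment switch can be evaluated. The
helper names are hypothetical, and two choices here are assumptions
rather than something stated above: the check stays enabled when the
variable is unset, and also when its value cannot be parsed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <strings.h>

/* Returns 1 or 0 for a recognized boolean string, -1 otherwise. */
int parse_boolean_sketch(const char *s) {
        static const char *const yes[] = { "1", "yes", "y", "true", "t", "on" };
        static const char *const no[] = { "0", "no", "n", "false", "f", "off" };

        assert(s);

        for (size_t i = 0; i < sizeof(yes) / sizeof(yes[0]); i++)
                if (strcasecmp(s, yes[i]) == 0)
                        return 1;

        for (size_t i = 0; i < sizeof(no) / sizeof(no[0]); i++)
                if (strcasecmp(s, no[i]) == 0)
                        return 0;

        return -1;
}

/* The os-release check is skipped only if the variable is explicitly
 * set to a false value (assumption, see above). */
bool os_release_check_enabled_sketch(void) {
        const char *e = getenv("SYSTEMD_NSPAWN_CHECK_OS_RELEASE");

        if (!e)
                return true;

        return parse_boolean_sketch(e) != 0;
}
```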
Sometimes it makes sense to hard kill a client if we die. Let's hence
add a third FORK_DEATHSIG flag for this purpose: FORK_DEATHSIG_SIGKILL.
To make things less confusing this also renames FORK_DEATHSIG to
FORK_DEATHSIG_SIGTERM to make clear it sends SIGTERM. We already had
FORK_DEATHSIG_SIGINT, hence this makes things nicely symmetric.
A bunch of users are switched over for FORK_DEATHSIG_SIGKILL where we
know it's safe to abort things abruptly. This should make some kernel
cases more robust, since we cannot get confused by signal masks or such.
While we are at it, also fix a bunch of bugs where we didn't take
FORK_DEATHSIG_SIGINT into account in safe_fork().
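For illustration, the kernel mechanism behind such death signals is
prctl(PR_SET_PDEATHSIG). The following is a sketch with a hypothetical
function name and not the safe_fork() code itself: the child requests
the signal, re-checks its parent to close the race against an early
parent exit, and reports via its exit code whether the setting stuck.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a child that asks the kernel to send it 'sig' when the parent
 * dies. Returns the child's exit code (0 if the setting took effect),
 * or -1 if fork()/waitpid() failed. */
int fork_with_deathsig_sketch(int sig) {
        pid_t parent = getpid(), pid;
        int status;

        assert(sig > 0);

        pid = fork();
        if (pid < 0)
                return -1;

        if (pid == 0) {
                int cur = 0;

                if (prctl(PR_SET_PDEATHSIG, (unsigned long) sig) < 0)
                        _exit(2);

                /* The parent may have died before the prctl() took
                 * effect, in which case no signal will ever arrive. */
                if (getppid() != parent)
                        _exit(3);

                if (prctl(PR_GET_PDEATHSIG, &cur) < 0 || cur != sig)
                        _exit(4);

                _exit(0);
        }

        if (waitpid(pid, &status, 0) < 0)
                return -1;

        return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With SIGKILL the child cannot block or handle the signal, which is why
it is unaffected by signal masks.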
varlink_dispatch() is a simple wrapper around json_dispatch() that
returns a clean, standards-compliant InvalidParameter error back to
clients if the specified JSON cannot be parsed properly.
For this json_dispatch() is extended to return the offending field's
name. Because it already has quite a few parameters, I then renamed
json_dispatch() to json_dispatch_full() and made json_dispatch() a
wrapper around it that passes the new argument as NULL. While doing so I
figured we should also get rid of the bad= argument in the short
wrapper, since it's only used in the OCI code.
To simplify the OCI code this adds a second wrapper oci_dispatch()
around json_dispatch_full(), that fills in bad= the way we want.
Net result: instead of one json_dispatch() call there are now:
1. json_dispatch_full() for the fully featured mother of all dispatchers.
2. json_dispatch() for the simpler version that you want to use most of
the time.
3. varlink_dispatch() that generates nice Varlink errors.
4. oci_dispatch() that does the OCI specific error handling.
And that's all there is.
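The layering can be sketched with a toy dispatcher over plain key/value
pairs instead of real JSON. Every type and function name below is
hypothetical and none of the signatures match the actual ones; the
point is only the shape: a "full" variant that reports the offending
field, a short wrapper that passes NULL, and a wrapper that turns the
field name into a Varlink-style error string. The OCI wrapper is left
out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

typedef struct Field {
        const char *name;
        const char *value;
} Field;

typedef struct DispatchEntry {
        const char *name;
        int (*callback)(const char *value, void *userdata);
} DispatchEntry;

/* Example callback: rejects empty values and counts accepted ones. */
int accept_nonempty(const char *value, void *userdata) {
        int *n_seen = userdata;

        if (!value || !*value)
                return -EINVAL;

        (*n_seen)++;
        return 0;
}

/* "Full" variant: on failure, optionally returns the offending field. */
int dispatch_full_sketch(const Field *fields, size_t n_fields,
                         const DispatchEntry *table, void *userdata,
                         const char **reterr_bad_field) {
        assert(fields || n_fields == 0);
        assert(table);

        for (size_t i = 0; i < n_fields; i++) {
                const DispatchEntry *e = table;
                int r;

                /* The table is terminated by an entry with a NULL name. */
                while (e->name && strcmp(e->name, fields[i].name) != 0)
                        e++;

                r = e->name ? e->callback(fields[i].value, userdata) : -EINVAL;
                if (r < 0) {
                        if (reterr_bad_field)
                                *reterr_bad_field = fields[i].name;
                        return r;
                }
        }

        return 0;
}

/* Short wrapper for callers that do not care which field was bad. */
int dispatch_sketch(const Field *fields, size_t n_fields,
                    const DispatchEntry *table, void *userdata) {
        return dispatch_full_sketch(fields, n_fields, table, userdata, NULL);
}

/* Wrapper that formats the bad field as a Varlink-style error. */
int varlink_style_dispatch_sketch(const Field *fields, size_t n_fields,
                                  const DispatchEntry *table, void *userdata,
                                  char *error, size_t error_size) {
        const char *bad = NULL;
        int r;

        r = dispatch_full_sketch(fields, n_fields, table, userdata, &bad);
        if (r < 0 && bad)
                snprintf(error, error_size,
                         "org.varlink.service.InvalidParameter(parameter=%s)", bad);

        return r;
}
```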
Prior to this commit, if the target had been a symlink, we did nothing
with it. Let's try with fchmodat2() and skip gracefully if not supported.
Co-authored-by: Mike Yuan <me@yhndnzj.com>
If we have a DDI that contains only a /usr/ tree (and which is thus
combined with a tmpfs for root on boot) we previously would try to apply
idmapping to the tmpfs, but not the /usr/ mount. That's broken of
course.
Fix this by applying it to both trees.
Let's wait until the child is fully done with mounting its own
instances of procfs/sysfs before we destroy our fully visible copies of
them.
This borrows heavily from Christian Brauner's fix #29521, but splits the
place + sync into two steps so that the child payload is not started
before the parent has destroyed the procfs instance.
Alternative to: #29521
Fixes: #28157
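The ordering requirement can be sketched as a two-step handshake over a
socketpair. This is a simulation under stated assumptions, not the
nspawn code: no mounts are touched, the function name and the 'M'/'D'
message bytes are made up for illustration, and the comments mark where
the real mount work would happen.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 0 if both steps happened in order, a positive child exit code
 * if the child saw a problem, and -1 on fork/socket failure. */
int two_step_sync_sketch(void) {
        int sv[2], status;
        pid_t pid;
        char c;
        bool ok;

        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
                return -1;

        pid = fork();
        if (pid < 0) {
                close(sv[0]);
                close(sv[1]);
                return -1;
        }

        if (pid == 0) {
                close(sv[0]);

                /* Step 1: the child would mount its own procfs/sysfs
                 * here, then tell the parent that it is done. */
                c = 'M';
                if (write(sv[1], &c, 1) != 1)
                        _exit(1);

                /* Step 2: wait until the parent has destroyed its
                 * copies. EOF or a wrong byte counts as failure. */
                if (read(sv[1], &c, 1) != 1 || c != 'D')
                        _exit(2);

                /* Only now would the payload be started. */
                _exit(0);
        }

        close(sv[1]);

        ok = read(sv[0], &c, 1) == 1 && c == 'M';
        if (ok) {
                /* The parent would destroy its fully visible
                 * procfs/sysfs instances here. */
                c = 'D';
                ok = write(sv[0], &c, 1) == 1;
        }

        /* If anything went wrong the child sees EOF and exits non-zero. */
        close(sv[0]);

        if (waitpid(pid, &status, 0) < 0)
                return -1;

        return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```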
We use it for more than just pipe() arrays. For example also for
socketpair(). Hence let's give it a generic name.
Also add EBADF_TRIPLET to mirror this for things like
stdin/stdout/stderr arrays, which we use a bunch of times.
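A sketch of how such initializers are typically used. The macro bodies
below are an assumption based on the names (every slot set to -EBADF),
and the helper functions are hypothetical; starting from -EBADF means a
cleanup path can tell "never opened" from a valid fd without extra
bookkeeping.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <unistd.h>

/* Assumed definitions: all fds start out as "not open". */
#define EBADF_PAIR { -EBADF, -EBADF }
#define EBADF_TRIPLET { -EBADF, -EBADF, -EBADF }

/* Closes fd if it is valid; always returns -EBADF so that callers can
 * write fd = safe_close_sketch(fd). */
int safe_close_sketch(int fd) {
        if (fd >= 0)
                close(fd);
        return -EBADF;
}

/* Closes every open fd in the array and resets the slots. */
void close_many_sketch(int *fds, size_t n) {
        assert(fds || n == 0);

        for (size_t i = 0; i < n; i++)
                fds[i] = safe_close_sketch(fds[i]);
}

/* The same initializer serves pipe() and socketpair() alike. */
int open_channel_sketch(int use_socketpair, int ret[2]) {
        int fds[2] = EBADF_PAIR;
        int r;

        assert(ret);

        r = use_socketpair ? socketpair(AF_UNIX, SOCK_STREAM, 0, fds) : pipe(fds);
        if (r < 0)
                return -errno;

        ret[0] = fds[0];
        ret[1] = fds[1];
        return 0;
}
```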
If systemd-nspawn is newer than the running systemd, we might try to set
CoredumpReceive=yes when systemd doesn't know about it yet. Try and
check if the running systemd is aware of this setting, and if not, don't
try and use it.
Fixes 411d8c72ec
("nspawn: set CoredumpReceive=yes on container's scope when --boot is set").
When --boot is set, and --keep-unit is not, set CoredumpReceive=yes on
the scope allocated for the container. When --keep-unit is set, nspawn
does not allocate the container's unit, so the existing unit needs to
configure this setting itself.
Since systemd-nspawn@.service sets --boot and --keep-unit, add
CoredumpReceive=yes to that unit.
The naming was confused: suffix 'p' means that the function takes a pointer to
the type that the wrapped function takes. (E.g., a char**, for a wrapped function
taking a char*.) But DEFINE_TRIVIAL_DESTRUCTOR() just changes the return type.
Also add one more assert for consistency.
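The naming rule can be illustrated with a made-up type. Foo and all
function names below are hypothetical and written out by hand rather
than generated by the real macros; the cleanup attribute is a GCC/Clang
extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct Foo {
        int *freed_counter;
} Foo;

Foo *foo_new(int *freed_counter) {
        Foo *f = malloc(sizeof(Foo));

        if (!f)
                return NULL;

        f->freed_counter = freed_counter;
        return f;
}

/* The wrapped function: takes a Foo*, returns NULL so that callers can
 * write f = foo_free(f). */
Foo *foo_free(Foo *f) {
        if (!f)
                return NULL;

        if (f->freed_counter)
                (*f->freed_counter)++;

        free(f);
        return NULL;
}

/* 'p' suffix: takes a pointer to what foo_free() takes (a Foo**), which
 * is the signature the cleanup attribute needs. */
void foo_freep(Foo **f) {
        assert(f);

        if (*f)
                *f = foo_free(*f);
}

/* Trivial destructor: same parameter type as foo_free(), only the
 * return type changes to void. Hence no 'p' suffix. */
void foo_destroy(Foo *f) {
        foo_free(f);
}

/* Returns how many objects were freed when the scope was left. */
int cleanup_demo_sketch(void) {
        int freed = 0;

        {
                __attribute__((cleanup(foo_freep))) Foo *f = foo_new(&freed);

                if (!f)
                        return -1;
        }

        return freed;
}
```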
This adds support for the new fsmount() logic of the kernel: we'll first
create an unattached fsmount fd, and then in a second step attach this
to some real file system inode, as opposed to attaching the file system
directly. The benefit of this is that we can pass the open fsmount fds
over some sockets if need be, to isolate the mounting code from the
attaching code.
No functional change. In config_parse_address_generation_type() we would set
the output parameter and then say it's ignored, so it _looked_ like an error in
the code, but the variable was always initialized to SD_ID128_NULL anyway, so
the code was actually fine.
seccomp-util.h doesn't need ifdeffing, hence don't. It has worked for
quite a while when HAVE_SECCOMP is off, hence use it everywhere.
Also drop explicit seccomp.h inclusion everywhere (which needs
HAVE_SECCOMP ifdeffery everywhere). seccomp-util.h includes it anyway,
automatically, which we can just rely on, and it deals with HAVE_SECCOMP
at one central place.
Given that ERRNO_IS_SECCOMP_FATAL() also matches positive values,
make sure this macro is not called with arguments that do not have
errno semantics.
In this case the arguments passed to ERRNO_IS_SECCOMP_FATAL() are the
values returned by external libseccomp function seccomp_load() which is
not expected to return any positive values, but let's be consistent
anyway and move ERRNO_IS_SECCOMP_FATAL() invocations to the branches
where the return values are known to be negative.
Given that ERRNO_IS_NOT_SUPPORTED() also matches positive values,
make sure this macro is not called with arguments that do not have
errno semantics.
In this case the argument passed to ERRNO_IS_NOT_SUPPORTED() is the
value returned by remount_idmap() which is not expected to return
any positive values, but let's be consistent anyway and move the
ERRNO_IS_NOT_SUPPORTED() invocation to the branch where
the return value is known to be negative.
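The pattern can be sketched as follows. The checker below is a
stand-in written as a plain function with a hypothetical name, and the
set of errno values it matches is an assumption chosen for
illustration; what matters is that it accepts both signs, so it must
only be consulted once the value is known to be negative.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Matches both e and -e, like the macros described above. */
bool errno_is_not_supported_sketch(int r) {
        int e = abs(r);

        return e == EOPNOTSUPP || e == ENOTTY || e == ENOSYS;
}

/* 'r' is the result of some operation with errno semantics: negative
 * on failure, zero or positive on success. Returns 1 if the operation
 * was applied, 0 if it is unsupported and was skipped, and the
 * original negative value on any other error. */
int handle_optional_result_sketch(int r) {
        if (r < 0) {
                /* Only here is it certain that r is an error code. */
                if (errno_is_not_supported_sketch(r))
                        return 0;

                return r;
        }

        /* A positive value that happens to equal EOPNOTSUPP is a
         * success value and must not be classified as an error. */
        return 1;
}
```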