O_CREAT doesn't make sense for fd_reopen, since we're
working on an already opened fd. Also, in fd_reopen
we don't handle the mode parameter of open(2), which
means we may get runtime error like #29938.
Until now, using any form of seccomp while being unprivileged (User=)
resulted in systemd enabling no_new_privs.
There's no need for doing this because:
* We trust the filters we apply
* If User= is set and a process wants to apply a new seccomp filter, it
will need to set no_new_privs itself
An example of application that might want seccomp + !no_new_privs is a
program that wants to run as an unprivileged user but uses file
capabilities to start a web server on a privileged port while
benefitting from a restrictive seccomp profile.
We now keep the privileges needed to do seccomp before calling
enforce_user() and drop them after the seccomp filters are applied.
If the syscall filter doesn't allow the needed syscalls to drop the
privileges, we keep the previous behavior by enabling no_new_privs.
JSON-I (RFC 7493) suggests to use the URL safe base64 alphabet, rather
than the regular one when encoding binary data in JSON strings. We
generally uses the regular alphabet though.
Let's be tolerant in what we parse however: simply accept both formats
when we parse base64.
This does nothing about base64 generation though, only about parsing.
This is just like lock_generic(), but applies the lock with a timeout.
This requires jumping through some hoops by executing things in a child
process, so that we can abort if necessary via a timer. Linux after all
has no native way to take file locks with a timeout.
Sometimes it makes sense to hard kill a client if we die. Let's hence
add a third FORK_DEATHSIG flag for this purpose: FORK_DEATHSIG_SIGKILL.
To make things less confusing this also renames FORK_DEATHSIG to
FORK_DEATHSIG_SIGTERM to make clear it sends SIGTERM. We already had
FORK_DEATHSIG_SIGINT, hence this makes things nicely symmetric.
A bunch of users are switched over for FORK_DEATHSIG_SIGKILL where we
know it's safe to abort things abruptly. This should make some kernel
cases more robust, since we cannot get confused by signal masks or such.
While we are at it, also fix a bunch of bugs where we didn't take
FORK_DEATHSIG_SIGINT into account in safe_fork()
This is just like FORMAT_PROC_FD_PATH() but goes via the PID number
rather than the "self" symlink.
This is useful whenever we want to generate a path that is useful
outside of our local scope.
Follow-up for 8b45281daa
and preparation for later commits.
Since libcs are more interested in the POSIX `fchmodat(3)`, they are
unlikely to provide a direct wrapper for this syscall. Thus, the headers
we examine to set `HAVE_*` are picked somewhat arbitrarily.
Also, hook up `try_fchmodat2()` in `test-seccomp.c`. (Also, correct that
function's prototype, despite the fact that mistake would not matter in
practice)
Co-authored-by: Mike Yuan <me@yhndnzj.com>
If we use CHASE_PARENT on a path ending in ".." then things are a bit
weird, because we the last path we look at is actually the *parent* and not
the *child* of the preceeding path. Hence we cannot just return the 2nd
to last fd we look at. We have to correct it, by going *two* levels up,
to get to the actual parent, and make sure CHASE_PARENT does what it
should.
Example: for the path /a/b/c chase() with CHASE_PARENT will return
/a/b/c as path, and the fd returned points to /a/b. All good. But now,
for the path /a/b/c/.. chase() with CHASE_PARENT would previously return
/a/b as path (which is OK) but the fd would point to /a/b/c, which is
*not* the parent of /a/b, after all! To get to the actual parent of
/a/b we have to go *two* levels up to get to /a.
Very confusing. But that's what we here for, no?
@mrc0mmand ran into this in https://github.com/systemd/systemd/pull/28891#issuecomment-1782833722
We use it for more than just pipe() arrays. For example also for
socketpair(). Hence let's give it a generic name.
Also add EBADF_TRIPLET to mirror this for things like
stdin/stdout/stderr arrays, which we use a bunch of times.
This is just like cg_is_delegate() but operates on an fd instead of a
cgroup path.
Sooner or later we should access cgroupfs mostly via fds rather than
paths, but we aren't there yet. But let's at least get started.
Automatically softreboot if the nextroot has been set up with an OS
tree, or automatically kexec if a kernel has been loaded with kexec
--load.
Add SYSTEMCTL_SKIP_AUTO_KEXEC and SYSTEMCTL_SKIP_AUTO_SOFT_REBOOT to
skip the automated switchover.
Even more than with the previous commit, this is not a trivial function
and there's no reason to believe this will actually be inlined nor that
it would be beneficial.