aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2020-12-02bpf: samples: Do not touch RLIMIT_MEMLOCKRoman Gushchin
Since bpf is not using rlimit memlock for the memory accounting and control, do not change the limit in sample applications. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-35-guro@fb.com
2020-12-02bpf: Eliminate rlimit-based memory accounting for bpf progsRoman Gushchin
Do not use rlimit-based memory accounting for bpf progs. It has been replaced with memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-34-guro@fb.com
2020-12-02bpf: Eliminate rlimit-based memory accounting infra for bpf mapsRoman Gushchin
Remove rlimit-based accounting infrastructure code, which is not used anymore. To provide a backward compatibility, use an approximation of the bpf map memory footprint as a "memlock" value, available to a user via map info. The approximation is based on the maximal number of elements and key and value sizes. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-33-guro@fb.com
2020-12-02bpf: Eliminate rlimit-based memory accounting for bpf local storage mapsRoman Gushchin
Do not use rlimit-based memory accounting for bpf local storage maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-32-guro@fb.com
2020-12-02bpf: Eliminate rlimit-based memory accounting for xskmap mapsRoman Gushchin
Do not use rlimit-based memory accounting for xskmap maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-31-guro@fb.com
2020-12-02bpf: Eliminate rlimit-based memory accounting for stackmap mapsRoman Gushchin
Do not use rlimit-based memory accounting for stackmap maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-30-guro@fb.com
2020-12-02bpf: Eliminate rlimit-based memory accounting for sockmap and sockhash mapsRoman Gushchin
Do not use rlimit-based memory accounting for sockmap and sockhash maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-29-guro@fb.com
2020-12-02bpf: Eliminate rlimit-based memory accounting for bpf ringbufferRoman Gushchin
Do not use rlimit-based memory accounting for bpf ringbuffer. It has been replaced with the memcg-based memory accounting. bpf_ringbuf_alloc() can't return anything except ERR_PTR(-ENOMEM) and a valid pointer, so to simplify the code make it return NULL in the first case. This allows to drop a couple of lines in ringbuf_map_alloc() and also makes it look similar to other memory allocating function like kmalloc(). Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-28-guro@fb.com
2020-12-02bpf: Eliminate rlimit-based memory accounting for reuseport_array mapsRoman Gushchin
Do not use rlimit-based memory accounting for reuseport_array maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-27-guro@fb.com
2020-12-02bpf: Eliminate rlimit-based memory accounting for queue_stack_maps mapsRoman Gushchin
Do not use rlimit-based memory accounting for queue_stack maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-26-guro@fb.com
2020-12-02bpf: Eliminate rlimit-based memory accounting for lpm_trie mapsRoman Gushchin
Do not use rlimit-based memory accounting for lpm_trie maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-25-guro@fb.com
2020-12-02bpf: Eliminate rlimit-based memory accounting for hashtab mapsRoman Gushchin
Do not use rlimit-based memory accounting for hashtab maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-24-guro@fb.com
2020-12-02bpf: Eliminate rlimit-based memory accounting for devmap mapsRoman Gushchin
Do not use rlimit-based memory accounting for devmap maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-23-guro@fb.com
2020-12-02bpf: Eliminate rlimit-based memory accounting for cgroup storage mapsRoman Gushchin
Do not use rlimit-based memory accounting for cgroup storage maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-22-guro@fb.com
2020-12-02bpf: Eliminate rlimit-based memory accounting for cpumap mapsRoman Gushchin
Do not use rlimit-based memory accounting for cpumap maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-21-guro@fb.com
2020-12-02bpf: Eliminate rlimit-based memory accounting for bpf_struct_ops mapsRoman Gushchin
Do not use rlimit-based memory accounting for bpf_struct_ops maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-20-guro@fb.com
2020-12-02bpf: Eliminate rlimit-based memory accounting for arraymap mapsRoman Gushchin
Do not use rlimit-based memory accounting for arraymap maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-19-guro@fb.com
2020-12-02bpf: Refine memcg-based memory accounting for xskmap mapsRoman Gushchin
Extend xskmap memory accounting to include the memory taken by the xsk_map_node structure. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201201215900.3569844-18-guro@fb.com
2020-12-02bpf: Refine memcg-based memory accounting for sockmap and sockhash mapsRoman Gushchin
Include internal metadata into the memcg-based memory accounting. Also include the memory allocated on updating an element. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201201215900.3569844-17-guro@fb.com
2020-12-02bpf: Memcg-based memory accounting for bpf local storage mapsRoman Gushchin
Account memory used by bpf local storage maps: per-socket, per-inode and per-task storages. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201201215900.3569844-16-guro@fb.com
2020-12-02bpf: Memcg-based memory accounting for bpf ringbufferRoman Gushchin
Enable the memcg-based memory accounting for the memory used by the bpf ringbuffer. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201201215900.3569844-15-guro@fb.com
2020-12-02bpf: Memcg-based memory accounting for lpm_trie mapsRoman Gushchin
Include lpm trie and lpm trie node objects into the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201201215900.3569844-14-guro@fb.com
2020-12-02bpf: Refine memcg-based memory accounting for hashtab mapsRoman Gushchin
Include percpu objects and the size of map metadata into the accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201201215900.3569844-13-guro@fb.com
2020-12-02bpf: Refine memcg-based memory accounting for devmap mapsRoman Gushchin
Include map metadata and the node size (struct bpf_dtab_netdev) into the accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201201215900.3569844-12-guro@fb.com
2020-12-02bpf: Memcg-based memory accounting for cgroup storage mapsRoman Gushchin
Account memory used by cgroup storage maps including metadata structures. Account the percpu memory for the percpu flavor of cgroup storage. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201201215900.3569844-11-guro@fb.com
2020-12-02bpf: Refine memcg-based memory accounting for cpumap mapsRoman Gushchin
Include metadata and percpu data into the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201201215900.3569844-10-guro@fb.com
2020-12-02bpf: Refine memcg-based memory accounting for arraymap mapsRoman Gushchin
Include percpu arrays and auxiliary data into the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201201215900.3569844-9-guro@fb.com
2020-12-02bpf: Memcg-based memory accounting for bpf mapsRoman Gushchin
This patch enables memcg-based memory accounting for memory allocated by __bpf_map_area_alloc(), which is used by many types of bpf maps for large initial memory allocations. Please note, that __bpf_map_area_alloc() should not be used outside of map creation paths without setting the active memory cgroup to the map's memory cgroup. Following patches in the series will refine the accounting for some of the map types. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201201215900.3569844-8-guro@fb.com
2020-12-02bpf: Prepare for memcg-based memory accounting for bpf mapsRoman Gushchin
Bpf maps can be updated from an interrupt context and in such case there is no process which can be charged. It makes the memory accounting of bpf maps non-trivial. Fortunately, after commit 4127c6504f25 ("mm: kmem: enable kernel memcg accounting from interrupt contexts") and commit b87d8cefe43c ("mm, memcg: rework remote charging API to support nesting") it's finally possible. To make the ownership model simple and consistent, when the map is created, the memory cgroup of the current process is recorded. All subsequent allocations related to the bpf map are charged to the same memory cgroup. It includes allocations made by any processes (even if they do belong to a different cgroup) and from interrupts. This commit introduces 3 new helpers, which will be used by following commits to enable the accounting of bpf maps memory: - bpf_map_kmalloc_node() - bpf_map_kzalloc() - bpf_map_alloc_percpu() They are wrapping popular memory allocation functions. They set the active memory cgroup to the map's memory cgroup and add __GFP_ACCOUNT to the passed gfp flags. Then they call into the corresponding memory allocation function and restore the original active memory cgroup. These helpers are supposed to use everywhere except the map creation path. During the map creation when the map structure is allocated by itself, it cannot be passed to those helpers. In those cases default memory allocation function will be used with the __GFP_ACCOUNT flag. Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201201215900.3569844-7-guro@fb.com
2020-12-02bpf: Memcg-based memory accounting for bpf progsRoman Gushchin
Include memory used by bpf programs into the memcg-based accounting. This includes the memory used by programs itself, auxiliary data, statistics and bpf line info. A memory cgroup containing the process which loads the program is getting charged. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-6-guro@fb.com
2020-12-02mm: Convert page kmemcg type to a page memcg flagRoman Gushchin
PageKmemcg flag is currently defined as a page type (like buddy, offline, table and guard). Semantically it means that the page was accounted as a kernel memory by the page allocator and has to be uncharged on the release. As a side effect of defining the flag as a page type, the accounted page can't be mapped to userspace (look at page_has_type() and comments above). In particular, this blocks the accounting of vmalloc-backed memory used by some bpf maps, because these maps do map the memory to userspace. One option is to fix it by complicating the access to page->mapcount, which provides some free bits for page->page_type. But it's way better to move this flag into page->memcg_data flags. Indeed, the flag makes no sense without enabled memory cgroups and memory cgroup pointer set in particular. This commit replaces PageKmemcg() and __SetPageKmemcg() with PageMemcgKmem() and an open-coded OR operation setting the memcg pointer with the MEMCG_DATA_KMEM bit. __ClearPageKmemcg() can be simple deleted, as the whole memcg_data is zeroed at once. As a bonus, on !CONFIG_MEMCG build the PageMemcgKmem() check will be compiled out. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Link: https://lkml.kernel.org/r/20201027001657.3398190-5-guro@fb.com Link: https://lore.kernel.org/bpf/20201201215900.3569844-5-guro@fb.com
2020-12-02mm: Introduce page memcg flagsRoman Gushchin
The lowest bit in page->memcg_data is used to distinguish between struct memory_cgroup pointer and a pointer to a objcgs array. All checks and modifications of this bit are open-coded. Let's formalize it using page memcg flags, defined in enum page_memcg_data_flags. Additional flags might be added later. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Link: https://lkml.kernel.org/r/20201027001657.3398190-4-guro@fb.com Link: https://lore.kernel.org/bpf/20201201215900.3569844-4-guro@fb.com
2020-12-02mm: memcontrol/slab: Use helpers to access slab page's memcg_dataRoman Gushchin
To gather all direct accesses to struct page's memcg_data field in one place, let's introduce 3 new helpers to use in the slab accounting code: struct obj_cgroup **page_objcgs(struct page *page); struct obj_cgroup **page_objcgs_check(struct page *page); bool set_page_objcgs(struct page *page, struct obj_cgroup **objcgs); They are similar to the corresponding API for generic pages, except that the setter can return false, indicating that the value has been already set from a different thread. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lkml.kernel.org/r/20201027001657.3398190-3-guro@fb.com Link: https://lore.kernel.org/bpf/20201201215900.3569844-3-guro@fb.com
2020-12-02mm: memcontrol: Use helpers to read page's memcg dataRoman Gushchin
Patch series "mm: allow mapping accounted kernel pages to userspace", v6. Currently a non-slab kernel page which has been charged to a memory cgroup can't be mapped to userspace. The underlying reason is simple: PageKmemcg flag is defined as a page type (like buddy, offline, etc), so it takes a bit from a page->mapped counter. Pages with a type set can't be mapped to userspace. But in general the kmemcg flag has nothing to do with mapping to userspace. It only means that the page has been accounted by the page allocator, so it has to be properly uncharged on release. Some bpf maps are mapping the vmalloc-based memory to userspace, and their memory can't be accounted because of this implementation detail. This patchset removes this limitation by moving the PageKmemcg flag into one of the free bits of the page->mem_cgroup pointer. Also it formalizes accesses to the page->mem_cgroup and page->obj_cgroups using new helpers, adds several checks and removes a couple of obsolete functions. As the result the code became more robust with fewer open-coded bit tricks. This patch (of 4): Currently there are many open-coded reads of the page->mem_cgroup pointer, as well as a couple of read helpers, which are barely used. It creates an obstacle on a way to reuse some bits of the pointer for storing additional bits of information. In fact, we already do this for slab pages, where the last bit indicates that a pointer has an attached vector of objcg pointers instead of a regular memcg pointer. This commits uses 2 existing helpers and introduces a new helper to converts all read sides to calls of these helpers: struct mem_cgroup *page_memcg(struct page *page); struct mem_cgroup *page_memcg_rcu(struct page *page); struct mem_cgroup *page_memcg_check(struct page *page); page_memcg_check() is intended to be used in cases when the page can be a slab page and have a memcg pointer pointing at objcg vector. It does check the lowest bit, and if set, returns NULL. page_memcg() contains a VM_BUG_ON_PAGE() check for the page not being a slab page. To make sure nobody uses a direct access, struct page's mem_cgroup/obj_cgroups is converted to unsigned long memcg_data. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Link: https://lkml.kernel.org/r/20201027001657.3398190-1-guro@fb.com Link: https://lkml.kernel.org/r/20201027001657.3398190-2-guro@fb.com Link: https://lore.kernel.org/bpf/20201201215900.3569844-2-guro@fb.com
2020-12-02Merge branch 'bpf: expose bpf_{s,g}etsockopt helpers to bind{4,6} hooks'Alexei Starovoitov
Stanislav Fomichev says: ==================== This might be useful for the listener sockets to pre-populate some options. Since those helpers require locked sockets, I'm changing bind hooks to lock/unlock the sockets. This should not cause any performance overhead because at this point there shouldn't be any socket lock contention and the locking/unlocking should be cheap. Also, as part of the series, I convert test_sock_addr bpf assembly into C (and preserve the narrow load tests) to make it easier to extend with th bpf_setsockopt later on. v2: * remove version from bpf programs (Andrii Nakryiko) ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-12-02selftests/bpf: Extend bind{4,6} programs with a call to bpf_setsockoptStanislav Fomichev
To make sure it doesn't trigger sock_owned_by_me splat. Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20201202172516.3483656-4-sdf@google.com
2020-12-02bpf: Allow bpf_{s,g}etsockopt from cgroup bind{4,6} hooksStanislav Fomichev
I have to now lock/unlock socket for the bind hook execution. That shouldn't cause any overhead because the socket is unbound and shouldn't receive any traffic. Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrey Ignatov <rdna@fb.com> Link: https://lore.kernel.org/bpf/20201202172516.3483656-3-sdf@google.com
2020-12-02selftests/bpf: Rewrite test_sock_addr bind bpf into CStanislav Fomichev
I'm planning to extend it in the next patches. It's much easier to work with C than BPF assembly. Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20201202172516.3483656-2-sdf@google.com
2020-12-01net, xdp, xsk: fix __sk_mark_napi_id_once napi_id errorDaniel Borkmann
Stephen reported the following build error for !CONFIG_NET_RX_BUSY_POLL built kernels: In file included from fs/select.c:32: include/net/busy_poll.h: In function 'sk_mark_napi_id_once': include/net/busy_poll.h:150:36: error: 'const struct sk_buff' has no member named 'napi_id' 150 | __sk_mark_napi_id_once_xdp(sk, skb->napi_id); | ^~ Fix it by wrapping a CONFIG_NET_RX_BUSY_POLL around the helpers. Fixes: b02e5a0ebb17 ("xsk: Propagate napi_id to XDP socket Rx path") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Björn Töpel <bjorn.topel@intel.com> Link: https://lore.kernel.org/linux-next/20201201190746.7d3357fb@canb.auug.org.au
2020-12-01Merge branch 'xdp-preferred-busy-polling'Daniel Borkmann
Björn Töpel says: ==================== This series introduces three new features: 1. A new "heavy traffic" busy-polling variant that works in concert with the existing napi_defer_hard_irqs and gro_flush_timeout knobs. 2. A new socket option that let a user change the busy-polling NAPI budget. 3. Allow busy-polling to be performed on XDP sockets. The existing busy-polling mode, enabled by the SO_BUSY_POLL socket option or system-wide using the /proc/sys/net/core/busy_read knob, is an opportunistic. That means that if the NAPI context is not scheduled, it will poll it. If, after busy-polling, the budget is exceeded the busy-polling logic will schedule the NAPI onto the regular softirq handling. One implication of the behavior above is that a busy/heavy loaded NAPI context will never enter/allow for busy-polling. Some applications prefer that most NAPI processing would be done by busy-polling. This series adds a new socket option, SO_PREFER_BUSY_POLL, that works in concert with the napi_defer_hard_irqs and gro_flush_timeout knobs. The napi_defer_hard_irqs and gro_flush_timeout knobs were introduced in commit 6f8b12d661d0 ("net: napi: add hard irqs deferral feature"), and allows for a user to defer interrupts to be enabled and instead schedule the NAPI context from a watchdog timer. When a user enables the SO_PREFER_BUSY_POLL, again with the other knobs enabled, and the NAPI context is being processed by a softirq, the softirq NAPI processing will exit early to allow the busy-polling to be performed. If the application stops performing busy-polling via a system call, the watchdog timer defined by gro_flush_timeout will timeout, and regular softirq handling will resume. In summary; Heavy traffic applications that prefer busy-polling over softirq processing should use this option. Patch 6 touches a lot of drivers, so the Cc: list is grossly long. Example usage: $ echo 2 | sudo tee /sys/class/net/ens785f1/napi_defer_hard_irqs $ echo 200000 | sudo tee /sys/class/net/ens785f1/gro_flush_timeout Note that the timeout should be larger than the userspace processing window, otherwise the watchdog will timeout and fall back to regular softirq processing. Enable the SO_BUSY_POLL/SO_PREFER_BUSY_POLL options on your socket. Performance simple UDP ping-pong: A packet generator blasts UDP packets from a packet generator to a certain {src,dst}IP/port, so a dedicated ksoftirq will be busy handling the packets at a certain core. A simple UDP test program that simply does recvfrom/sendto is running at the host end. Throughput in pps and RTT latency is measured at the packet generator. /proc/sys/net/core/busy_read is set (20). Min Max Avg (usec) 1. Blocking 2-cores: 490Kpps 1218.192 1335.427 1271.083 2. Blocking, 1-core: 155Kpps 1327.195 17294.855 4761.367 3. Non-blocking, 2-cores: 475Kpps 1221.197 1330.465 1270.740 4. Non-blocking, 1-core: 3Kpps 29006.482 37260.465 33128.367 5. Non-blocking, prefer busy-poll, 1-core: 420Kpps 1202.535 5494.052 4885.443 Scenario 2 and 5 shows when the new option should be used. Throughput go from 155 to 420Kpps, average latency are similar, but the tail latencies are much better for the latter. Performance XDP sockets: Again, a packet generator blasts UDP packets from a packet generator to a certain {src,dst}IP/port. Today, running XDP sockets sample on the same core as the softirq handling, performance tanks mainly because we do not yield to user-space when the XDP socket Rx queue is full. # taskset -c 5 ./xdpsock -i ens785f1 -q 5 -n 1 -r Rx: 64Kpps # # preferred busy-polling, budget 8 # taskset -c 5 ./xdpsock -i ens785f1 -q 5 -n 1 -r -B -b 8 Rx 9.9Mpps # # preferred busy-polling, budget 64 # taskset -c 5 ./xdpsock -i ens785f1 -q 5 -n 1 -r -B -b 64 Rx: 19.3Mpps # # preferred busy-polling, budget 256 # taskset -c 5 ./xdpsock -i ens785f1 -q 5 -n 1 -r -B -b 256 Rx: 21.4Mpps # # preferred busy-polling, budget 512 # taskset -c 5 ./xdpsock -i ens785f1 -q 5 -n 1 -r -B -b 512 Rx: 21.7Mpps Compared to the two-core case: # taskset -c 4 ./xdpsock -i ens785f1 -q 20 -n 1 -r Rx: 20.7Mpps We're getting better single-core performance than two, for this naïve drop scenario. Performance netperf UDP_RR: Note that netperf UDP_RR is not a heavy traffic tests, and preferred busy-polling is not typically something we want to use here. $ echo 20 | sudo tee /proc/sys/net/core/busy_read $ netperf -H 192.168.1.1 -l 30 -t UDP_RR -v 2 -- \ -o min_latency,mean_latency,max_latency,stddev_latency,transaction_rate busy-polling blocking sockets: 12,13.33,224,0.63,74731.177 I hacked netperf to use non-blocking sockets and re-ran: busy-polling non-blocking sockets: 12,13.46,218,0.72,73991.172 prefer busy-polling non-blocking sockets: 12,13.62,221,0.59,73138.448 Using the preferred busy-polling mode does not impact performance. The above tests was done for the 'ice' driver. Thanks to Jakub for suggesting this busy-polling addition [1], and Eric for all input/review! Changes: rfc-v1 [2] -> rfc-v2: * Changed name from bias to prefer. * Base the work on Eric's/Luigi's defer irq/gro timeout work. * Proper GRO flushing. * Build issues for some XDP drivers. rfc-v2 [3] -> v1: * Fixed broken qlogic build. * Do not trigger an IPI (XDP socket wakeup) when busy-polling is enabled. v1 [4] -> v2: * Added napi_id to socionext driver, and added Ilias Acked-by:. (Ilias) * Added a samples patch to improve busy-polling for xdpsock/l2fwd. * Correctly mark atomic operations with {WRITE,READ}_ONCE, to make KCSAN and the code readers happy. (Eric) * Check NAPI budget not to exceed U16_MAX. (Eric) * Added kdoc. v2 [5] -> v3: * Collected Acked-by. * Check NAPI disable prior prefer busy-polling. (Jakub) * Added napi_id registration for virtio-net. (Michael) * Added napi_id registration for veth. v3 [6] -> v4: * Collected Acked-by/Reviewed-by. [1] https://lore.kernel.org/netdev/20200925120652.10b8d7c5@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/ [2] https://lore.kernel.org/bpf/20201028133437.212503-1-bjorn.topel@gmail.com/ [3] https://lore.kernel.org/bpf/20201105102812.152836-1-bjorn.topel@gmail.com/ [4] https://lore.kernel.org/bpf/20201112114041.131998-1-bjorn.topel@gmail.com/ [5] https://lore.kernel.org/bpf/20201116110416.10719-1-bjorn.topel@gmail.com/ [6] https://lore.kernel.org/bpf/20201119083024.119566-1-bjorn.topel@gmail.com/ ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2020-12-01samples/bpf: Add option to set the busy-poll budgetBjörn Töpel
Support for the SO_BUSY_POLL_BUDGET setsockopt, via the batching option ('b'). Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20201130185205.196029-11-bjorn.topel@gmail.com
2020-12-01samples/bpf: Add busy-poll support to xdpsockBjörn Töpel
Add a new option to xdpsock, 'B', for busy-polling. This option will also set the batching size, 'b' option, to the busy-poll budget. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20201130185205.196029-10-bjorn.topel@gmail.com
2020-12-01samples/bpf: Use recvfrom() in xdpsock/l2fwdBjörn Töpel
Start using recvfrom() the l2fwd scenario, instead of poll() which is more expensive and need additional knobs for busy-polling. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20201130185205.196029-9-bjorn.topel@gmail.com
2020-12-01samples/bpf: Use recvfrom() in xdpsock/rxdropBjörn Töpel
Start using recvfrom() the rxdrop scenario. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20201130185205.196029-8-bjorn.topel@gmail.com
2020-12-01xsk: Propagate napi_id to XDP socket Rx pathBjörn Töpel
Add napi_id to the xdp_rxq_info structure, and make sure the XDP socket pick up the napi_id in the Rx path. The napi_id is used to find the corresponding NAPI structure for socket busy polling. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/bpf/20201130185205.196029-7-bjorn.topel@gmail.com
2020-12-01xsk: Add busy-poll support for {recv,send}msg()Björn Töpel
Wire-up XDP socket busy-poll support for recvmsg() and sendmsg(). If the XDP socket prefers busy-polling, make sure that no wakeup/IPI is performed. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20201130185205.196029-6-bjorn.topel@gmail.com
2020-12-01xsk: Check need wakeup flag in sendmsg()Björn Töpel
Add a check for need wake up in sendmsg(), so that if a user calls sendmsg() when no wakeup is needed, do not trigger a wakeup. To simplify the need wakeup check in the syscall, unconditionally enable the need wakeup flag for Tx. This has a side-effect for poll(); If poll() is called for a socket without enabled need wakeup, a Tx wakeup is unconditionally performed. The wakeup matrix for AF_XDP now looks like: need wakeup | poll() | sendmsg() | recvmsg() ------------+--------------+-------------+------------ disabled | wake Tx | wake Tx | nop enabled | check flag; | check flag; | check flag; | wake Tx/Rx | wake Tx | wake Rx Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20201130185205.196029-5-bjorn.topel@gmail.com
2020-12-01xsk: Add support for recvmsg()Björn Töpel
Add support for non-blocking recvmsg() to XDP sockets. Previously, only sendmsg() was supported by XDP socket. Now, for symmetry and the upcoming busy-polling support, recvmsg() is added. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20201130185205.196029-4-bjorn.topel@gmail.com
2020-12-01net: Add SO_BUSY_POLL_BUDGET socket optionBjörn Töpel
This option lets a user set a per socket NAPI budget for busy-polling. If the options is not set, it will use the default of 8. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/bpf/20201130185205.196029-3-bjorn.topel@gmail.com
2020-12-01net: Introduce preferred busy-pollingBjörn Töpel
The existing busy-polling mode, enabled by the SO_BUSY_POLL socket option or system-wide using the /proc/sys/net/core/busy_read knob, is an opportunistic. That means that if the NAPI context is not scheduled, it will poll it. If, after busy-polling, the budget is exceeded the busy-polling logic will schedule the NAPI onto the regular softirq handling. One implication of the behavior above is that a busy/heavy loaded NAPI context will never enter/allow for busy-polling. Some applications prefer that most NAPI processing would be done by busy-polling. This series adds a new socket option, SO_PREFER_BUSY_POLL, that works in concert with the napi_defer_hard_irqs and gro_flush_timeout knobs. The napi_defer_hard_irqs and gro_flush_timeout knobs were introduced in commit 6f8b12d661d0 ("net: napi: add hard irqs deferral feature"), and allows for a user to defer interrupts to be enabled and instead schedule the NAPI context from a watchdog timer. When a user enables the SO_PREFER_BUSY_POLL, again with the other knobs enabled, and the NAPI context is being processed by a softirq, the softirq NAPI processing will exit early to allow the busy-polling to be performed. If the application stops performing busy-polling via a system call, the watchdog timer defined by gro_flush_timeout will timeout, and regular softirq handling will resume. In summary; Heavy traffic applications that prefer busy-polling over softirq processing should use this option. Example usage: $ echo 2 | sudo tee /sys/class/net/ens785f1/napi_defer_hard_irqs $ echo 200000 | sudo tee /sys/class/net/ens785f1/gro_flush_timeout Note that the timeout should be larger than the userspace processing window, otherwise the watchdog will timeout and fall back to regular softirq processing. Enable the SO_BUSY_POLL/SO_PREFER_BUSY_POLL options on your socket. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/bpf/20201130185205.196029-2-bjorn.topel@gmail.com