aboutsummaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2018-02-07x86/retpoline: Avoid retpolines for built-in __init functionsDavid Woodhouse
commit 66f793099a636862a71c59d4a6ba91387b155e0c There's no point in building init code with retpolines, since it runs before any potentially hostile userspace does. And before the retpoline is actually ALTERNATIVEd into place, for much of it. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: karahmed@amazon.de Cc: peterz@infradead.org Cc: bp@alien8.de Link: https://lkml.kernel.org/r/1517484441-1420-2-git-send-email-dwmw@amazon.co.uk Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-07vfs, fdtable: Prevent bounds-check bypass via speculative executionDan Williams
commit 56c30ba7b348b90484969054d561f711ba196507 'fd' is a user controlled value that is used as a data dependency to read from the 'fdt->fd' array. In order to avoid potential leaks of kernel memory values, block speculative execution of the instruction stream that could issue reads based on an invalid 'file *' returned from __fcheck_files. Co-developed-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-arch@vger.kernel.org Cc: kernel-hardening@lists.openwall.com Cc: gregkh@linuxfoundation.org Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: torvalds@linux-foundation.org Cc: alan@linux.intel.com Link: https://lkml.kernel.org/r/151727418500.33451.17392199002892248656.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-07array_index_nospec: Sanitize speculative array de-referencesDan Williams
commit f3804203306e098dae9ca51540fcd5eb700d7f40 array_index_nospec() is proposed as a generic mechanism to mitigate against Spectre-variant-1 attacks, i.e. an attack that bypasses boundary checks via speculative execution. The array_index_nospec() implementation is expected to be safe for current generation CPUs across multiple architectures (ARM, x86). Based on an original implementation by Linus Torvalds, tweaked to remove speculative flows by Alexei Starovoitov, and tweaked again by Linus to introduce an x86 assembly implementation for the mask generation. Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org> Co-developed-by: Alexei Starovoitov <ast@kernel.org> Suggested-by: Cyril Novikov <cnovikov@lynx.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-arch@vger.kernel.org Cc: kernel-hardening@lists.openwall.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Russell King <linux@armlinux.org.uk> Cc: gregkh@linuxfoundation.org Cc: torvalds@linux-foundation.org Cc: alan@linux.intel.com Link: https://lkml.kernel.org/r/151727414229.33451.18411580953862676575.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-07module/retpoline: Warn about missing retpoline in moduleAndi Kleen
commit caf7501a1b4ec964190f31f9c3f163de252273b8 There's a risk that a kernel which has full retpoline mitigations becomes vulnerable when a module gets loaded that hasn't been compiled with the right compiler or the right option. To enable detection of that mismatch at module load time, add a module info string "retpoline" at build time when the module was compiled with retpoline support. This only covers compiled C source, but assembler source or prebuilt object files are not checked. If a retpoline enabled kernel detects a non retpoline protected module at load time, print a warning and report it in the sysfs vulnerability file. [ tglx: Massaged changelog ] Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: David Woodhouse <dwmw2@infradead.org> Cc: gregkh@linuxfoundation.org Cc: torvalds@linux-foundation.org Cc: jeyu@kernel.org Cc: arjan@linux.intel.com Link: https://lkml.kernel.org/r/20180125235028.31211-1-andi@firstfloor.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-03tty: fix data race between tty_init_dev and flush of bufGaurav Kohli
commit b027e2298bd588d6fa36ed2eda97447fb3eac078 upstream. There can be a race, if receive_buf call comes before tty initialization completes in n_tty_open and tty->disc_data may be NULL. CPU0 CPU1 ---- ---- 000|n_tty_receive_buf_common() n_tty_open() -001|n_tty_receive_buf2() tty_ldisc_open.isra.3() -002|tty_ldisc_receive_buf(inline) tty_ldisc_setup() Using ldisc semaphore lock in tty_init_dev till disc_data initializes completely. Signed-off-by: Gaurav Kohli <gkohli@codeaurora.org> Reviewed-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-28Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Thomas Gleixner: "A single bug fix to prevent a subtle deadlock in the scheduler core code vs cpu hotplug" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/core: Fix cpu.max vs. cpuhotplug deadlock
2018-01-24Merge tag 'trace-v4.15-rc9' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "With the new ORC unwinder, ftrace stack tracing became disfunctional. One was that ORC didn't know how to handle the ftrace callbacks in general (which Josh fixed). The other was that ORC would just bail if it hit a dynamically allocated trampoline. Which means all ftrace stack tracing that happens from the function tracer would produce no results (that includes killing the max stack size tracer). I added a check to the ORC unwinder to see if the trampoline belonged to ftrace, and if it did, use the orc entry of the static trampoline that was used to create the dynamic one (it would be identical). Finally, I noticed that the skip values of the stack tracing were out of whack. I went through and fixed them up" * tag 'trace-v4.15-rc9' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Update stack trace skipping for ORC unwinder ftrace, orc, x86: Handle ftrace dynamically allocated trampolines x86/ftrace: Fix ORC unwinding from ftrace handlers
2018-01-24Revert "module: Add retpoline tag to VERMAGIC"Greg Kroah-Hartman
This reverts commit 6cfb521ac0d5b97470883ff9b7facae264b7ab12. Turns out distros do not want to make retpoline as part of their "ABI", so this patch should not have been merged. Sorry Andi, this was my fault, I suggested it when your original patch was the "correct" way of doing this instead. Reported-by: Jiri Kosina <jikos@kernel.org> Fixes: 6cfb521ac0d5 ("module: Add retpoline tag to VERMAGIC") Acked-by: Andi Kleen <ak@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: rusty@rustcorp.com.au Cc: arjan.van.de.ven@intel.com Cc: jeyu@kernel.org Cc: stable <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-24sched/core: Fix cpu.max vs. cpuhotplug deadlockPeter Zijlstra
Tejun reported the following cpu-hotplug lock (percpu-rwsem) read recursion: tg_set_cfs_bandwidth() get_online_cpus() cpus_read_lock() cfs_bandwidth_usage_inc() static_key_slow_inc() cpus_read_lock() Reported-by: Tejun Heo <tj@kernel.org> Tested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180122215328.GP3397@worktop Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-23ftrace, orc, x86: Handle ftrace dynamically allocated trampolinesSteven Rostedt (VMware)
The function tracer can create a dynamically allocated trampoline that is called by the function mcount or fentry hook that is used to call the function callback that is registered. The problem is that the orc undwinder will bail if it encounters one of these trampolines. This breaks the stack trace of function callbacks, which include the stack tracer and setting the stack trace for individual functions. Since these dynamic trampolines are basically copies of the static ftrace trampolines defined in ftrace_*.S, we do not need to create new orc entries for the dynamic trampolines. Finding the return address on the stack will be identical as the functions that were copied to create the dynamic trampolines. When encountering a ftrace dynamic trampoline, we can just use the orc entry of the ftrace static function that was copied for that trampoline. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-01-21mm, page_vma_mapped: Drop faulty pointer arithmetics in check_pte()Kirill A. Shutemov
Tetsuo reported random crashes under memory pressure on 32-bit x86 system and tracked down to change that introduced page_vma_mapped_walk(). The root cause of the issue is the faulty pointer math in check_pte(). As ->pte may point to an arbitrary page we have to check that they are belong to the section before doing math. Otherwise it may lead to weird results. It wasn't noticed until now as mem_map[] is virtually contiguous on flatmem or vmemmap sparsemem. Pointer arithmetic just works against all 'struct page' pointers. But with classic sparsemem, it doesn't because each section memap is allocated separately and so consecutive pfns crossing two sections might have struct pages at completely unrelated addresses. Let's restructure code a bit and replace pointer arithmetic with operations on pfns. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-and-tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()") Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-19sparse doesn't support struct randomizationMatthew Wilcox
Without this patch, I drown in a sea of unknown attribute warnings Link: http://lkml.kernel.org/r/20180117024539.27354-1-willy@infradead.org Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-17Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "A delayacct statistics correctness fix" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: delayacct: Account blkio completion on the correct task
2018-01-17Merge branch 'x86-pti-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 pti bits and fixes from Thomas Gleixner: "This last update contains: - An objtool fix to prevent a segfault with the gold linker by changing the invocation order. That's not just for gold, it's a general robustness improvement. - An improved error message for objtool which spares tearing hairs. - Make KASAN fail loudly if there is not enough memory instead of oopsing at some random place later - RSB fill on context switch to prevent RSB underflow and speculation through other units. - Make the retpoline/RSB functionality work reliably for both Intel and AMD - Add retpoline to the module version magic so mismatch can be detected - A small (non-fix) update for cpufeatures which prevents cpu feature clashing for the upcoming extra mitigation bits to ease backporting" * 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: module: Add retpoline tag to VERMAGIC x86/cpufeature: Move processor tracing out of scattered features objtool: Improve error message for bad file argument objtool: Fix seg fault with gold linker x86/retpoline: Add LFENCE to the retpoline/RSB filling RSB macros x86/retpoline: Fill RSB on context switch for affected CPUs x86/kasan: Panic if there is not enough memory to boot
2018-01-17module: Add retpoline tag to VERMAGICAndi Kleen
Add a marker for retpoline to the module VERMAGIC. This catches the case when a non RETPOLINE compiled module gets loaded into a retpoline kernel, making it insecure. It doesn't handle the case when retpoline has been runtime disabled. Even in this case the match of the retcompile status will be enforced. This implies that even with retpoline run time disabled all modules loaded need to be recompiled. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: David Woodhouse <dwmw@amazon.co.uk> Cc: rusty@rustcorp.com.au Cc: arjan.van.de.ven@intel.com Cc: jeyu@kernel.org Cc: torvalds@linux-foundation.org Link: https://lkml.kernel.org/r/20180116205228.4890-1-andi@firstfloor.org
2018-01-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Two read past end of buffer fixes in AF_KEY, from Eric Biggers. 2) Memory leak in key_notify_policy(), from Steffen Klassert. 3) Fix overflow with bpf arrays, from Daniel Borkmann. 4) Fix RDMA regression with mlx5 due to mlx5 no longer using pci_irq_get_affinity(), from Saeed Mahameed. 5) Missing RCU read locking in nl80211_send_iface() when it calls ieee80211_bss_get_ie(), from Dominik Brodowski. 6) cfg80211 should check dev_set_name()'s return value, from Johannes Berg. 7) Missing module license tag in 9p protocol, from Stephen Hemminger. 8) Fix crash due to too small MTU in udp ipv6 sendmsg, from Mike Maloney. 9) Fix endless loop in netlink extack code, from David Ahern. 10) TLS socket layer sets inverted error codes, resulting in an endless loop. From Robert Hering. 11) Revert openvswitch erspan tunnel support, it's mis-designed and we need to kill it before it goes into a real release. From William Tu. 12) Fix lan78xx failures in full speed USB mode, from Yuiko Oshino. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (54 commits) net, sched: fix panic when updating miniq {b,q}stats qed: Fix potential use-after-free in qed_spq_post() nfp: use the correct index for link speed table lan78xx: Fix failure in USB Full Speed sctp: do not allow the v4 socket to bind a v4mapped v6 address sctp: return error if the asoc has been peeled off in sctp_wait_for_sndbuf sctp: reinit stream if stream outcnt has been change by sinit in sendmsg ibmvnic: Fix pending MAC address changes netlink: extack: avoid parenthesized string constant warning ipv4: Make neigh lookup keys for loopback/point-to-point devices be INADDR_ANY net: Allow neigh contructor functions ability to modify the primary_key sh_eth: fix dumping ARSTR Revert "openvswitch: Add erspan tunnel support." net/tls: Fix inverted error codes to avoid endless loop ipv6: ip6_make_skb() needs to clear cork.base.dst sctp: avoid compiler warning on implicit fallthru net: ipv4: Make "ip route get" match iif lo rules again. netlink: extack needs to be reset each time through loop tipc: fix a memory leak in tipc_nl_node_get_link() ipv6: fix udpv6 sendmsg crash caused by too small MTU ...
2018-01-16delayacct: Account blkio completion on the correct taskJosh Snyder
Before commit: e33a9bba85a8 ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler") delayacct_blkio_end() was called after context-switching into the task which completed I/O. This resulted in double counting: the task would account a delay both waiting for I/O and for time spent in the runqueue. With e33a9bba85a8, delayacct_blkio_end() is called by try_to_wake_up(). In ttwu, we have not yet context-switched. This is more correct, in that the delay accounting ends when the I/O is complete. But delayacct_blkio_end() relies on 'get_current()', and we have not yet context-switched into the task whose I/O completed. This results in the wrong task having its delay accounting statistics updated. Instead of doing that, pass the task_struct being woken to delayacct_blkio_end(), so that it can update the statistics of the correct task. Signed-off-by: Josh Snyder <joshs@netflix.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Balbir Singh <bsingharora@gmail.com> Cc: <stable@vger.kernel.org> Cc: Brendan Gregg <bgregg@netflix.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-block@vger.kernel.org Fixes: e33a9bba85a8 ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler") Link: http://lkml.kernel.org/r/1513613712-571-1-git-send-email-joshs@netflix.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-15netlink: extack: avoid parenthesized string constant warningJohannes Berg
NL_SET_ERR_MSG() and NL_SET_ERR_MSG_ATTR() lead to the following warning in newer versions of gcc: warning: array initialized from parenthesized string constant Just remove the parentheses, they're not needed in this context since anyway since there can be no operator precendence issues or similar. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-15ptr_ring: document usage around __ptr_ring_peekMichael S. Tsirkin
This explains why is the net usage of __ptr_ring_peek actually ok without locks. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-14Merge branch 'x86-pti-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 pti updates from Thomas Gleixner: "This contains: - a PTI bugfix to avoid setting reserved CR3 bits when PCID is disabled. This seems to cause issues on a virtual machine at least and is incorrect according to the AMD manual. - a PTI bugfix which disables the perf BTS facility if PTI is enabled. The BTS AUX buffer is not globally visible and causes the CPU to fault when the mapping disappears on switching CR3 to user space. A full fix which restores BTS on PTI is non trivial and will be worked on. - PTI bugfixes for EFI and trusted boot which make sure that the user space visible page table entries have the NX bit cleared - removal of dead code in the PTI pagetable setup functions - add PTI documentation - add a selftest for vsyscall to verify that the kernel actually implements what it advertises. - a sysfs interface to expose vulnerability and mitigation information so there is a coherent way for users to retrieve the status. - the initial spectre_v2 mitigations, aka retpoline: + The necessary ASM thunk and compiler support + The ASM variants of retpoline and the conversion of affected ASM code + Make LFENCE serializing on AMD so it can be used as speculation trap + The RSB fill after vmexit - initial objtool support for retpoline As I said in the status mail this is the most of the set of patches which should go into 4.15 except two straight forward patches still on hold: - the retpoline add on of LFENCE which waits for ACKs - the RSB fill after context switch Both should be ready to go early next week and with that we'll have covered the major holes of spectre_v2 and go back to normality" * 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (28 commits) x86,perf: Disable intel_bts when PTI security/Kconfig: Correct the Documentation reference for PTI x86/pti: Fix !PCID and sanitize defines selftests/x86: Add test_vsyscall x86/retpoline: Fill return stack buffer on vmexit x86/retpoline/irq32: Convert assembler indirect jumps x86/retpoline/checksum32: Convert assembler indirect jumps x86/retpoline/xen: Convert Xen hypercall indirect jumps x86/retpoline/hyperv: Convert assembler indirect jumps x86/retpoline/ftrace: Convert ftrace assembler indirect jumps x86/retpoline/entry: Convert entry assembler indirect jumps x86/retpoline/crypto: Convert crypto assembler indirect jumps x86/spectre: Add boot time option to select Spectre v2 mitigation x86/retpoline: Add initial retpoline support objtool: Allow alternatives to be ignored objtool: Detect jumps to retpoline thunks x86/pti: Make unpoison of pgd for trusted boot work for real x86/alternatives: Fix optimize_nops() checking sysfs/cpu: Fix typos in vulnerability documentation x86/cpu/AMD: Use LFENCE_RDTSC in preference to MFENCE_RDTSC ...
2018-01-13Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixlets from Andrew Morton: "4 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: tools/objtool/Makefile: don't assume sync-check.sh is executable kdump: write correct address of mem_section into vmcoreinfo kmemleak: allow to coexist with fault injection MAINTAINERS, nilfs2: change project home URLs
2018-01-13kdump: write correct address of mem_section into vmcoreinfoKirill A. Shutemov
Depending on configuration mem_section can now be an array or a pointer to an array allocated dynamically. In most cases, we can continue to refer to it as 'mem_section' regardless of what it is. But there's one exception: '&mem_section' means "address of the array" if mem_section is an array, but if mem_section is a pointer, it would mean "address of the pointer". We've stepped onto this in kdump code. VMCOREINFO_SYMBOL(mem_section) writes down address of pointer into vmcoreinfo, not array as we wanted. Let's introduce VMCOREINFO_SYMBOL_ARRAY() that would handle the situation correctly for both cases. Link: http://lkml.kernel.org/r/20180112162532.35896-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Fixes: 83e3c48729d9 ("mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y") Acked-by: Baoquan He <bhe@redhat.com> Acked-by: Dave Young <dyoung@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Dave Young <dyoung@redhat.com> Cc: Baoquan He <bhe@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-12Merge branch 'locking-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixes from Ingo Molnar: "No functional effects intended: removes leftovers from recent lockdep and refcounts work" * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/refcounts: Remove stale comment from the ARCH_HAS_REFCOUNT Kconfig entry locking/lockdep: Remove cross-release leftovers locking/Documentation: Remove stale crossrelease_fullstack parameter
2018-01-12net/mlx5: Fix get vector affinity helper functionSaeed Mahameed
mlx5_get_vector_affinity used to call pci_irq_get_affinity and after reverting the patch that sets the device affinity via PCI_IRQ_AFFINITY API, calling pci_irq_get_affinity becomes useless and it breaks RDMA mlx5 users. To fix this, this patch provides an alternative way to retrieve IRQ vector affinity using legacy IRQ API, following smp_affinity read procfs implementation. Fixes: 231243c82793 ("Revert mlx5: move affinity hints assignments to generic code") Fixes: a435393acafb ("mlx5: move affinity hints assignments to generic code") Cc: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-01-12{net,ib}/mlx5: Don't disable local loopback multicast traffic when neededEran Ben Elisha
There are systems platform information management interfaces (such as HOST2BMC) for which we cannot disable local loopback multicast traffic. Separate disable_local_lb_mc and disable_local_lb_uc capability bits so driver will not disable multicast loopback traffic if not supported. (It is expected that Firmware will not set disable_local_lb_mc if HOST2BMC is running for example.) Function mlx5_nic_vport_update_local_lb will do best effort to disable/enable UC/MC loopback traffic and return success only in case it succeeded to changed all allowed by Firmware. Adapt mlx5_ib and mlx5e to support the new cap bits. Fixes: 2c43c5a036be ("net/mlx5e: Enable local loopback in loopback selftest") Fixes: c85023e153e3 ("IB/mlx5: Add raw ethernet local loopback support") Fixes: bded747bb432 ("net/mlx5: Add raw ethernet local loopback firmware command") Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Cc: kernel-team@fb.com Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-01-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2018-01-09 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Prevent out-of-bounds speculation in BPF maps by masking the index after bounds checks in order to fix spectre v1, and add an option BPF_JIT_ALWAYS_ON into Kconfig that allows for removing the BPF interpreter from the kernel in favor of JIT-only mode to make spectre v2 harder, from Alexei. 2) Remove false sharing of map refcount with max_entries which was used in spectre v1, from Daniel. 3) Add a missing NULL psock check in sockmap in order to fix a race, from John. 4) Fix test_align BPF selftest case since a recent change in verifier rejects the bit-wise arithmetic on pointers earlier but test_align update was missing, from Alexei. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-09bpf: avoid false sharing of map refcount with max_entriesDaniel Borkmann
In addition to commit b2157399cc98 ("bpf: prevent out-of-bounds speculation") also change the layout of struct bpf_map such that false sharing of fast-path members like max_entries is avoided when the maps reference counter is altered. Therefore enforce them to be placed into separate cachelines. pahole dump after change: struct bpf_map { const struct bpf_map_ops * ops; /* 0 8 */ struct bpf_map * inner_map_meta; /* 8 8 */ void * security; /* 16 8 */ enum bpf_map_type map_type; /* 24 4 */ u32 key_size; /* 28 4 */ u32 value_size; /* 32 4 */ u32 max_entries; /* 36 4 */ u32 map_flags; /* 40 4 */ u32 pages; /* 44 4 */ u32 id; /* 48 4 */ int numa_node; /* 52 4 */ bool unpriv_array; /* 56 1 */ /* XXX 7 bytes hole, try to pack */ /* --- cacheline 1 boundary (64 bytes) --- */ struct user_struct * user; /* 64 8 */ atomic_t refcnt; /* 72 4 */ atomic_t usercnt; /* 76 4 */ struct work_struct work; /* 80 32 */ char name[16]; /* 112 16 */ /* --- cacheline 2 boundary (128 bytes) --- */ /* size: 128, cachelines: 2, members: 17 */ /* sum members: 121, holes: 1, sum holes: 7 */ }; Now all entries in the first cacheline are read only throughout the life time of the map, set up once during map creation. Overall struct size and number of cachelines doesn't change from the reordering. struct bpf_map is usually first member and embedded in map structs in specific map implementations, so also avoid those members to sit at the end where it could potentially share the cacheline with first map values e.g. in the array since remote CPUs could trigger map updates just as well for those (easily dirtying members like max_entries intentionally as well) while having subsequent values in cache. Quoting from Google's Project Zero blog [1]: Additionally, at least on the Intel machine on which this was tested, bouncing modified cache lines between cores is slow, apparently because the MESI protocol is used for cache coherence [8]. Changing the reference counter of an eBPF array on one physical CPU core causes the cache line containing the reference counter to be bounced over to that CPU core, making reads of the reference counter on all other CPU cores slow until the changed reference counter has been written back to memory. Because the length and the reference counter of an eBPF array are stored in the same cache line, this also means that changing the reference counter on one physical CPU core causes reads of the eBPF array's length to be slow on other physical CPU cores (intentional false sharing). While this doesn't 'control' the out-of-bounds speculation through masking the index as in commit b2157399cc98, triggering a manipulation of the map's reference counter is really trivial, so lets not allow to easily affect max_entries from it. Splitting to separate cachelines also generally makes sense from a performance perspective anyway in that fast-path won't have a cache miss if the map gets pinned, reused in other progs, etc out of control path, thus also avoids unintentional false sharing. [1] https://googleprojectzero.blogspot.ch/2018/01/reading-privileged-memory-with-side.html Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Frag and UDP handling fixes in i40e driver, from Amritha Nambiar and Alexander Duyck. 2) Undo unintentional UAPI change in netfilter conntrack, from Florian Westphal. 3) Revert a change to how error codes are returned from dev_get_valid_name(), it broke some apps. 4) Cannot cache routes for ipv6 tunnels in the tunnel is ipv4/ipv6 dual-stack. From Eli Cooper. 5) Fix missed PMTU updates in geneve, from Xin Long. 6) Cure double free in macvlan, from Gao Feng. 7) Fix heap out-of-bounds write in rds_message_alloc_sgs(), from Mohamed Ghannam. 8) FEC bug fixes from FUgang Duan (mis-accounting of dev_id, missed deferral of probe when the regulator is not ready yet). 9) Missing DMA mapping error checks in 3c59x, from Neil Horman. 10) Turn off Broadcom tags for some b53 switches, from Florian Fainelli. 11) Fix OOPS when get_target_net() is passed an SKB whose NETLINK_CB() isn't initialized. From Andrei Vagin. 12) Fix crashes in fib6_add(), from Wei Wang. 13) PMTU bug fixes in SCTP from Marcelo Ricardo Leitner. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (56 commits) sh_eth: fix TXALCR1 offsets mdio-sun4i: Fix a memory leak phylink: mark expected switch fall-throughs in phylink_mii_ioctl sctp: fix the handling of ICMP Frag Needed for too small MTUs sctp: do not retransmit upon FragNeeded if PMTU discovery is disabled xen-netfront: enable device after manual module load bnxt_en: Fix the 'Invalid VF' id check in bnxt_vf_ndo_prep routine. bnxt_en: Fix population of flow_type in bnxt_hwrm_cfa_flow_alloc() sh_eth: fix SH7757 GEther initialization net: fec: free/restore resource in related probe error pathes uapi/if_ether.h: prevent redefinition of struct ethhdr ipv6: fix general protection fault in fib6_add() RDS: null pointer dereference in rds_atomic_free_op sh_eth: fix TSU resource handling net: stmmac: enable EEE in MII, GMII or RGMII only rtnetlink: give a user socket to get_target_net() MAINTAINERS: Update my email address. can: ems_usb: improve error reporting for error warning and error passive can: flex_can: Correct the checking for frame length in flexcan_start_xmit() can: gs_usb: fix return value of the "set_bittiming" callback ...
2018-01-09bpf: prevent out-of-bounds speculationAlexei Starovoitov
Under speculation, CPUs may mis-predict branches in bounds checks. Thus, memory accesses under a bounds check may be speculated even if the bounds check fails, providing a primitive for building a side channel. To avoid leaking kernel data round up array-based maps and mask the index after bounds check, so speculated load with out of bounds index will load either valid value from the array or zero from the padded area. Unconditionally mask index for all array types even when max_entries are not rounded to power of 2 for root user. When map is created by unpriv user generate a sequence of bpf insns that includes AND operation to make sure that JITed code includes the same 'index & index_mask' operation. If prog_array map is created by unpriv user replace bpf_tail_call(ctx, map, index); with if (index >= max_entries) { index &= map->index_mask; bpf_tail_call(ctx, map, index); } (along with roundup to power 2) to prevent out-of-bounds speculation. There is secondary redundant 'if (index >= max_entries)' in the interpreter and in all JITs, but they can be optimized later if necessary. Other array-like maps (cpumap, devmap, sockmap, perf_event_array, cgroup_array) cannot be used by unpriv, so no changes there. That fixes bpf side of "Variant 1: bounds check bypass (CVE-2017-5753)" on all architectures with and without JIT. v2->v3: Daniel noticed that attack potentially can be crafted via syscall commands without loading the program, so add masking to those paths as well. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-08locking/lockdep: Remove cross-release leftoversIngo Molnar
There's two cross-release leftover facilities: - the crossrelease_hist_*() irq-tracing callbacks (NOPs currently) - the complete_release_commit() callback (NOP as well) Remove them. Cc: David Sterba <dsterba@suse.com> Cc: Byungchul Park <byungchul.park@lge.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-08sysfs/cpu: Add vulnerability folderThomas Gleixner
As the meltdown/spectre problem affects several CPU architectures, it makes sense to have common way to express whether a system is affected by a particular vulnerability or not. If affected the way to express the mitigation should be common as well. Create /sys/devices/system/cpu/vulnerabilities folder and files for meltdown, spectre_v1 and spectre_v2. Allow architectures to override the show function. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linuxfoundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: David Woodhouse <dwmw@amazon.co.uk> Link: https://lkml.kernel.org/r/20180107214913.096657732@linutronix.de
2018-01-06Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: - untangle sys_close() abuses in xt_bpf - deal with register_shrinker() failures in sget() * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fix "netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'" sget(): handle failures of register_shrinker() mm,vmscan: Make unregister_shrinker() no-op if register_shrinker() failed.
2018-01-05Merge branch 'efi-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull EFI updates from Thomas Gleixner: - A fix for a add_efi_memmap parameter regression which ensures that the parameter is parsed before it is used. - Reinstate the virtual capsule mapping as the cached copy turned out to break Quark and other things - Remove Matt Fleming as EFI co-maintainer. He stepped back a few days ago. Thanks Matt for all your great work! * 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: MAINTAINERS: Remove Matt Fleming as EFI co-maintainer efi/capsule-loader: Reinstate virtual capsule mapping x86/efi: Fix kernel param add_efi_memmap regression
2018-01-05sh_eth: fix SH7757 GEther initializationSergei Shtylyov
Renesas SH7757 has 2 Fast and 2 Gigabit Ether controllers, while the 'sh_eth' driver can only reset and initialize TSU of the first controller pair. Shimoda-san tried to solve that adding the 'needs_init' member to the 'struct sh_eth_plat_data', however the platform code still never sets this flag. I think that we can infer this information from the 'devno' variable (set to 'platform_device::id') and reset/init the Ether controller pair only for an even 'devno'; therefore 'sh_eth_plat_data::needs_init' can be removed... Fixes: 150647fb2c31 ("net: sh_eth: change the condition of initialization") Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-05fix "netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'"Al Viro
Descriptor table is a shared object; it's not a place where you can stick temporary references to files, especially when we don't need an opened file at all. Cc: stable@vger.kernel.org # v4.14 Fixes: 98589a0998b8 ("netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-03efi/capsule-loader: Reinstate virtual capsule mappingArd Biesheuvel
Commit: 82c3768b8d68 ("efi/capsule-loader: Use a cached copy of the capsule header") ... refactored the capsule loading code that maps the capsule header, to avoid having to map it several times. However, as it turns out, the vmap() call we ended up removing did not just map the header, but the entire capsule image, and dropping this virtual mapping breaks capsules that are processed by the firmware immediately (i.e., without a reboot). Unfortunately, that change was part of a larger refactor that allowed a quirk to be implemented for Quark, which has a non-standard memory layout for capsules, and we have slightly painted ourselves into a corner by allowing quirk code to mangle the capsule header and memory layout. So we need to fix this without breaking Quark. Fortunately, Quark does not appear to care about the virtual mapping, and so we can simply do a partial revert of commit: 2a457fb31df6 ("efi/capsule-loader: Use page addresses rather than struct page pointers") ... and create a vmap() mapping of the entire capsule (including header) based on the reinstated struct page array, unless running on Quark, in which case we pass the capsule header copy as before. Reported-by: Ge Song <ge.song@hxt-semitech.com> Tested-by: Bryan O'Donoghue <pure.logic@nexus-software.ie> Tested-by: Ge Song <ge.song@hxt-semitech.com> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: <stable@vger.kernel.org> Cc: Dave Young <dyoung@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-efi@vger.kernel.org Fixes: 82c3768b8d68 ("efi/capsule-loader: Use a cached copy of the capsule header") Link: http://lkml.kernel.org/r/20180102172110.17018-3-ard.biesheuvel@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-02fscache: Fix the default for fscache_maybe_release_page()David Howells
Fix the default for fscache_maybe_release_page() for when the cookie isn't valid or the page isn't cached. It mustn't return false as that indicates the page cannot yet be freed. The problem with the default is that if, say, there's no cache, but a network filesystem's pages are using up almost all the available memory, a system can OOM because the filesystem ->releasepage() op will not allow them to be released as fscache_maybe_release_page() incorrectly prevents it. This can be tested by writing a sequence of 512MiB files to an AFS mount. It does not affect NFS or CIFS because both of those wrap the call in a check of PG_fscache and it shouldn't bother Ceph as that only has PG_private set whilst writeback is in progress. This might be an issue for 9P, however. Note that the pages aren't entirely stuck. Removing a file or unmounting will clear things because that uses ->invalidatepage() instead. Fixes: 201a15428bd5 ("FS-Cache: Handle pages pending storage that get evicted under OOM conditions") Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Marc Dionne <marc.dionne@auristor.com> cc: stable@vger.kernel.org # 2.6.32+
2017-12-31Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Thomas Gleixner: "A pile of fixes for long standing issues with the timer wheel and the NOHZ code: - Prevent timer base confusion accross the nohz switch, which can cause unlocked access and data corruption - Reinitialize the stale base clock on cpu hotplug to prevent subtle side effects including rollovers on 32bit - Prevent an interrupt storm when the timer softirq is already pending caused by tick_nohz_stop_sched_tick() - Move the timer start tracepoint to a place where it actually makes sense - Add documentation to timerqueue functions as they caused confusion several times now" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timerqueue: Document return values of timerqueue_add/del() timers: Invoke timer_start_debug() where it makes sense nohz: Prevent a timer interrupt storm in tick_nohz_stop_sched_tick() timers: Reinitialize per cpu bases on hotplug timers: Use deferrable base independent of base::nohz_active
2017-12-31Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Thomas Gleixner: "A rather large update after the kaisered maintainer finally found time to handle regression reports. - The larger part addresses a regression caused by the x86 vector management rework. The reservation based model does not work reliably for MSI interrupts, if they cannot be masked (yes, yet another hw engineering trainwreck). The reason is that the reservation mode assigns a dummy vector when the interrupt is allocated and switches to a real vector when the interrupt is requested. If the MSI entry cannot be masked then the initialization might raise an interrupt before the interrupt is requested, which ends up as spurious interrupt and causes device malfunction and worse. The fix is to exclude MSI interrupts which do not support masking from reservation mode and assign a real vector right away. - Extend the extra lockdep class setup for nested interrupts with a class for the recently added irq_desc::request_mutex so lockdep can differeniate and does not emit false positive warnings. - A ratelimit guard for the bad irq printout so in case a bad irq comes back immediately the system does not drown in dmesg spam" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq/msi, x86/vector: Prevent reservation mode for non maskable MSI genirq/irqdomain: Rename early argument of irq_domain_activate_irq() x86/vector: Use IRQD_CAN_RESERVE flag genirq: Introduce IRQD_CAN_RESERVE flag genirq/msi: Handle reactivation only on success gpio: brcmstb: Make really use of the new lockdep class genirq: Guard handle_bad_irq log messages kernel/irq: Extend lockdep class for request mutex
2017-12-29Merge branch 'x86-pti-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 page table isolation updates from Thomas Gleixner: "This is the final set of enabling page table isolation on x86: - Infrastructure patches for handling the extra page tables. - Patches which map the various bits and pieces which are required to get in and out of user space into the user space visible page tables. - The required changes to have CR3 switching in the entry/exit code. - Optimizations for the CR3 switching along with documentation how the ASID/PCID mechanism works. - Updates to dump pagetables to cover the user space page tables for W+X scans and extra debugfs files to analyze both the kernel and the user space visible page tables The whole functionality is compile time controlled via a config switch and can be turned on/off on the command line as well" * 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits) x86/ldt: Make the LDT mapping RO x86/mm/dump_pagetables: Allow dumping current pagetables x86/mm/dump_pagetables: Check user space page table for WX pages x86/mm/dump_pagetables: Add page table directory to the debugfs VFS hierarchy x86/mm/pti: Add Kconfig x86/dumpstack: Indicate in Oops whether PTI is configured and enabled x86/mm: Clarify the whole ASID/kernel PCID/user PCID naming x86/mm: Use INVPCID for __native_flush_tlb_single() x86/mm: Optimize RESTORE_CR3 x86/mm: Use/Fix PCID to optimize user/kernel switches x86/mm: Abstract switching CR3 x86/mm: Allow flushing for future ASID switches x86/pti: Map the vsyscall page if needed x86/pti: Put the LDT in its own PGD if PTI is on x86/mm/64: Make a full PGD-entry size hole in the memory map x86/events/intel/ds: Map debug buffers in cpu_entry_area x86/cpu_entry_area: Add debugstore entries to cpu_entry_area x86/mm/pti: Map ESPFIX into user space x86/mm/pti: Share entry text PMD x86/entry: Align entry text section to PMD boundary ...
2017-12-29timers: Reinitialize per cpu bases on hotplugThomas Gleixner
The timer wheel bases are not (re)initialized on CPU hotplug. That leaves them with a potentially stale clk and next_expiry valuem, which can cause trouble then the CPU is plugged. Add a prepare callback which forwards the clock, sets next_expiry to far in the future and reset the control flags to a known state. Set base->must_forward_clk so the first timer which is queued will try to forward the clock to current jiffies. Fixes: 500462a9de65 ("timers: Switch to a non-cascading wheel") Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Sebastian Siewior <bigeasy@linutronix.de> Cc: Anna-Maria Gleixner <anna-maria@linutronix.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712272152200.2431@nanos
2017-12-29genirq/irqdomain: Rename early argument of irq_domain_activate_irq()Thomas Gleixner
The 'early' argument of irq_domain_activate_irq() is actually used to denote reservation mode. To avoid confusion, rename it before abuse happens. No functional change. Fixes: 72491643469a ("genirq/irqdomain: Update irq_domain_ops.activate() signature") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Alexandru Chirvasitu <achirvasub@gmail.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Dou Liyang <douly.fnst@cn.fujitsu.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Maciej W. Rozycki <macro@linux-mips.org> Cc: Mikael Pettersson <mikpelinux@gmail.com> Cc: Josh Poulson <jopoulso@microsoft.com> Cc: Mihai Costache <v-micos@microsoft.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: linux-pci@vger.kernel.org Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: Simon Xiao <sixiao@microsoft.com> Cc: Saeed Mahameed <saeedm@mellanox.com> Cc: Jork Loeser <Jork.Loeser@microsoft.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: devel@linuxdriverproject.org Cc: KY Srinivasan <kys@microsoft.com> Cc: Alan Cox <alan@linux.intel.com> Cc: Sakari Ailus <sakari.ailus@intel.com>, Cc: linux-media@vger.kernel.org
2017-12-29genirq: Introduce IRQD_CAN_RESERVE flagThomas Gleixner
Add a new flag to mark interrupts which can use reservation mode. This is going to be used in subsequent patches to disable reservation mode for a certain class of MSI devices. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Alexandru Chirvasitu <achirvasub@gmail.com> Tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Dou Liyang <douly.fnst@cn.fujitsu.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Maciej W. Rozycki <macro@linux-mips.org> Cc: Mikael Pettersson <mikpelinux@gmail.com> Cc: Josh Poulson <jopoulso@microsoft.com> Cc: Mihai Costache <v-micos@microsoft.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: linux-pci@vger.kernel.org Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: Simon Xiao <sixiao@microsoft.com> Cc: Saeed Mahameed <saeedm@mellanox.com> Cc: Jork Loeser <Jork.Loeser@microsoft.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: devel@linuxdriverproject.org Cc: KY Srinivasan <kys@microsoft.com> Cc: Alan Cox <alan@linux.intel.com> Cc: Sakari Ailus <sakari.ailus@intel.com>, Cc: linux-media@vger.kernel.org
2017-12-29Merge tag 'pm-4.15-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fix from Rafael Wysocki: "This fixes a schedutil cpufreq governor regression from the 4.14 cycle that may cause a CPU idleness check to return incorrect results in some cases which leads to suboptimal decisions (Joel Fernandes)" * tag 'pm-4.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: cpufreq: schedutil: Use idle_calls counter of the remote CPU
2017-12-28Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds
Pull rdma fixes from Jason Gunthorpe: "This is the next batch of for-rc patches from RDMA. It includes the fix for the ipoib regression I mentioned last time, and the result of a fairly major debugging effort to get iser working reliably on cxgb4 hardware - it turns out the cxgb4 driver was not handling QP error flushing properly causing iser to fail. - cxgb4 fix for an iser testing failure as debugged by Steve and Sagi. The problem was a driver bug in the handling of shutting down a QP. - Various vmw_pvrdma fixes for bogus WARN_ON, missed resource free on error unwind and a use after free bug - Improper congestion counter values on mlx5 when link aggregation is enabled - ipoib lockdep regression introduced in this merge window - hfi1 regression supporting the device in a VM introduced in a recent patch - Typo that breaks future uAPI compatibility in the verbs core - More SELinux related oops fixing - Fix an oops during error unwind in mlx5" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: IB/mlx5: Fix mlx5_ib_alloc_mr error flow IB/core: Verify that QP is security enabled in create and destroy IB/uverbs: Fix command checking as part of ib_uverbs_ex_modify_qp() IB/mlx5: Serialize access to the VMA list IB/hfi: Only read capability registers if the capability exists IB/ipoib: Fix lockdep issue found on ipoib_ib_dev_heavy_flush IB/mlx5: Fix congestion counters in LAG mode RDMA/vmw_pvrdma: Avoid use after free due to QP/CQ/SRQ destroy RDMA/vmw_pvrdma: Use refcount_dec_and_test to avoid warning RDMA/vmw_pvrdma: Call ib_umem_release on destroy QP path iw_cxgb4: when flushing, complete all wrs in a chain iw_cxgb4: reflect the original WR opcode in drain cqes iw_cxgb4: Only validate the MSN for successful completions
2017-12-28cpufreq: schedutil: Use idle_calls counter of the remote CPUJoel Fernandes
Since the recent remote cpufreq callback work, its possible that a cpufreq update is triggered from a remote CPU. For single policies however, the current code uses the local CPU when trying to determine if the remote sg_cpu entered idle or is busy. This is incorrect. To remedy this, compare with the nohz tick idle_calls counter of the remote CPU. Fixes: 674e75411fc2 (sched: cpufreq: Allow remote cpufreq callbacks) Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Joel Fernandes <joelaf@google.com> Cc: 4.14+ <stable@vger.kernel.org> # 4.14+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-12-28kernel/irq: Extend lockdep class for request mutexAndrew Lunn
The IRQ code already has support for lockdep class for the lock mutex in an interrupt descriptor. Extend this to add a second class for the request mutex in the descriptor. Not having a class is resulting in false positive splats in some code paths. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: linus.walleij@linaro.org Cc: grygorii.strashko@ti.com Cc: f.fainelli@gmail.com Link: https://lkml.kernel.org/r/1512234664-21555-1-git-send-email-andrew@lunn.ch
2017-12-23x86/mm/pti: Add infrastructure for page table isolationThomas Gleixner
Add the initial files for kernel page table isolation, with a minimal init function and the boot time detection for this misfeature. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Laight <David.Laight@aculab.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eduardo Valentin <eduval@amazon.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will.deacon@arm.com> Cc: aliguori@amazon.com Cc: daniel.gruss@iaik.tugraz.at Cc: hughd@google.com Cc: keescook@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-12-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller" "What's a holiday weekend without some networking bug fixes? [1] 1) Fix some eBPF JIT bugs wrt. SKB pointers across helper function calls, from Daniel Borkmann. 2) Fix regression from errata limiting change to marvell PHY driver, from Zhao Qiang. 3) Fix u16 overflow in SCTP, from Xin Long. 4) Fix potential memory leak during bridge newlink, from Nikolay Aleksandrov. 5) Fix BPF selftest build on s390, from Hendrik Brueckner. 6) Don't append to cfg80211 automatically generated certs file, always write new ones from scratch. From Thierry Reding. 7) Fix sleep in atomic in mac80211 hwsim, from Jia-Ju Bai. 8) Fix hang on tg3 MTU change with certain chips, from Brian King. 9) Add stall detection to arc emac driver and reset chip when this happens, from Alexander Kochetkov. 10) Fix MTU limitng in GRE tunnel drivers, from Xin Long. 11) Fix stmmac timestamping bug due to mis-shifting of field. From Fredrik Hallenberg. 12) Fix metrics match when deleting an ipv4 route. The kernel sets some internal metrics bits which the user isn't going to set when it makes the delete request. From Phil Sutter. 13) mvneta driver loop over RX queues limits on "txq_number" :-) Fix from Yelena Krivosheev. 14) Fix double free and memory corruption in get_net_ns_by_id, from Eric W. Biederman. 15) Flush ipv4 FIB tables in the reverse order. Some tables can share their actual backing data, in particular this happens for the MAIN and LOCAL tables. We have to kill the LOCAL table first, because it uses MAIN's backing memory. Fix from Ido Schimmel. 16) Several eBPF verifier value tracking fixes, from Edward Cree, Jann Horn, and Alexei Starovoitov. 17) Make changes to ipv6 autoflowlabel sysctl really propagate to sockets, unless the socket has set the per-socket value explicitly. From Shaohua Li. 18) Fix leaks and double callback invocations of zerocopy SKBs, from Willem de Bruijn" [1] Is this a trick question? "Relaxing"? "Quiet"? "Fine"? - Linus. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (77 commits) skbuff: skb_copy_ubufs must release uarg even without user frags skbuff: orphan frags before zerocopy clone net: reevalulate autoflowlabel setting after sysctl setting openvswitch: Fix pop_vlan action for double tagged frames ipv6: Honor specified parameters in fibmatch lookup bpf: do not allow root to mangle valid pointers selftests/bpf: add tests for recent bugfixes bpf: fix integer overflows bpf: don't prune branches when a scalar is replaced with a pointer bpf: force strict alignment checks for stack pointers bpf: fix missing error return in check_stack_boundary() bpf: fix 32-bit ALU op verification bpf: fix incorrect tracking of register size truncation bpf: fix incorrect sign extension in check_alu_op() bpf/verifier: fix bounds calculation on BPF_RSH ipv4: Fix use-after-free when flushing FIB tables s390/qeth: fix error handling in checksum cmd callback tipc: remove joining group member from congested list selftests: net: Adding config fragment CONFIG_NUMA=y nfp: bpf: keep track of the offloaded program ...
2017-12-21IB/mlx5: Fix congestion counters in LAG modeMajd Dibbiny
Congestion counters are counted and queried per physical function. When working in LAG mode, CNP packets can be sent or received on both of the functions, thus congestion counters should be aggregated from the two physical functions. Fixes: e1f24a79f424 ("IB/mlx5: Support congestion related counters") Signed-off-by: Majd Dibbiny <majd@mellanox.com> Reviewed-by: Aviv Heller <avivh@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>