aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2013-11-22Merge branch 'linux-linaro-lsk' into linux-linaro-lsk-androidlsk-android-13.11Mark Brown
2013-11-22Merge remote-tracking branch 'lsk/v3.10/topic/big.LITTLE' into linux-linaro-lsklsk-13.11Mark Brown
2013-11-22sched: hmp: Fix build breakage when not using CONFIG_SCHED_HMPv3.10/topic/big.LITTLEChris Redpath
hmp_variable_scale_convert was used without guards in __update_entity_runnable_avg. Guard it. Signed-off-by: Chris Redpath <chris.redpath@arm.com> Signed-off-by: Mark Brown <broonie@linaro.org>
2013-11-21Merge branch 'linux-linaro-lsk' into linux-linaro-lsk-androidlinux-linaro-lsk-android-testMark Brown
2013-11-21Merge remote-tracking branch 'lsk/v3.10/topic/big.LITTLE' into linux-linaro-lskMark Brown
2013-11-21Merge branch 'for-lsk' of git://git.linaro.org/arm/big.LITTLE/mp into ↵Mark Brown
lsk-v3.10-big.LITTLE
2013-11-21sched: hmp: add read-only hmp domain sysfs fileChris Redpath
In order to allow userspace to restrict known low-load tasks to little CPUs, we must export this knowledge from the kernel or expect userspace to make their own attempts at figuring it out. Since we now have a userspace requirement for an HMP implementation to always have at least some sysfs files, change the integration so that it only depends upon CONFIG_SCHED_HMP rather than CONFIG_HMP_VARIABLE_SCALE. Fix Kconfig text to match. Signed-off-by: Chris Redpath <chris.redpath@arm.com> Signed-off-by: Jon Medhurst <tixy@linaro.org>
2013-11-21HMP: Avoid using the cpu stopper to stop runnable tasksMathieu Poirier
When migrating a runnable task, we use the CPU stopper on the source CPU to ensure that the task to be moved is not currently running. Before this patch, all forced migrations (up, offload, idle pull) use the stopper for every migration. Using the CPU stopper is mandatory only when a task is currently running on a CPU. Otherwise tasks can be moved by locking the source and destination run queues. This patch checks to see if the task to be moved are currently running. If not the task is moved directly without using the stopper thread. Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org> Signed-off-by: Jon Medhurst <tixy@linaro.org>
2013-11-21Merge branch 'linux-linaro-lsk' into linux-linaro-lsk-androidMark Brown
2013-11-21Merge tag 'v3.10.20' into linux-linaro-lskMark Brown
This is the 3.10.20 stable release
2013-11-20perf: Fix perf ring buffer memory orderingPeter Zijlstra
commit bf378d341e4873ed928dc3c636252e6895a21f50 upstream. The PPC64 people noticed a missing memory barrier and crufty old comments in the perf ring buffer code. So update all the comments and add the missing barrier. When the architecture implements local_t using atomic_long_t there will be double barriers issued; but short of introducing more conditional barrier primitives this is the best we can do. Reported-by: Victor Kaplansky <victork@il.ibm.com> Tested-by: Victor Kaplansky <victork@il.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: michael@ellerman.id.au Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: Michael Neuling <mikey@neuling.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: anton@samba.org Cc: benh@kernel.crashing.org Link: http://lkml.kernel.org/r/20131025173749.GG19466@laptop.lan Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Michael Neuling <mikey@neuling.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-20tracing: Fix potential out-of-bounds in trace_get_user()Steven Rostedt
commit 057db8488b53d5e4faa0cedb2f39d4ae75dfbdbb upstream. Andrey reported the following report: ERROR: AddressSanitizer: heap-buffer-overflow on address ffff8800359c99f3 ffff8800359c99f3 is located 0 bytes to the right of 243-byte region [ffff8800359c9900, ffff8800359c99f3) Accessed by thread T13003: #0 ffffffff810dd2da (asan_report_error+0x32a/0x440) #1 ffffffff810dc6b0 (asan_check_region+0x30/0x40) #2 ffffffff810dd4d3 (__tsan_write1+0x13/0x20) #3 ffffffff811cd19e (ftrace_regex_release+0x1be/0x260) #4 ffffffff812a1065 (__fput+0x155/0x360) #5 ffffffff812a12de (____fput+0x1e/0x30) #6 ffffffff8111708d (task_work_run+0x10d/0x140) #7 ffffffff810ea043 (do_exit+0x433/0x11f0) #8 ffffffff810eaee4 (do_group_exit+0x84/0x130) #9 ffffffff810eafb1 (SyS_exit_group+0x21/0x30) #10 ffffffff81928782 (system_call_fastpath+0x16/0x1b) Allocated by thread T5167: #0 ffffffff810dc778 (asan_slab_alloc+0x48/0xc0) #1 ffffffff8128337c (__kmalloc+0xbc/0x500) #2 ffffffff811d9d54 (trace_parser_get_init+0x34/0x90) #3 ffffffff811cd7b3 (ftrace_regex_open+0x83/0x2e0) #4 ffffffff811cda7d (ftrace_filter_open+0x2d/0x40) #5 ffffffff8129b4ff (do_dentry_open+0x32f/0x430) #6 ffffffff8129b668 (finish_open+0x68/0xa0) #7 ffffffff812b66ac (do_last+0xb8c/0x1710) #8 ffffffff812b7350 (path_openat+0x120/0xb50) #9 ffffffff812b8884 (do_filp_open+0x54/0xb0) #10 ffffffff8129d36c (do_sys_open+0x1ac/0x2c0) #11 ffffffff8129d4b7 (SyS_open+0x37/0x50) #12 ffffffff81928782 (system_call_fastpath+0x16/0x1b) Shadow bytes around the buggy address: ffff8800359c9700: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd ffff8800359c9780: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa ffff8800359c9800: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa ffff8800359c9880: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa ffff8800359c9900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 =>ffff8800359c9980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00[03]fb ffff8800359c9a00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa ffff8800359c9a80: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa ffff8800359c9b00: fa fa fa fa fa fa fa fa 00 00 00 00 00 00 00 00 ffff8800359c9b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff8800359c9c00: 00 00 00 00 00 00 00 00 fa fa fa fa fa fa fa fa Shadow byte legend (one shadow byte represents 8 application bytes): Addressable: 00 Partially addressable: 01 02 03 04 05 06 07 Heap redzone: fa Heap kmalloc redzone: fb Freed heap region: fd Shadow gap: fe The out-of-bounds access happens on 'parser->buffer[parser->idx] = 0;' Although the crash happened in ftrace_regex_open() the real bug occurred in trace_get_user() where there's an incrementation to parser->idx without a check against the size. The way it is triggered is if userspace sends in 128 characters (EVENT_BUF_SIZE + 1), the loop that reads the last character stores it and then breaks out because there is no more characters. Then the last character is read to determine what to do next, and the index is incremented without checking size. Then the caller of trace_get_user() usually nulls out the last character with a zero, but since the index is equal to the size, it writes a nul character after the allocated space, which can corrupt memory. Luckily, only root user has write access to this file. Link: http://lkml.kernel.org/r/20131009222323.04fd1a0d@gandalf.local.home Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-14Merge remote-tracking branch 'lsk/v3.10/topic/android-fixes' into ↵Mark Brown
linux-linaro-lsk-android The cpufreq_interactive changes have been merged upstream and the local version dropped. Conflicts: drivers/cpufreq/cpufreq_interactive.c
2013-11-13Merge branch 'linux-linaro-lsk' into linux-linaro-lsk-androidMark Brown
2013-11-13Merge tag 'v3.10.19' into linux-linaro-lskMark Brown
This is the 3.10.19 stable release
2013-11-13clockevents: Sanitize ticks to nsec conversionThomas Gleixner
commit 97b9410643475d6557d2517c2aff9fd2221141a9 upstream. Marc Kleine-Budde pointed out, that commit 77cc982 "clocksource: use clockevents_config_and_register() where possible" caused a regression for some of the converted subarchs. The reason is, that the clockevents core code converts the minimal hardware tick delta to a nanosecond value for core internal usage. This conversion is affected by integer math rounding loss, so the backwards conversion to hardware ticks will likely result in a value which is less than the configured hardware limitation. The affected subarchs used their own workaround (SIGH!) which got lost in the conversion. The solution for the issue at hand is simple: adding evt->mult - 1 to the shifted value before the integer divison in the core conversion function takes care of it. But this only works for the case where for the scaled math mult/shift pair "mult <= 1 << shift" is true. For the case where "mult > 1 << shift" we can apply the rounding add only for the minimum delta value to make sure that the backward conversion is not less than the given hardware limit. For the upper bound we need to omit the rounding add, because the backwards conversion is always larger than the original latch value. That would violate the upper bound of the hardware device. Though looking closer at the details of that function reveals another bogosity: The upper bounds check is broken as well. Checking for a resulting "clc" value greater than KTIME_MAX after the conversion is pointless. The conversion does: u64 clc = (latch << evt->shift) / evt->mult; So there is no sanity check for (latch << evt->shift) exceeding the 64bit boundary. The latch argument is "unsigned long", so on a 64bit arch the handed in argument could easily lead to an unnoticed shift overflow. With the above rounding fix applied the calculation before the divison is: u64 clc = (latch << evt->shift) + evt->mult - 1; So we need to make sure, that neither the shift nor the rounding add is overflowing the u64 boundary. [ukl: move assignment to rnd after eventually changing mult, fix build issue and correct comment with the right math] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Russell King - ARM Linux <linux@arm.linux.org.uk> Cc: Marc Kleine-Budde <mkl@pengutronix.de> Cc: nicolas.ferre@atmel.com Cc: Marc Pignat <marc.pignat@hevs.ch> Cc: john.stultz@linaro.org Cc: kernel@pengutronix.de Cc: Ronald Wahl <ronald.wahl@raritan.com> Cc: LAK <linux-arm-kernel@lists.infradead.org> Cc: Ludovic Desroches <ludovic.desroches@atmel.com> Link: http://lkml.kernel.org/r/1380052223-24139-1-git-send-email-u.kleine-koenig@pengutronix.de Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-13cgroup: fix to break the while loop in cgroup_attach_task() correctlyAnjana V Kumar
commit ea84753c98a7ac6b74e530b64c444a912b3835ca upstream. Both Anjana and Eunki reported a stall in the while_each_thread loop in cgroup_attach_task(). It's because, when we attach a single thread to a cgroup, if the cgroup is exiting or is already in that cgroup, we won't break the loop. If the task is already in the cgroup, the bug can lead to another thread being attached to the cgroup unexpectedly: # echo 5207 > tasks # cat tasks 5207 # echo 5207 > tasks # cat tasks 5207 5215 What's worse, if the task to be attached isn't the leader of the thread group, we might never exit the loop, hence cpu stall. Thanks for Oleg's analysis. This bug was introduced by commit 081aa458c38ba576bdd4265fc807fa95b48b9e79 ("cgroup: consolidate cgroup_attach_task() and cgroup_attach_proc()") [ lizf: - fixed the first continue, pointed out by Oleg, - rewrote changelog. ] Reported-by: Eunki Kim <eunki_kim@samsung.com> Reported-by: Anjana V Kumar <anjanavk12@gmail.com> Signed-off-by: Anjana V Kumar <anjanavk12@gmail.com> Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-14Merge branch 'linux-linaro-lsk' into linux-linaro-lsk-androidMark Brown
2013-10-14Merge tag 'v3.10.16' into linux-linaro-lskMark Brown
This is the 3.10.16 stable release
2013-10-13irq: Force hardirq exit's softirq processing on its own stackFrederic Weisbecker
commit ded797547548a5b8e7b92383a41e4c0e6b0ecb7f upstream. The commit facd8b80c67a3cf64a467c4a2ac5fb31f2e6745b ("irq: Sanitize invoke_softirq") converted irq exit calls of do_softirq() to __do_softirq() on all architectures, assuming it was only used there for its irq disablement properties. But as a side effect, the softirqs processed in the end of the hardirq are always called on the inline current stack that is used by irq_exit() instead of the softirq stack provided by the archs that override do_softirq(). The result is mostly safe if the architecture runs irq_exit() on a separate irq stack because then softirqs are processed on that same stack that is near empty at this stage (assuming hardirq aren't nesting). Otherwise irq_exit() runs in the task stack and so does the softirq too. The interrupted call stack can be randomly deep already and the softirq can dig through it even further. To add insult to the injury, this softirq can be interrupted by a new hardirq, maximizing the chances for a stack overrun as reported in powerpc for example: do_IRQ: stack overflow: 1920 CPU: 0 PID: 1602 Comm: qemu-system-ppc Not tainted 3.10.4-300.1.fc19.ppc64p7 #1 Call Trace: [c0000000050a8740] .show_stack+0x130/0x200 (unreliable) [c0000000050a8810] .dump_stack+0x28/0x3c [c0000000050a8880] .do_IRQ+0x2b8/0x2c0 [c0000000050a8930] hardware_interrupt_common+0x154/0x180 --- Exception: 501 at .cp_start_xmit+0x3a4/0x820 [8139cp] LR = .cp_start_xmit+0x390/0x820 [8139cp] [c0000000050a8d40] .dev_hard_start_xmit+0x394/0x640 [c0000000050a8e00] .sch_direct_xmit+0x110/0x260 [c0000000050a8ea0] .dev_queue_xmit+0x260/0x630 [c0000000050a8f40] .br_dev_queue_push_xmit+0xc4/0x130 [bridge] [c0000000050a8fc0] .br_dev_xmit+0x198/0x270 [bridge] [c0000000050a9070] .dev_hard_start_xmit+0x394/0x640 [c0000000050a9130] .dev_queue_xmit+0x428/0x630 [c0000000050a91d0] .ip_finish_output+0x2a4/0x550 [c0000000050a9290] .ip_local_out+0x50/0x70 [c0000000050a9310] .ip_queue_xmit+0x148/0x420 [c0000000050a93b0] .tcp_transmit_skb+0x4e4/0xaf0 [c0000000050a94a0] .__tcp_ack_snd_check+0x7c/0xf0 [c0000000050a9520] .tcp_rcv_established+0x1e8/0x930 [c0000000050a95f0] .tcp_v4_do_rcv+0x21c/0x570 [c0000000050a96c0] .tcp_v4_rcv+0x734/0x930 [c0000000050a97a0] .ip_local_deliver_finish+0x184/0x360 [c0000000050a9840] .ip_rcv_finish+0x148/0x400 [c0000000050a98d0] .__netif_receive_skb_core+0x4f8/0xb00 [c0000000050a99d0] .netif_receive_skb+0x44/0x110 [c0000000050a9a70] .br_handle_frame_finish+0x2bc/0x3f0 [bridge] [c0000000050a9b20] .br_nf_pre_routing_finish+0x2ac/0x420 [bridge] [c0000000050a9bd0] .br_nf_pre_routing+0x4dc/0x7d0 [bridge] [c0000000050a9c70] .nf_iterate+0x114/0x130 [c0000000050a9d30] .nf_hook_slow+0xb4/0x1e0 [c0000000050a9e00] .br_handle_frame+0x290/0x330 [bridge] [c0000000050a9ea0] .__netif_receive_skb_core+0x34c/0xb00 [c0000000050a9fa0] .netif_receive_skb+0x44/0x110 [c0000000050aa040] .napi_gro_receive+0xe8/0x120 [c0000000050aa0c0] .cp_rx_poll+0x31c/0x590 [8139cp] [c0000000050aa1d0] .net_rx_action+0x1dc/0x310 [c0000000050aa2b0] .__do_softirq+0x158/0x330 [c0000000050aa3b0] .irq_exit+0xc8/0x110 [c0000000050aa430] .do_IRQ+0xdc/0x2c0 [c0000000050aa4e0] hardware_interrupt_common+0x154/0x180 --- Exception: 501 at .bad_range+0x1c/0x110 LR = .get_page_from_freelist+0x908/0xbb0 [c0000000050aa7d0] .list_del+0x18/0x50 (unreliable) [c0000000050aa850] .get_page_from_freelist+0x908/0xbb0 [c0000000050aa9e0] .__alloc_pages_nodemask+0x21c/0xae0 [c0000000050aaba0] .alloc_pages_vma+0xd0/0x210 [c0000000050aac60] .handle_pte_fault+0x814/0xb70 [c0000000050aad50] .__get_user_pages+0x1a4/0x640 [c0000000050aae60] .get_user_pages_fast+0xec/0x160 [c0000000050aaf10] .__gfn_to_pfn_memslot+0x3b0/0x430 [kvm] [c0000000050aafd0] .kvmppc_gfn_to_pfn+0x64/0x130 [kvm] [c0000000050ab070] .kvmppc_mmu_map_page+0x94/0x530 [kvm] [c0000000050ab190] .kvmppc_handle_pagefault+0x174/0x610 [kvm] [c0000000050ab270] .kvmppc_handle_exit_pr+0x464/0x9b0 [kvm] [c0000000050ab320] kvm_start_lightweight+0x1ec/0x1fc [kvm] [c0000000050ab4f0] .kvmppc_vcpu_run_pr+0x168/0x3b0 [kvm] [c0000000050ab9c0] .kvmppc_vcpu_run+0xc8/0xf0 [kvm] [c0000000050aba50] .kvm_arch_vcpu_ioctl_run+0x5c/0x1a0 [kvm] [c0000000050abae0] .kvm_vcpu_ioctl+0x478/0x730 [kvm] [c0000000050abc90] .do_vfs_ioctl+0x4ec/0x7c0 [c0000000050abd80] .SyS_ioctl+0xd4/0xf0 [c0000000050abe30] syscall_exit+0x0/0x98 Since this is a regression, this patch proposes a minimalistic and low-risk solution by blindly forcing the hardirq exit processing of softirqs on the softirq stack. This way we should reduce significantly the opportunities for task stack overflow dug by softirqs. Longer term solutions may involve extending the hardirq stack coverage to irq_exit(), etc... Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@au1.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Mackerras <paulus@au1.ibm.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: James E.J. Bottomley <jejb@parisc-linux.org> Cc: Helge Deller <deller@gmx.de> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-11Merge branch 'linux-linaro-lsk' into linux-linaro-lsk-androidMark Brown
2013-10-11Merge remote-tracking branch 'lsk/v3.10/topic/big.LITTLE' into linux-linaro-lskMark Brown
2013-10-11Merge tag 'big-LITTLE-MP-13.10' into for-lskJon Medhurst
2013-10-11HMP: Implement task packing for small tasks in HMP systemsChris Redpath
If we wake up a task on a little CPU, fill CPUs rather than spread. Adds 2 new files to sys/kernel/hmp to control packing behaviour. packing_enable: task packing enabled (1) or disabled (0) packing_limit: Runqueues will be filled up to this load ratio. This functionality is disabled by default on TC2 as it lacks per-cpu power gating so packing small tasks there doesn't make sense. Signed-off-by: Chris Redpath <chris.redpath@arm.com> Signed-off-by: Liviu Dudau <Liviu.Dudau@arm.com> Signed-off-by: Jon Medhurst <tixy@linaro.org>
2013-10-11hmp: Remove potential for task_struct access raceChris Redpath
Accessing the task_struct can be racy in certain conditions, so we need to only acquire the data when needed. Signed-off-by: Chris Redpath <chris.redpath@arm.com> Signed-off-by: Liviu Dudau <Liviu.Dudau@arm.com> Signed-off-by: Jon Medhurst <tixy@linaro.org>
2013-10-11sched: HMP: fix potential logical errorsChris Redpath
The previous API for hmp_up_migration reset the destination CPU every time, regardless of if a migration was desired. The code using it assumed that the value would not be changed unless a migration was required. In one rare circumstance, this could have lead to a task migrating to a little CPU at the wrong time. Fixing that lead to a slight logical tweak to make the surrounding APIs operate a bit more obviously. Signed-off-by: Chris Redpath <chris.redpath@arm.com> Signed-off-by: Robin Randhawa <robin.randhawa@arm.com> Signed-off-by: Liviu Dudau <Liviu.Dudau@arm.com> Signed-off-by: Jon Medhurst <tixy@linaro.org>
2013-10-11smp: smp_cross_call function pointer tracingChris Redpath
generic tracing for smp_cross_call function calls Signed-off-by: Chris Redpath <chris.redpath@arm.com> Signed-off-by: Liviu Dudau <Liviu.Dudau@arm.com> Signed-off-by: Jon Medhurst <tixy@linaro.org>
2013-10-11sched: HMP: Additional trace points for debugging HMP behaviourChris Redpath
1. Replace magic numbers in code for migration trace. Trace points still emit a number as force=<n> field: force=0 : wakeup migration force=1 : forced migration force=2 : offload migration force=3 : idle pull migration 2. Add trace to expose offload decision-making. Also adds tracing rq->nr_running so that you can look back to see what state the RQ was in at the time. Signed-off-by: Chris Redpath <chris.redpath@arm.com> Signed-off-by: Liviu Dudau <Liviu.Dudau@arm.com> Signed-off-by: Jon Medhurst <tixy@linaro.org>
2013-10-11sched: HMP: Change default HMP thresholdsChris Redpath
When the up-threshold is at 512 on TC2, behaviour looks OK since the graphic-related tasks are very heavy due to lack of a GPU. Increasing the up-threshold does not reduce power consumption. When a GPU is present, graphic tasks are much less CPU-heavy and so additional power may be saved by having a higher threshold. Signed-off-by: Chris Redpath <chris.redpath@arm.com> Signed-off-by: Liviu Dudau <Liviu.Dudau@arm.com> Signed-off-by: Jon Medhurst <tixy@linaro.org>
2013-10-04Merge branch 'linux-linaro-lsk' into linux-linaro-lsk-androidMark Brown
2013-10-04Merge tag 'v3.10.14' into linux-linaro-lskMark Brown
This is the 3.10.14 stable release
2013-10-01audit: fix endless wait in audit_log_start()Konstantin Khlebnikov
commit 8ac1c8d5deba65513b6a82c35e89e73996c8e0d6 upstream. After commit 829199197a43 ("kernel/audit.c: avoid negative sleep durations") audit emitters will block forever if userspace daemon cannot handle backlog. After the timeout the waiting loop turns into busy loop and runs until daemon dies or returns back to work. This is a minimal patch for that bug. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Richard Guy Briggs <rgb@redhat.com> Cc: Eric Paris <eparis@redhat.com> Cc: Chuck Anderson <chuck.anderson@oracle.com> Cc: Dan Duval <dan.duval@oracle.com> Cc: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Jonghwan Choi <jhbird.choi@samsung.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-01sched/fair: Fix small race where child->se.parent,cfs_rq might point to ↵Daisuke Nishimura
invalid ones commit 6c9a27f5da9609fca46cb2b183724531b48f71ad upstream. There is a small race between copy_process() and cgroup_attach_task() where child->se.parent,cfs_rq points to invalid (old) ones. parent doing fork() | someone moving the parent to another cgroup -------------------------------+--------------------------------------------- copy_process() + dup_task_struct() -> parent->se is copied to child->se. se.parent,cfs_rq of them point to old ones. cgroup_attach_task() + cgroup_task_migrate() -> parent->cgroup is updated. + cpu_cgroup_attach() + sched_move_task() + task_move_group_fair() +- set_task_rq() -> se.parent,cfs_rq of parent are updated. + cgroup_fork() -> parent->cgroup is copied to child->cgroup. (*1) + sched_fork() + task_fork_fair() -> se.parent,cfs_rq of child are accessed while they point to old ones. (*2) In the worst case, this bug can lead to "use-after-free" and cause a panic, because it's new cgroup's refcount that is incremented at (*1), so the old cgroup(and related data) can be freed before (*2). In fact, a panic caused by this bug was originally caught in RHEL6.4. BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff81051e3e>] sched_slice+0x6e/0xa0 [...] Call Trace: [<ffffffff81051f25>] place_entity+0x75/0xa0 [<ffffffff81056a3a>] task_fork_fair+0xaa/0x160 [<ffffffff81063c0b>] sched_fork+0x6b/0x140 [<ffffffff8106c3c2>] copy_process+0x5b2/0x1450 [<ffffffff81063b49>] ? wake_up_new_task+0xd9/0x130 [<ffffffff8106d2f4>] do_fork+0x94/0x460 [<ffffffff81072a9e>] ? sys_wait4+0xae/0x100 [<ffffffff81009598>] sys_clone+0x28/0x30 [<ffffffff8100b393>] stub_clone+0x13/0x20 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/039601ceae06$733d3130$59b79390$@mxp.nes.nec.co.jp Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-01sched/cputime: Do not scale when utime == 0Stanislaw Gruszka
commit 5a8e01f8fa51f5cbce8f37acc050eb2319d12956 upstream. scale_stime() silently assumes that stime < rtime, otherwise when stime == rtime and both values are big enough (operations on them do not fit in 32 bits), the resulting scaling stime can be bigger than rtime. In consequence utime = rtime - stime results in negative value. User space visible symptoms of the bug are overflowed TIME values on ps/top, for example: $ ps aux | grep rcu root 8 0.0 0.0 0 0 ? S 12:42 0:00 [rcuc/0] root 9 0.0 0.0 0 0 ? S 12:42 0:00 [rcub/0] root 10 62422329 0.0 0 0 ? R 12:42 21114581:37 [rcu_preempt] root 11 0.1 0.0 0 0 ? S 12:42 0:02 [rcuop/0] root 12 62422329 0.0 0 0 ? S 12:42 21114581:35 [rcuop/1] root 10 62422329 0.0 0 0 ? R 12:42 21114581:37 [rcu_preempt] or overflowed utime values read directly from /proc/$PID/stat Reference: https://lkml.org/lkml/2013/8/20/259 Reported-and-tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Cc: stable@vger.kernel.org Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Link: http://lkml.kernel.org/r/20130904131602.GC2564@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-01timekeeping: Fix HRTICK related deadlock from ntp lock changesJohn Stultz
commit 7bd36014460f793c19e7d6c94dab67b0afcfcb7f upstream. Gerlando Falauto reported that when HRTICK is enabled, it is possible to trigger system deadlocks. These were hard to reproduce, as HRTICK has been broken in the past, but seemed to be connected to the timekeeping_seq lock. Since seqlock/seqcount's aren't supported w/ lockdep, I added some extra spinlock based locking and triggered the following lockdep output: [ 15.849182] ntpd/4062 is trying to acquire lock: [ 15.849765] (&(&pool->lock)->rlock){..-...}, at: [<ffffffff810aa9b5>] __queue_work+0x145/0x480 [ 15.850051] [ 15.850051] but task is already holding lock: [ 15.850051] (timekeeper_lock){-.-.-.}, at: [<ffffffff810df6df>] do_adjtimex+0x7f/0x100 <snip> [ 15.850051] Chain exists of: &(&pool->lock)->rlock --> &p->pi_lock --> timekeeper_lock [ 15.850051] Possible unsafe locking scenario: [ 15.850051] [ 15.850051] CPU0 CPU1 [ 15.850051] ---- ---- [ 15.850051] lock(timekeeper_lock); [ 15.850051] lock(&p->pi_lock); [ 15.850051] lock(timekeeper_lock); [ 15.850051] lock(&(&pool->lock)->rlock); [ 15.850051] [ 15.850051] *** DEADLOCK *** The deadlock was introduced by 06c017fdd4dc48451a ("timekeeping: Hold timekeepering locks in do_adjtimex and hardpps") in 3.10 This patch avoids this deadlock, by moving the call to schedule_delayed_work() outside of the timekeeper lock critical section. Reported-by: Gerlando Falauto <gerlando.falauto@keymile.com> Tested-by: Lin Ming <minggr@gmail.com> Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: http://lkml.kernel.org/r/1378943457-27314-1-git-send-email-john.stultz@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-09-27Merge branch 'linux-linaro-lsk' into linux-linaro-lsk-androidMark Brown
2013-09-27Merge tag 'v3.10.13' into linux-linaro-lskMark Brown
This is the 3.10.13 stable release
2013-09-26pidns: fix vfork() after unshare(CLONE_NEWPID)Oleg Nesterov
commit e79f525e99b04390ca4d2366309545a836c03bf1 upstream. Commit 8382fcac1b81 ("pidns: Outlaw thread creation after unshare(CLONE_NEWPID)") nacks CLONE_VM if the forking process unshared pid_ns, this obviously breaks vfork: int main(void) { assert(unshare(CLONE_NEWUSER | CLONE_NEWPID) == 0); assert(vfork() >= 0); _exit(0); return 0; } fails without this patch. Change this check to use CLONE_SIGHAND instead. This also forbids CLONE_THREAD automatically, and this is what the comment implies. We could probably even drop CLONE_SIGHAND and use CLONE_THREAD, but it would be safer to not do this. The current check denies CLONE_SIGHAND implicitely and there is no reason to change this. Eric said "CLONE_SIGHAND is fine. CLONE_THREAD would be even better. Having shared signal handling between two different pid namespaces is the case that we are fundamentally guarding against." Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Colin Walters <walters@redhat.com> Acked-by: Andy Lutomirski <luto@amacapital.net> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-09-26pidns: Fix hang in zap_pid_ns_processes by sending a potentially extra wakeupEric W. Biederman
commit a606488513543312805fab2b93070cefe6a3016c upstream. Serge Hallyn <serge.hallyn@ubuntu.com> writes: > Since commit af4b8a83add95ef40716401395b44a1b579965f4 it's been > possible to get into a situation where a pidns reaper is > <defunct>, reparented to host pid 1, but never reaped. How to > reproduce this is documented at > > https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1168526 > (and see > https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1168526/comments/13) > In short, run repeated starts of a container whose init is > > Process.exit(0); > > sysrq-t when such a task is playing zombie shows: > > [ 131.132978] init x ffff88011fc14580 0 2084 2039 0x00000000 > [ 131.132978] ffff880116e89ea8 0000000000000002 ffff880116e89fd8 0000000000014580 > [ 131.132978] ffff880116e89fd8 0000000000014580 ffff8801172a0000 ffff8801172a0000 > [ 131.132978] ffff8801172a0630 ffff88011729fff0 ffff880116e14650 ffff88011729fff0 > [ 131.132978] Call Trace: > [ 131.132978] [<ffffffff816f6159>] schedule+0x29/0x70 > [ 131.132978] [<ffffffff81064591>] do_exit+0x6e1/0xa40 > [ 131.132978] [<ffffffff81071eae>] ? signal_wake_up_state+0x1e/0x30 > [ 131.132978] [<ffffffff8106496f>] do_group_exit+0x3f/0xa0 > [ 131.132978] [<ffffffff810649e4>] SyS_exit_group+0x14/0x20 > [ 131.132978] [<ffffffff8170102f>] tracesys+0xe1/0xe6 > > Further debugging showed that every time this happened, zap_pid_ns_processes() > started with nr_hashed being 3, while we were expecting it to drop to 2. > Any time it didn't happen, nr_hashed was 1 or 2. So the reaper was > waiting for nr_hashed to become 2, but free_pid() only wakes the reaper > if nr_hashed hits 1. The issue is that when the task group leader of an init process exits before other tasks of the init process when the init process finally exits it will be a secondary task sleeping in zap_pid_ns_processes and waiting to wake up when the number of hashed pids drops to two. This case waits forever as free_pid only sends a wake up when the number of hashed pids drops to 1. To correct this the simple strategy of sending a possibly unncessary wake up when the number of hashed pids drops to 2 is adopted. Sending one extraneous wake up is relatively harmless, at worst we waste a little cpu time in the rare case when a pid namespace appropaches exiting. We can detect the case when the pid namespace drops to just two pids hashed race free in free_pid. Dereferencing pid_ns->child_reaper with the pidmap_lock held is safe without out the tasklist_lock because it is guaranteed that the detach_pid will be called on the child_reaper before it is freed and detach_pid calls __change_pid which calls free_pid which takes the pidmap_lock. __change_pid only calls free_pid if this is the last use of the pid. For a thread that is not the thread group leader the threads pid will only ever have one user because a threads pid is not allowed to be the pid of a process, of a process group or a session. For a thread that is a thread group leader all of the other threads of that process will be reaped before it is allowed for the thread group leader to be reaped ensuring there will only be one user of the threads pid as a process pid. Furthermore because the thread is the init process of a pid namespace all of the other processes in the pid namespace will have also been already freed leading to the fact that the pid will not be used as a session pid or a process group pid for any other running process. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Tested-by: Serge Hallyn <serge.hallyn@canonical.com> Reported-by: Serge Hallyn <serge.hallyn@ubuntu.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-09-26uprobes: Fix utask->depth accounting in handle_trampoline()Oleg Nesterov
commit 878b5a6efd38030c7a90895dc8346e8fb1e09b4c upstream. Currently utask->depth is simply the number of allocated/pending return_instance's in uprobe_task->return_instances list. handle_trampoline() should decrement this counter every time we handle/free an instance, but due to typo it does this only if ->chained == T. This means that in the likely case this counter is never decremented and the probed task can't report more than MAX_URETPROBE_DEPTH events. Reported-by: Mikhail Kulemin <Mikhail.Kulemin@ru.ibm.com> Reported-by: Hemant Kumar Shaw <hkshaw@linux.vnet.ibm.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Anton Arapov <anton@redhat.com> Cc: masami.hiramatsu.pt@hitachi.com Cc: srikar@linux.vnet.ibm.com Cc: systemtap@sourceware.org Link: http://lkml.kernel.org/r/20130911154726.GA8093@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-09-23Merge branch 'upstream/experimental/android-3.10' into ↵John Stultz
linaro-fixes/experimental/android-3.10
2013-09-19mm: add a field to store names for private anonymous memoryColin Cross
Userspace processes often have multiple allocators that each do anonymous mmaps to get memory. When examining memory usage of individual processes or systems as a whole, it is useful to be able to break down the various heaps that were allocated by each layer and examine their size, RSS, and physical memory usage. This patch adds a user pointer to the shared union in vm_area_struct that points to a null terminated string inside the user process containing a name for the vma. vmas that point to the same address will be merged, but vmas that point to equivalent strings at different addresses will not be merged. Userspace can set the name for a region of memory by calling prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name); Setting the name to NULL clears it. The names of named anonymous vmas are shown in /proc/pid/maps as [anon:<name>] and in /proc/pid/smaps in a new "Name" field that is only present for named vmas. If the userspace pointer is no longer valid all or part of the name will be replaced with "<fault>". The idea to store a userspace pointer to reduce the complexity within mm (at the expense of the complexity of reading /proc/pid/mem) came from Dave Hansen. This results in no runtime overhead in the mm subsystem other than comparing the anon_name pointers when considering vma merging. The pointer is stored in a union with fieds that are only used on file-backed mappings, so it does not increase memory usage. Change-Id: Ie2ffc0967d4ffe7ee4c70781313c7b00cf7e3092 Signed-off-by: Colin Cross <ccross@android.com>
2013-09-19add extra free kbytes tunableRik van Riel
Add a userspace visible knob to tell the VM to keep an extra amount of memory free, by increasing the gap between each zone's min and low watermarks. This is useful for realtime applications that call system calls and have a bound on the number of allocations that happen in any short time period. In this application, extra_free_kbytes would be left at an amount equal to or larger than than the maximum number of allocations that happen in any burst. It may also be useful to reduce the memory use of virtual machines (temporarily?), in a way that does not cause memory fragmentation like ballooning does. [ccross] Revived for use on old kernels where no other solution exists. The tunable will be removed on kernels that do better at avoiding direct reclaim. Change-Id: I765a42be8e964bfd3e2886d1ca85a29d60c3bb3e Signed-off-by: Rik van Riel<riel@redhat.com> Signed-off-by: Colin Cross <ccross@android.com>
2013-09-08Merge branch 'linux-linaro-lsk' into linux-linaro-lsk-androidMark Brown
2013-09-08Merge tag 'v3.10.11' into linux-linaro-lskMark Brown
This is the 3.10.11 stable release
2013-09-07workqueue: cond_resched() after processing each work itemTejun Heo
commit b22ce2785d97423846206cceec4efee0c4afd980 upstream. If !PREEMPT, a kworker running work items back to back can hog CPU. This becomes dangerous when a self-requeueing work item which is waiting for something to happen races against stop_machine. Such self-requeueing work item would requeue itself indefinitely hogging the kworker and CPU it's running on while stop_machine would wait for that CPU to enter stop_machine while preventing anything else from happening on all other CPUs. The two would deadlock. Jamie Liu reports that this deadlock scenario exists around scsi_requeue_run_queue() and libata port multiplier support, where one port may exclude command processing from other ports. With the right timing, scsi_requeue_run_queue() can end up requeueing itself trying to execute an IO which is asked to be retried while another device has an exclusive access, which in turn can't make forward progress due to stop_machine. Fix it by invoking cond_resched() after executing each work item. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jamie Liu <jamieliu@google.com> References: http://thread.gmane.org/gmane.linux.kernel/1552567 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-09-07timer_list: correct the iterator for timer_listNathan Zimmer
commit 84a78a6504f5c5394a8e558702e5b54131f01d14 upstream. Correct an issue with /proc/timer_list reported by Holger. When reading from the proc file with a sufficiently small buffer, 2k so not really that small, there was one could get hung trying to read the file a chunk at a time. The timer_list_start function failed to account for the possibility that the offset was adjusted outside the timer_list_next. Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> Reported-by: Holger Hans Peter Freyther <holger@freyther.de> Cc: John Stultz <john.stultz@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Berke Durak <berke.durak@xiphos.com> Cc: Jeff Layton <jlayton@redhat.com> Tested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-09-05Merge tag 'big-LITTLE-MP-13.08' into for-lskJon Medhurst
This merge is intended to tidyup the history of the big.LITTLE MP patchset to enable ongoing maintenance of a standalone MP branch. The only code change introduce by this merge is to add commit 0d5ddd14 (HMP: select 'best' task for migration rather than 'current') This change is already in the Linaro Stable Kernel as commit c5021c1eb9c73f38209180c65bd074ac70c97587
2013-09-05HMP: Update migration timer when we fork-migrateChris Redpath
Prevents fork-migration adversely interacting with normal migration (i.e. runqueues containing forked tasks being selected as migration targets when there is a better choice available) Signed-off-by: Chris Redpath <chris.redpath@arm.com> Signed-off-by: Liviu Dudau <liviu.dudau@arm.com> Signed-off-by: Jon Medhurst <tixy@linaro.org>
2013-09-05HMP: Access runqueue task clocks directly.Chris Redpath
Avoids accesses through cfs_rq going bad when the cpu_rq doesn't have a cfs member. Signed-off-by: Chris Redpath <chris.redpath@arm.com> Signed-off-by: Liviu Dudau <liviu.dudau@arm.com> Signed-off-by: Jon Medhurst <tixy@linaro.org>