aboutsummaryrefslogtreecommitdiff
path: root/ipc
AgeCommit message (Collapse)Author
2012-05-31ipc/mqueue: strengthen checks on mqueue creationDoug Ledford
We already check the mq attr struct if it's passed in, but now that the admin can set system wide defaults separate from maximums, it's actually possible to set the defaults to something that would overflow. So, if there is no attr struct passed in to the open call, check the default values. While we are at it, simplify mq_attr_ok() by making it return 0 or an error condition, so that way if we add more tests to it later, we have the option of what error should be returned instead of the calling location having to pick a possibly inaccurate error code. [akpm@linux-foundation.org: s/ENOMEM/EOVERFLOW/] Signed-off-by: Doug Ledford <dledford@redhat.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Manfred Spraul <manfred@colorfullife.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31ipc/mqueue: correct mq_attr_ok testDoug Ledford
While working on the other parts of the mqueue stuff, I noticed that the calculation for overflow in mq_attr_ok didn't actually match reality (this is especially true since my last patch which changed how we account memory slightly). In particular, we used to test for overflow using: msgs * msgsize + msgs * sizeof(struct msg_msg *) That was never really correct because each message we allocate via load_msg() is actually a struct msg_msg followed by the data for the message (and if struct msg_msg + data exceeds PAGE_SIZE we end up allocating struct msg_msgseg structs too, but accounting for them would get really tedious, so let's ignore those...they're only a pointer in size anyway). This patch updates the calculation to be more accurate in regards to maximum possible memory consumption by the mqueue. [akpm@linux-foundation.org: add a local to simplify overflow-checking expression] Signed-off-by: Doug Ledford <dledford@redhat.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Manfred Spraul <manfred@colorfullife.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31ipc/mqueue: improve performance of send/recvDoug Ledford
The existing implementation of the POSIX message queue send and recv functions is, well, abysmal. Even worse than abysmal. I submitted a patch to increase the maximum POSIX message queue limit to 65536 due to customer needs, however, upon looking over the send/recv implementation, I realized that my customer needs help with that too even if they don't know it. The basic problem is that, given the fairly typical use case scenario for a large queue of queueing lots of messages all at the same priority (I verified with my customer that this is indeed what their app does), the msg_insert routine is basically a frikkin' bubble sort. I mean, whoa, that's *so* middle school. OK, OK, to not slam the original author too much, I'm sure they didn't envision a queue depth of 50,000+ messages. No one would think that moving elements in an array, one at a time, and dereferencing each pointer in that array to check priority of the message being pointed too, again one at a time, for 50,000+ times would be good. So let's assume that, as is typical, the users have found a way to break our code simply by using it in a way we didn't envision. Fair enough. "So, just how broken is it?", you ask. I wondered the same thing, so I wrote an app to let me know. It's my next patch. It gave me some interesting results. Here's what it tested: Interference with other apps - In continuous mode, the app just sits there and hits a message queue forever, while you go do something productive on another terminal using other CPUs. You then measure how long it takes you to do that something productive. Then you restart the app in fake continuous mode, and it sits in a tight loop on a CPU while you repeat your tests. The whole point of this is to keep one CPU tied up (so it can't be used in your other work) but in one case tied up hitting the mqueue code so we can see the effect of walking that 65,528 element array one pointer at a time on the global CPU cache. If it's bad, then it will slow down your app on the other CPUs just by polluting cache mercilessly. In the fake case, it will be in a tight loop, but not polluting cache. Testing the mqueue subsystem directly - Here we just run a number of tests to see how the mqueue subsystem performs under different conditions. A couple conditions are known to be worst case for the old system, and some routines, so this tests all of them. So, on to the results already: Subsystem/Test Old New Time to compile linux kernel (make -j12 on a 6 core CPU) Running mqueue test user 49m10.744s user 45m26.294s sys 5m51.924s sys 4m59.894s total 55m02.668s total 50m26.188s Running fake test user 45m32.686s user 45m18.552s sys 5m12.465s sys 4m56.468s total 50m45.151s total 50m15.020s % slowdown from mqueue cache thrashing ~8% ~.5% Avg time to send/recv (in nanoseconds per message) when queue empty 305/288 349/318 when queue full (65528 messages) constant priority 526589/823 362/314 increasing priority 403105/916 495/445 decreasing priority 73420/594 482/409 random priority 280147/920 546/436 Time to fill/drain queue (65528 messages, in seconds) constant priority 17.37/.12 .13/.12 increasing priority 4.14/.14 .21/.18 decreasing priority 12.93/.13 .21/.18 random priority 8.88/.16 .22/.17 So, I think the results speak for themselves. It's possible this implementation could be improved by cacheing at least one priority level in the node tree (that would bring the queue empty performance more in line with the old implementation), but this works and is *so* much better than what we had, especially for the common case of a single priority in use, that further refinements can be in follow on patches. [akpm@linux-foundation.org: fix typo in comment, remove stray semicolon] [levinsasha928@gmail.com: use correct gfp flags in msg_insert] Signed-off-by: Doug Ledford <dledford@redhat.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Manfred Spraul <manfred@colorfullife.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31mqueue: separate mqueue default value from maximum valueKOSAKI Motohiro
Commit b231cca4381e ("message queues: increase range limits") changed mqueue default value when attr parameter is specified NULL from hard coded value to fs.mqueue.{msg,msgsize}_max sysctl value. This made large side effect. When user need to use two mqueue applications 1) using !NULL attr parameter and it require big message size and 2) using NULL attr parameter and only need small size message, app (1) require to raise fs.mqueue.msgsize_max and app (2) consume large memory size even though it doesn't need. Doug Ledford propsed to switch back it to static hard coded value. However it also has a compatibility problem. Some applications might started depend on the default value is tunable. The solution is to separate default value from maximum value. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Doug Ledford <dledford@redhat.com> Acked-by: Doug Ledford <dledford@redhat.com> Acked-by: Joe Korty <joe.korty@ccur.com> Cc: Amerigo Wang <amwang@redhat.com> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31mqueue: don't use kmalloc with KMALLOC_MAX_SIZEKOSAKI Motohiro
KMALLOC_MAX_SIZE is not a good threshold. It is extremely high and problematic. Unfortunately, some silly drivers depend on this and we can't change it. But any new code needn't use such extreme ugly high order allocations. It brings us awful fragmentation issues and system slowdown. Signed-off-by: KOSAKI Motohiro <mkosaki@jp.fujitsu.com> Acked-by: Doug Ledford <dledford@redhat.com> Acked-by: Joe Korty <joe.korty@ccur.com> Cc: Amerigo Wang <amwang@redhat.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Joe Korty <joe.korty@ccur.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31ipc/mqueue: update maximums for the mqueue subsystemDoug Ledford
Commit b231cca4381e ("message queues: increase range limits") changed the maximum size of a message in a message queue from INT_MAX to 8192*128. Unfortunately, we had customers that relied on a size much larger than 8192*128 on their production systems. After reviewing POSIX, we found that it is silent on the maximum message size. We did find a couple other areas in which it was not silent. Fix up the mqueue maximums so that the customer's system can continue to work, and document both the POSIX and real world requirements in ipc_namespace.h so that we don't have this issue crop back up. Also, commit 9cf18e1dd74cd0 ("ipc: HARD_MSGMAX should be higher not lower on 64bit") fiddled with HARD_MSGMAX without realizing that the number was intentionally in place to limit the msg queue depth to one that was small enough to kmalloc an array of pointers (hence why we divided 128k by sizeof(long)). If we wish to meet POSIX requirements, we have no choice but to change our allocation to a vmalloc instead (at least for the large queue size case). With that, it's possible to increase our allowed maximum to the POSIX requirements (or more if we choose). [sfr@canb.auug.org.au: using vmalloc requires including vmalloc.h] Signed-off-by: Doug Ledford <dledford@redhat.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: Amerigo Wang <amwang@redhat.com> Cc: Joe Korty <joe.korty@ccur.com> Cc: Jiri Slaby <jslaby@suse.cz> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31ipc/mqueue: enforce hard limitsDoug Ledford
In two places we don't enforce the hard limits for CAP_SYS_RESOURCE apps. In preparation for making more reasonable hard limits, start enforcing them even on CAP_SYS_RESOURCE. Signed-off-by: Doug Ledford <dledford@redhat.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: Amerigo Wang <amwang@redhat.com> Cc: Joe Korty <joe.korty@ccur.com> Cc: Jiri Slaby <jslaby@suse.cz> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31ipc/mqueue: switch back to using non-max values on createDoug Ledford
Commit b231cca4381e ("message queues: increase range limits") changed how we create a queue that does not include an attr struct passed to open so that it creates the queue with whatever the maximum values are. However, if the admin has set the maximums to allow flexibility in creating a queue (aka, both a large size and large queue are allowed, but combined they create a queue too large for the RLIMIT_MSGQUEUE of the user), then attempts to create a queue without an attr struct will fail. Switch back to using acceptable defaults regardless of what the maximums are. Note: so far, we only know of a few applications that rely on this behavior (specifically, set the maximums in /proc, then run the application which calls mq_open() without passing in an attr struct, and the application expects the newly created message queue to have the maximum sizes that were set in /proc used on the mq_open() call, and all of those applications that we know of are actually part of regression test suites that were coded to do something like this: for size in 4096 65536 $((1024 * 1024)) $((16 * 1024 * 1024)); do echo $size > /proc/sys/fs/mqueue/msgsize_max mq_open || echo "Error opening mq with size $size" done These test suites that depend on any behavior like this are broken. The concept that programs should rely upon the system wide maximum in order to get their desired results instead of simply using a attr struct to specify what they want is fundamentally unfriendly programming practice for any multi-tasking OS. Fixing this will break those few apps that we know of (and those app authors recognize the brokenness of their code and the need to fix it). However, the following patch "mqueue: separate mqueue default value" allows a workaround in the form of new knobs for the default msg queue creation parameters for any software out there that we don't already know about that might rely on this behavior at the moment. Signed-off-by: Doug Ledford <dledford@redhat.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: Amerigo Wang <amwang@redhat.com> Cc: Joe Korty <joe.korty@ccur.com> Cc: Jiri Slaby <jslaby@suse.cz> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31ipc/mqueue: cleanup definition names and locationsDoug Ledford
Since commit b231cca4381e ("message queues: increase range limits") on Oct 18, 2008, calls to mq_open() that did not pass in an attribute struct and expected to get default values for the size of the queue and the max message size now get the system wide maximums instead of hardwired defaults like they used to get. This was uncovered when one of the earlier patches in this patch set increased the default system wide maximums at the same time it increased the hard ceiling on the system wide maximums (a customer specifically needed the hard ceiling brought back up, the new ceiling that commit b231cca4381e introduced was too low for their production systems). By increasing the default maximums and not realising they were tied to any attempt to create a message queue without an attribute struct, I had inadvertently made it such that all message queue creation attempts without an attribute struct were failing because the new default maximums would create a queue that exceeded the default rlimit for message queue bytes. As a result, the system wide defaults were brought back down to their previous levels, and the system wide ceilings on the maximums were raised to meet the customer's needs. However, the fact that the no attribute struct behavior of mq_open() could be broken by changing the system wide maximums for message queues was seen as fundamentally broken itself. So we hardwired the no attribute case back like it used to be. But, then we realized that on the very off chance that some piece of software in the wild depended on that behavior, we could work around that issue by adding two new knobs to /proc that allowed setting the defaults for message queues created without an attr struct separately from the system wide maximums. What is not an option IMO is to leave the current behavior in place. No piece of software should ever rely on setting the system wide maximums in order to get a desired message queue. Such a reliance would be so fundamentally multitasking OS unfriendly as to not really be tolerable. Fortunately, we don't know of any software in the wild that uses this except for a regression test program that caught the issue in the first place. If there is though, we have made accommodations with the two new /proc knobs (and that's all the accommodations such fundamentally broken software can be allowed).. This patch: The various defines for minimums and maximums of the sysctl controllable mqueue values are scattered amongst different files and named inconsistently. Move them all into ipc_namespace.h and make them have consistent names. Additionally, make the number of queues per namespace also have a minimum and maximum and use the same sysctl function as the other two settable variables. Signed-off-by: Doug Ledford <dledford@redhat.com> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Amerigo Wang <amwang@redhat.com> Cc: Joe Korty <joe.korty@ccur.com> Cc: Jiri Slaby <jslaby@suse.cz> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-28Merge tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linuxLinus Torvalds
Pull writeback tree from Wu Fengguang: "Mainly from Jan Kara to avoid iput() in the flusher threads." * tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux: writeback: Avoid iput() from flusher thread vfs: Rename end_writeback() to clear_inode() vfs: Move waiting for inode writeback from end_writeback() to evict_inode() writeback: Refactor writeback_single_inode() writeback: Remove wb->list_lock from writeback_single_inode() writeback: Separate inode requeueing after writeback writeback: Move I_DIRTY_PAGES handling writeback: Move requeueing when I_SYNC set to writeback_sb_inodes() writeback: Move clearing of I_SYNC into inode_sync_complete() writeback: initialize global_dirty_limit fs: remove 8 bytes of padding from struct writeback_control on 64 bit builds mm: page-writeback.c: local functions should not be exposed globally
2012-05-06vfs: Rename end_writeback() to clear_inode()Jan Kara
After we moved inode_sync_wait() from end_writeback() it doesn't make sense to call the function end_writeback() anymore. Rename it to clear_inode() which well says what the function really does - set I_CLEAR flag. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
2012-05-03userns: Replace user_ns_map_uid and user_ns_map_gid with from_kuid and from_kgidEric W. Biederman
These function are no longer needed replace them with their more useful equivalents. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-04-07mqueue: Explicitly capture the user namespace to send the notification to.Eric W. Biederman
Stop relying on user->user_ns which is going away and instead capture the user_namespace of the process we are supposed to notify. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-04-07userns: Use cred->user_ns instead of cred->user->user_nsEric W. Biederman
Optimize performance and prepare for the removal of the user_ns reference from user_struct. Remove the slow long walk through cred->user->user_ns and instead go straight to cred->user_ns. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-03-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tileLinus Torvalds
Pull arch/tile (really asm-generic) update from Chris Metcalf: "These are a couple of asm-generic changes that apply to tile." * git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile: compat: use sys_sendfile64() implementation for sendfile syscall [PATCH v3] ipc: provide generic compat versions of IPC syscalls
2012-03-22Merge branch 'akpm' (Andrew's patch-bomb)Linus Torvalds
Merge first batch of patches from Andrew Morton: "A few misc things and all the MM queue" * emailed from Andrew Morton <akpm@linux-foundation.org>: (92 commits) memcg: avoid THP split in task migration thp: add HPAGE_PMD_* definitions for !CONFIG_TRANSPARENT_HUGEPAGE memcg: clean up existing move charge code mm/memcontrol.c: remove unnecessary 'break' in mem_cgroup_read() mm/memcontrol.c: remove redundant BUG_ON() in mem_cgroup_usage_unregister_event() mm/memcontrol.c: s/stealed/stolen/ memcg: fix performance of mem_cgroup_begin_update_page_stat() memcg: remove PCG_FILE_MAPPED memcg: use new logic for page stat accounting memcg: remove PCG_MOVE_LOCK flag from page_cgroup memcg: simplify move_account() check memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat) memcg: kill dead prev_priority stubs memcg: remove PCG_CACHE page_cgroup flag memcg: let css_get_next() rely upon rcu_read_lock() cgroup: revert ss_id_lock to spinlock idr: make idr_get_next() good for rcu_read_lock() memcg: remove unnecessary thp check in page stat accounting memcg: remove redundant returns memcg: enum lru_list lru ...
2012-03-21hugetlbfs: fix alignment of huge page requestsSteven Truelove
When calling shmget() with SHM_HUGETLB, shmget aligns the request size to PAGE_SIZE, but this is not sufficient. Modify hugetlb_file_setup() to align requests to the huge page size, and to accept an address argument so that all alignment checks can be performed in hugetlb_file_setup(), rather than in its callers. Change newseg() and mmap_pgoff() to match the new prototype and eliminate a now redundant alignment check. [akpm@linux-foundation.org: fix build] Signed-off-by: Steven Truelove <steven.truelove@utoronto.ca> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile 1 from Al Viro: "This is _not_ all; in particular, Miklos' and Jan's stuff is not there yet." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (64 commits) ext4: initialization of ext4_li_mtx needs to be done earlier debugfs-related mode_t whack-a-mole hfsplus: add an ioctl to bless files hfsplus: change finder_info to u32 hfsplus: initialise userflags qnx4: new helper - try_extent() qnx4: get rid of qnx4_bread/qnx4_getblk take removal of PF_FORKNOEXEC to flush_old_exec() trim includes in inode.c um: uml_dup_mmap() relies on ->mmap_sem being held, but activate_mm() doesn't hold it um: embed ->stub_pages[] into mmu_context gadgetfs: list_for_each_safe() misuse ocfs2: fix leaks on failure exits in module_init ecryptfs: make register_filesystem() the last potential failure exit ntfs: forgets to unregister sysctls on register_filesystem() failure logfs: missing cleanup on register_filesystem() failure jfs: mising cleanup on register_filesystem() failure make configfs_pin_fs() return root dentry on success configfs: configfs_create_dir() has parent dentry in dentry->d_parent configfs: sanitize configfs_create() ...
2012-03-20switch open-coded instances of d_make_root() to new helperAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-03-15[PATCH v3] ipc: provide generic compat versions of IPC syscallsChris Metcalf
When using the "compat" APIs, architectures will generally want to be able to make direct syscalls to msgsnd(), shmctl(), etc., and in the kernel we would want them to be handled directly by compat_sys_xxx() functions, as is true for other compat syscalls. However, for historical reasons, several of the existing compat IPC syscalls do not do this. semctl() expects a pointer to the fourth argument, instead of the fourth argument itself. msgsnd(), msgrcv() and shmat() expect arguments in different order. This change adds an ARCH_WANT_OLD_COMPAT_IPC config option that can be set to preserve this behavior for ports that use it (x86, sparc, powerpc, s390, and mips). No actual semantics are changed for those architectures, and there is only a minimal amount of code refactoring in ipc/compat.c. Newer architectures like tile (and perhaps future architectures such as arm64 and unicore64) should not select this option, and thus can avoid having any IPC-specific code at all in their architecture-specific compat layer. In the same vein, if this option is not selected, IPC_64 mode is assumed, since that's what the <asm-generic> headers expect. The workaround code in "tile" for msgsnd() and msgrcv() is removed with this change; it also fixes the bug that shmat() and semctl() were not being properly handled. Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
2012-02-14security: trim security.hAl Viro
Trim security.h Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Morris <jmorris@namei.org>
2012-01-23SHM_UNLOCK: fix Unevictable pages stranded after swapHugh Dickins
Commit cc39c6a9bbde ("mm: account skipped entries to avoid looping in find_get_pages") correctly fixed an infinite loop; but left a problem that find_get_pages() on shmem would return 0 (appearing to callers to mean end of tree) when it meets a run of nr_pages swap entries. The only uses of find_get_pages() on shmem are via pagevec_lookup(), called from invalidate_mapping_pages(), and from shmctl SHM_UNLOCK's scan_mapping_unevictable_pages(). The first is already commented, and not worth worrying about; but the second can leave pages on the Unevictable list after an unusual sequence of swapping and locking. Fix that by using shmem_find_get_pages_and_swap() (then ignoring the swap) instead of pagevec_lookup(). But I don't want to contaminate vmscan.c with shmem internals, nor shmem.c with LRU locking. So move scan_mapping_unevictable_pages() into shmem.c, renaming it shmem_unlock_mapping(); and rename check_move_unevictable_page() to check_move_unevictable_pages(), looping down an array of pages, oftentimes under the same lock. Leave out the "rotate unevictable list" block: that's a leftover from when this was used for /proc/sys/vm/scan_unevictable_pages, whose flawed handling involved looking at pages at tail of LRU. Was there significance to the sequence first ClearPageUnevictable, then test page_evictable, then SetPageUnevictable here? I think not, we're under LRU lock, and have no barriers between those. Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shaohua.li@intel.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michel Lespinasse <walken@google.com> Cc: <stable@vger.kernel.org> [back to 3.1 but will need respins] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-23SHM_UNLOCK: fix long unpreemptible sectionHugh Dickins
scan_mapping_unevictable_pages() is used to make SysV SHM_LOCKed pages evictable again once the shared memory is unlocked. It does this with pagevec_lookup()s across the whole object (which might occupy most of memory), and takes 300ms to unlock 7GB here. A cond_resched() every PAGEVEC_SIZE pages would be good. However, KOSAKI-san points out that this is called under shmem.c's info->lock, and it's also under shm.c's shm_lock(), both spinlocks. There is no strong reason for that: we need to take these pages off the unevictable list soonish, but those locks are not required for it. So move the call to scan_mapping_unevictable_pages() from shmem.c's unlock handling up to shm.c's unlock handling. Remove the recently added barrier, not needed now we have spin_unlock() before the scan. Use get_file(), with subsequent fput(), to make sure we have a reference to mapping throughout scan_mapping_unevictable_pages(): that's something that was previously guaranteed by the shm_lock(). Remove shmctl's lru_add_drain_all(): we don't fault in pages at SHM_LOCK time, and we lazily discover them to be Unevictable later, so it serves no purpose for SHM_LOCK; and serves no purpose for SHM_UNLOCK, since pages still on pagevec are not marked Unevictable. The original code avoided redundant rescans by checking VM_LOCKED flag at its level: now avoid them by checking shp's SHM_LOCKED. The original code called scan_mapping_unevictable_pages() on a locked area at shm_destroy() time: perhaps we once had accounting cross-checks which required that, but not now, so skip the overhead and just let inode eviction deal with them. Put check_move_unevictable_page() and scan_mapping_unevictable_pages() under CONFIG_SHMEM (with stub for the TINY case when ramfs is used), more as comment than to save space; comment them used for SHM_UNLOCK. Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shaohua.li@intel.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-23ipc/mqueue: simplify reading msgqueue limitDavidlohr Bueso
Because the current task is being used to get the limit, we can simply use rlimit() instead of task_rlimit(). Signed-off-by: Davidlohr Bueso <dave@gnu.org> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-10user namespace: make signal.c respect user namespacesSerge E. Hallyn
ipc/mqueue.c: for __SI_MESQ, convert the uid being sent to recipient's user namespace. (new, thanks Oleg) __send_signal: convert current's uid to the recipient's user namespace for any siginfo which is not SI_FROMKERNEL (patch from Oleg, thanks again :) do_notify_parent and do_notify_parent_cldstop: map task's uid to parent's user namespace ptrace_signal maps parent's uid into current's user namespace before including in signal to current. IIUC Oleg has argued that this shouldn't matter as the debugger will play with it, but it seems like not converting the value currently being set is misleading. Changelog: Sep 20: Inspired by Oleg's suggestion, define map_cred_ns() helper to simplify callers and help make clear what we are translating (which uid into which namespace). Passing the target task would make callers even easier to read, but we pass in user_ns because current_user_ns() != task_cred_xxx(current, user_ns). Sep 20: As recommended by Oleg, also put task_pid_vnr() under rcu_read_lock in ptrace_signal(). Sep 23: In send_signal(), detect when (user) signal is coming from an ancestor or unrelated user namespace. Pass that on to __send_signal, which sets si_uid to 0 or overflowuid if needed. Oct 12: Base on Oleg's fixup_uid() patch. On top of that, handle all SI_FROMKERNEL cases at callers, because we can't assume sender is current in those cases. Nov 10: (mhelsley) rename fixup_uid to more meaningful usern_fixup_signal_uid Nov 10: (akpm) make the !CONFIG_USER_NS case clearer Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> From: Serge Hallyn <serge.hallyn@canonical.com> Subject: __send_signal: pass q->info, not info, to userns_fixup_signal_uid (v2) Eric Biederman pointed out that passing info is a bug and could lead to a NULL pointer deref to boot. A collection of signal, securebits, filecaps, cap_bounds, and a few other ltp tests passed with this kernel. Changelog: Nov 18: previous patch missed a leading '&' Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> From: Dan Carpenter <dan.carpenter@oracle.com> Subject: ipc/mqueue: lock() => unlock() typo There was a double lock typo introduced in b085f4bd6b21 "user namespace: make signal.c respect user namespaces" Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-03switch mq_open() to umode_tAl Viro
2012-01-03mqueue: propagate umode_tAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03switch ->create() to umode_tAl Viro
vfs_create() ignores everything outside of 16bit subset of its mode argument; switching it to umode_t is obviously equivalent and it's the only caller of the method Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03vfs: fix the stupidity with i_dentry in inode destructorsAl Viro
Seeing that just about every destructor got that INIT_LIST_HEAD() copied into it, there is no point whatsoever keeping this INIT_LIST_HEAD in inode_init_once(); the cost of taking it into inode_init_always() will be negligible for pipes and sockets and negative for everything else. Not to mention the removal of boilerplate code from ->destroy_inode() instances... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-12-09... and the same kind of leak for mqueueAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-11-02ipc/sem.c: remove private structures from public header fileManfred Spraul
include/linux/sem.h contains several structures that are only used within ipc/sem.c. The patch moves them into ipc/sem.c - there is no need to expose the structures to the whole kernel. No functional changes, only whitespace cleanups and 80-char per line fixes. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02ipc/sem.c: handle spurious wakeupsManfred Spraul
semtimedop() does not handle spurious wakeups, it returns -EINTR to user space. Most other schedule() users would just loop and not return to user space. The patch adds such a loop to semtimedop() Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Reported-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02ipc/sem.c: fix return code race with semop vs. semop +semctl(IPC_RMID)Manfred Spraul
sys_semtimedop() may return -EIDRM although the semaphore operation completed successfully: thread 1: thread 2: semtimedop(), sleeps semop(): * acquires sem_lock() semtimedop() woken up due to timeout sem_lock() loops * notices that thread 2 could be completed. * performs the operations that thread 2 is sleeping on. * marks the semaphore operation as IN_WAKEUP * drops sem_lock(), does wakeup, sets return code to 0 * thread delayed due to interrupt, whatever * returns to user space * thread still delayed semctl(IPC_RMID) * acquires sem_lock() * ipc_rmid(), ipcp->deleted=1 * drops sem_lock() * thread finally continues - but seem_lock() now fails due to ipcp->deleted == 1 * returns -EIDRM instead of 0 The fix is trivial: Always use the return code in queue.status. In real world, the race probably doesn't matter: If the semaphore array is destroyed, the app is probably not interested if the last operation succeeded or was already cancelled. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31ipc/mqueue.c: fix wrong use of schedule_hrtimeout_range_clock()Wanlong Gao
Fix the wrong use of schedule_hrtimeout_range_clock() in wq_sleep(), although it is harmless for the syscall mq_timed* now. It was introduced by 9ca7d8e ("mqueue: Convert message queue timeout to use hrtimers"). Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com> Cc: Carsten Emde <C.Emde@osadl.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-04Do 'shm_init_ns()' in an early pure_initcallLinus Torvalds
This isn't really critical any more, since other patches (commit 298507d4d2cf: "shm: optimize exit_shm()") have caused us to not actually need to touch the rw_mutex unless there are actual shm segments associated with the namespace, but we really should do tne shm_init_ns() earlier than we do now. This, together with commit 288d5abec831 ("Boot up with usermodehelper disabled") will mean that we really do initialize the initial ipc namespace data structure before we run any tasks. Tested-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-03shm: optimize exit_shm()Vasiliy Kulikov
We may optimistically check .in_use == 0 without holding the rw_mutex: it's the common case, and if it's zero, there certainly won't be any segments associated with us. After taking the lock, the idr_for_each() will do the right thing, so we could now drop the re-check inside the lock without any real cost. But it won't hurt. Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-03shm: fix wrong testsVasiliy Kulikov
Commit 4c677e2eefdb ("shm: optimize locking and ipc_namespace getting") introduced a copy-paste bug. Due to the bug cycle optimizations were disabled. Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-30shm: optimize locking and ipc_namespace gettingVasiliy Kulikov
shm_lock() does a lookup of shm segment in shm_ids(ns).ipcs_idr, which is redundant as we already know shmid_kernel address. An actual lock is also not required for reads until we really want to destroy the segment. exit_shm() and shm_destroy_orphaned() may avoid the loop by checking whether there is at least one segment in current ipc_namespace. The check of nsproxy and ipc_ns against NULL is redundant as exit_shm() is called from do_exit() before the call to exit_notify(), so the dereferencing current->nsproxy->ipc_ns is guaranteed to be safe. Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-30shm: handle separate PID namespaces caseVasiliy Kulikov
shm_try_destroy_orphaned() and shm_try_destroy_current() didn't handle the case of separate PID namespaces, but a single IPC namespace. If there are tasks with the same PID values using the same shmem object, the wrong destroy decision could be reached. On shm segment creation store the pointer to the creator task in shmid_kernel->shm_creator field and zero it on task exit. Then use the ->shm_creator insread of shm_cprid in both functions. As shmid_kernel object is already locked at this stage, no additional locking is needed. Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-26ipc: introduce shm_rmid_forced sysctlVasiliy Kulikov
Add support for the shm_rmid_forced sysctl. If set to 1, all shared memory objects in current ipc namespace will be automatically forced to use IPC_RMID. The POSIX way of handling shmem allows one to create shm objects and call shmdt(), leaving shm object associated with no process, thus consuming memory not counted via rlimits. With shm_rmid_forced=1 the shared memory object is counted at least for one process, so OOM killer may effectively kill the fat process holding the shared memory. It obviously breaks POSIX - some programs relying on the feature would stop working. So set shm_rmid_forced=1 only if you're sure nobody uses "orphaned" memory. Use shm_rmid_forced=0 by default for compatability reasons. The feature was previously impemented in -ow as a configure option. [akpm@linux-foundation.org: fix documentation, per Randy] [akpm@linux-foundation.org: fix warning] [akpm@linux-foundation.org: readability/conventionality tweaks] [akpm@linux-foundation.org: fix shm_rmid_forced/shm_forced_rmid confusion, use standard comment layout] Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Cc: Randy Dunlap <rdunlap@xenotime.net> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Serge E. Hallyn" <serge.hallyn@canonical.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Solar Designer <solar@openwall.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-26ipc/mqueue.c: fix mq_open() return valueJiri Slaby
We return ENOMEM from mqueue_get_inode even when we have enough memory. Namely in case the system rlimit of mqueue was reached. This error propagates to mq_queue and user sees the error unexpectedly. So fix this up to properly return EMFILE as described in the manpage: EMFILE The process already has the maximum number of files and message queues open. instead of: ENOMEM Insufficient memory. With the previous patch we just switch to ERR_PTR/PTR_ERR/IS_ERR error handling here. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-26ipc/mqueue.c: refactor failure handlingJiri Slaby
If new_inode fails to allocate an inode we need only to return with NULL. But now we test the opposite and have all the work in a nested block. So do the opposite to save one indentation level (and remove unnecessary line breaks). This is only a preparation/cleanup for the next patch where we fix up return values from mqueue_get_inode. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-25ipc/sem.c: fix race with concurrent semtimedop() timeouts and IPC_RMIDManfred Spraul
If a semaphore array is removed and in parallel a sleeping task is woken up (signal or timeout, does not matter), then the woken up task does not wait until wake_up_sem_queue_do() is completed. This will cause crashes, because wake_up_sem_queue_do() will read from a stale pointer. The fix is simple: Regardless of anything, always call get_queue_result(). This function waits until wake_up_sem_queue_do() has finished it's task. Addresses https://bugzilla.kernel.org/show_bug.cgi?id=27142 Reported-by: Yuriy Yevtukhov <yuriy@ucoz.com> Reported-by: Harald Laabs <kernel@dasr.de> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: <stable@kernel.org> [2.6.35+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-22Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (107 commits) vfs: use ERR_CAST for err-ptr tossing in lookup_instantiate_filp isofs: Remove global fs lock jffs2: fix IN_DELETE_SELF on overwriting rename() killing a directory fix IN_DELETE_SELF on overwriting rename() on ramfs et.al. mm/truncate.c: fix build for CONFIG_BLOCK not enabled fs:update the NOTE of the file_operations structure Remove dead code in dget_parent() AFS: Fix silly characters in a comment switch d_add_ci() to d_splice_alias() in "found negative" case as well simplify gfs2_lookup() jfs_lookup(): don't bother with . or .. get rid of useless dget_parent() in btrfs rename() and link() get rid of useless dget_parent() in fs/btrfs/ioctl.c fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers drivers: fix up various ->llseek() implementations fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseek Ext4: handle SEEK_HOLE/SEEK_DATA generically Btrfs: implement our own ->llseek fs: add SEEK_HOLE and SEEK_DATA flags reiserfs: make reiserfs default to barrier=flush ... Fix up trivial conflicts in fs/xfs/linux-2.6/xfs_super.c due to the new shrinker callout for the inode cache, that clashed with the xfs code to start the periodic workers later.
2011-07-20fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlersJosef Bacik
Btrfs needs to be able to control how filemap_write_and_wait_range() is called in fsync to make it less of a painful operation, so push down taking i_mutex and the calling of filemap_write_and_wait() down into the ->fsync() handlers. Some file systems can drop taking the i_mutex altogether it seems, like ext3 and ocfs2. For correctness sake I just pushed everything down in all cases to make sure that we keep the current behavior the same for everybody, and then each individual fs maintainer can make up their mind about what to do from there. Thanks, Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20ipc,rcu: Convert call_rcu(ipc_immediate_free) to kfree_rcu()Lai Jiangshan
The rcu callback ipc_immediate_free() just calls a kfree(), so we use kfree_rcu() instead of the call_rcu(ipc_immediate_free). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-07-20ipc,rcu: Convert call_rcu(free_un) to kfree_rcu()Lai Jiangshan
The rcu callback free_un() just calls a kfree(), so we use kfree_rcu() instead of the call_rcu(free_un). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Manfred Spraul <manfred@colorfullife.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-26mm: don't access vm_flags as 'int'KOSAKI Motohiro
The type of vma->vm_flags is 'unsigned long'. Neither 'int' nor 'unsigned int'. This patch fixes such misuse. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [ Changed to use a typedef - we'll extend it to cover more cases later, since there has been discussion about making it a 64-bit type.. - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-10ns proc: Add support for the ipc namespaceEric W. Biederman
Acked-by: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2011-03-31Fix common misspellingsLucas De Marchi
Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>