path: root/net/socket.c
AgeCommit message (Collapse)Author
2013-10-03net: heap overflow in __audit_sockaddr()Dan Carpenter
We need to cap ->msg_namelen or it leads to a buffer overflow when we to the memcpy() in __audit_sockaddr(). It requires CAP_AUDIT_CONTROL to exploit this bug. The call tree is: ___sys_recvmsg() move_addr_to_user() audit_sockaddr() __audit_sockaddr() Reported-by: Jüri Aedla <juri.aedla@gmail.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-13Merge git://git.kvack.org/~bcrl/aio-nextLinus Torvalds
Pull aio changes from Ben LaHaise: "First off, sorry for this pull request being late in the merge window. Al had raised a couple of concerns about 2 items in the series below. I addressed the first issue (the race introduced by Gu's use of mm_populate()), but he has not provided any further details on how he wants to rework the anon_inode.c changes (which were sent out months ago but have yet to be commented on). The bulk of the changes have been sitting in the -next tree for a few months, with all the issues raised being addressed" * git://git.kvack.org/~bcrl/aio-next: (22 commits) aio: rcu_read_lock protection for new rcu_dereference calls aio: fix race in ring buffer page lookup introduced by page migration support aio: fix rcu sparse warnings introduced by ioctx table lookup patch aio: remove unnecessary debugging from aio_free_ring() aio: table lookup: verify ctx pointer staging/lustre: kiocb->ki_left is removed aio: fix error handling and rcu usage in "convert the ioctx list to table lookup v3" aio: be defensive to ensure request batching is non-zero instead of BUG_ON() aio: convert the ioctx list to table lookup v3 aio: double aio_max_nr in calculations aio: Kill ki_dtor aio: Kill ki_users aio: Kill unneeded kiocb members aio: Kill aio_rw_vect_retry() aio: Don't use ctx->tail unnecessarily aio: io_cancel() no longer returns the io_event aio: percpu ioctx refcount aio: percpu reqs_available aio: reqs_active -> reqs_available aio: fix build when migration is disabled ...
2013-09-11kernel-wide: fix missing validations on __get/__put/__copy_to/__copy_from_user()Mathieu Desnoyers
I found the following pattern that leads in to interesting findings: grep -r "ret.*|=.*__put_user" * grep -r "ret.*|=.*__get_user" * grep -r "ret.*|=.*__copy" * The __put_user() calls in compat_ioctl.c, ptrace compat, signal compat, since those appear in compat code, we could probably expect the kernel addresses not to be reachable in the lower 32-bit range, so I think they might not be exploitable. For the "__get_user" cases, I don't think those are exploitable: the worse that can happen is that the kernel will copy kernel memory into in-kernel buffers, and will fail immediately afterward. The alpha csum_partial_copy_from_user() seems to be missing the access_ok() check entirely. The fix is inspired from x86. This could lead to information leak on alpha. I also noticed that many architectures map csum_partial_copy_from_user() to csum_partial_copy_generic(), but I wonder if the latter is performing the access checks on every architectures. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-01net: rename CONFIG_NET_LL_RX_POLL to CONFIG_NET_RX_BUSY_POLLCong Wang
Eliezer renames several *ll_poll to *busy_poll, but forgets CONFIG_NET_LL_RX_POLL, so in case of confusion, rename it too. Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-30aio: Kill ki_dtorKent Overstreet
sock_aio_dtor() is dead code - and stuff that does need to do cleanup can simply do it before calling aio_complete(). Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2013-07-30aio: Kill aio_rw_vect_retry()Kent Overstreet
This code doesn't serve any purpose anymore, since the aio retry infrastructure has been removed. This change should be safe because aio_read/write are also used for synchronous IO, and called from do_sync_read()/do_sync_write() - and there's no looping done in the sync case (the read and write syscalls). Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2013-07-10net: rename busy poll socket op and globalsEliezer Tamir
Rename LL_SO to BUSY_POLL_SO Rename sysctl_net_ll_{read,poll} to sysctl_busy_{read,poll} Fix up users of these variables. Fix documentation for sysctl. a patch for the socket.7 man page will follow separately, because of limitations of my mail setup. Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-10net: rename include/net/ll_poll.h to include/net/busy_poll.hEliezer Tamir
Rename the file and correct all the places where it is included. Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-08net: rename low latency sockets functions to busy pollEliezer Tamir
Rename functions in include/net/ll_poll.h to busy wait. Clarify documentation about expected power use increase. Rename POLL_LL to POLL_BUSY_LOOP. Add need_resched() testing to poll/select busy loops. Note, that in select and poll can_busy_poll is dynamic and is updated continuously to reflect the existence of supported sockets with valid queue information. Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-25net: poll/select low latency socket supportEliezer Tamir
select/poll busy-poll support. Split sysctl value into two separate ones, one for read and one for poll. updated Documentation/sysctl/net.txt Add a new poll flag POLL_LL. When this flag is set, sock_poll will call sk_poll_ll if possible. sock_poll sets this flag in its return value to indicate to select/poll when a socket that can busy poll is found. When poll/select have nothing to report, call the low-level sock_poll again until we are out of time or we find something. Once the system call finds something, it stops setting POLL_LL, so it can return the result to the user ASAP. Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-17net: add socket option for low latency pollingEliezer Tamir
adds a socket option for low latency polling. This allows overriding the global sysctl value with a per-socket one. Unexport sysctl_net_ll_poll since for now it's not needed in modules. Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-17net: change sysctl_net_ll_poll into an unsigned intEliezer Tamir
There is no reason for sysctl_net_ll_poll to be an unsigned long. Change it into an unsigned int. Fix the proc handler. Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-10net: add low latency socket pollEliezer Tamir
Adds an ndo_ll_poll method and the code that supports it. This method can be used by low latency applications to busy-poll Ethernet device queues directly from the socket code. sysctl_net_ll_poll controls how many microseconds to poll. Default is zero (disabled). Individual protocol support will be added by subsequent patches. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Tested-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Merge 'net' into 'net-next' to get the MSG_CMSG_COMPAT regression fix. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-06net: Unbreak compat_sys_{send,recv}msgAndy Lutomirski
I broke them in this commit: commit 1be374a0518a288147c6a7398792583200a67261 Author: Andy Lutomirski <luto@amacapital.net> Date: Wed May 22 14:07:44 2013 -0700 net: Block MSG_CMSG_COMPAT in send(m)msg and recv(m)msg This patch adds __sys_sendmsg and __sys_sendmsg as common helpers that accept MSG_CMSG_COMPAT and blocks MSG_CMSG_COMPAT at the syscall entrypoints. It also reverts some unnecessary checks in sys_socketcall. Apparently I was suffering from underscore blindness the first time around. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Tested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-06Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Conflicts: net/netfilter/nf_log.c The conflict in nf_log.c is that in 'net' we added CONFIG_PROC_FS protection around foo_proc_entry() calls to fix a build failure, whereas in Pablo's tree a guard if() test around a call is remove_proc_entry() was removed. Trivially resolved. Pablo Neira Ayuso says: ==================== The following patchset contains the first batch of Netfilter/IPVS updates for your net-next tree, they are: * Three patches with improvements and code refactorization for nfnetlink_queue, from Florian Westphal. * FTP helper now parses replies without brackets, as RFC1123 recommends, from Jeff Mahoney. * Rise a warning to tell everyone about ULOG deprecation, NFLOG has been already in the kernel tree for long time and supersedes the old logging over netlink stub, from myself. * Don't panic if we fail to load netfilter core framework, just bail out instead, from myself. * Add cond_resched_rcu, used by IPVS to allow rescheduling while walking over big hashtables, from Simon Horman. * Change type of IPVS sysctl_sync_qlen_max sysctl to avoid possible overflow, from Zhang Yanfei. * Use strlcpy instead of strncpy to skip zeroing of already initialized area to write the extension names in ebtables, from Chen Gang. * Use already existing per-cpu notrack object from xt_CT, from Eric Dumazet. * Save explicit socket lookup in xt_socket now that we have early demux, also from Eric Dumazet. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-28net: Block MSG_CMSG_COMPAT in send(m)msg and recv(m)msgAndy Lutomirski
To: linux-kernel@vger.kernel.org Cc: x86@kernel.org, trinity@vger.kernel.org, Andy Lutomirski <luto@amacapital.net>, netdev@vger.kernel.org, "David S. Miller" <davem@davemloft.net> Subject: [PATCH 5/5] net: Block MSG_CMSG_COMPAT in send(m)msg and recv(m)msg MSG_CMSG_COMPAT is (AFAIK) not intended to be part of the API -- it's a hack that steals a bit to indicate to other networking code that a compat entry was used. So don't allow it from a non-compat syscall. This prevents an oops when running this code: int main() { int s; struct sockaddr_in addr; struct msghdr *hdr; char *highpage = mmap((void*)(TASK_SIZE_MAX - 4096), 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0); if (highpage == MAP_FAILED) err(1, "mmap"); s = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP); if (s == -1) err(1, "socket"); addr.sin_family = AF_INET; addr.sin_port = htons(1); addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK); if (connect(s, (struct sockaddr*)&addr, sizeof(addr)) != 0) err(1, "connect"); void *evil = highpage + 4096 - COMPAT_MSGHDR_SIZE; printf("Evil address is %p\n", evil); if (syscall(__NR_sendmmsg, s, evil, 1, MSG_CMSG_COMPAT) < 0) err(1, "sendmmsg"); return 0; } Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-23netfilter: don't panic on error while walking through the init pathPablo Neira Ayuso
Don't panic if we hit an error while adding the nf_log or pernet netfilter support, just bail out. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Gao feng <gaofeng@cn.fujitsu.com>
2013-05-11Merge git://git.infradead.org/users/eparis/auditLinus Torvalds
Pull audit changes from Eric Paris: "Al used to send pull requests every couple of years but he told me to just start pushing them to you directly. Our touching outside of core audit code is pretty straight forward. A couple of interface changes which hit net/. A simple argument bug calling audit functions in namei.c and the removal of some assembly branch prediction code on ppc" * git://git.infradead.org/users/eparis/audit: (31 commits) audit: fix message spacing printing auid Revert "audit: move kaudit thread start from auditd registration to kaudit init" audit: vfs: fix audit_inode call in O_CREAT case of do_last audit: Make testing for a valid loginuid explicit. audit: fix event coverage of AUDIT_ANOM_LINK audit: use spin_lock in audit_receive_msg to process tty logging audit: do not needlessly take a lock in tty_audit_exit audit: do not needlessly take a spinlock in copy_signal audit: add an option to control logging of passwords with pam_tty_audit audit: use spin_lock_irqsave/restore in audit tty code helper for some session id stuff audit: use a consistent audit helper to log lsm information audit: push loginuid and sessionid processing down audit: stop pushing loginid, uid, sessionid as arguments audit: remove the old depricated kernel interface audit: make validity checking generic audit: allow checking the type of audit message in the user filter audit: fix build break when AUDIT_DEBUG == 2 audit: remove duplicate export of audit_enabled Audit: do not print error when LSMs disabled ...
2013-05-01Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull VFS updates from Al Viro, Misc cleanups all over the place, mainly wrt /proc interfaces (switch create_proc_entry to proc_create(), get rid of the deprecated create_proc_read_entry() in favor of using proc_create_data() and seq_file etc). 7kloc removed. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits) don't bother with deferred freeing of fdtables proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h proc: Make the PROC_I() and PDE() macros internal to procfs proc: Supply a function to remove a proc entry by PDE take cgroup_open() and cpuset_open() to fs/proc/base.c ppc: Clean up scanlog ppc: Clean up rtas_flash driver somewhat hostap: proc: Use remove_proc_subtree() drm: proc: Use remove_proc_subtree() drm: proc: Use minor->index to label things, not PDE->name drm: Constify drm_proc_list[] zoran: Don't print proc_dir_entry data in debug reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show() proc: Supply an accessor for getting the data from a PDE's parent airo: Use remove_proc_subtree() rtl8192u: Don't need to save device proc dir PDE rtl8187se: Use a dir under /proc/net/r8180/ proc: Add proc_mkdir_data() proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h} proc: Move PDE_NET() to fs/proc/proc_net.c ...
2013-04-29sock_close() couldn't have been called with NULL inode since at least 2.1.earlyAl Viro
... if not since 0.99 or so. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-19net: socket: move ktime2ts to ktime header apiDaniel Borkmann
Currently, ktime2ts is a small helper function that is only used in net/socket.c. Move this helper into the ktime API as a small inline function, so that i) it's maintained together with ktime routines, and ii) also other files can make use of it. The function is named ktime_to_timespec_cond() and placed into the generic part of ktime, since we internally make use of ktime_to_timespec(). ktime_to_timespec() itself does not check the ktime variable for zero, hence, we name this function ktime_to_timespec_cond() for only a conditional conversion, and adapt its users to it. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-14net: sock: make sock_tx_timestamp voidDaniel Borkmann
Currently, sock_tx_timestamp() always returns 0. The comment that describes the sock_tx_timestamp() function wrongly says that it returns an error when an invalid argument is passed (from commit 20d4947353be, ``net: socket infrastructure for SO_TIMESTAMPING''). Make the function void, so that we can also remove all the unneeded if conditions that check for such a _non-existant_ error case in the output path. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-10kernel: audit: beautify code, for extern function, better to check its ↵Chen Gang
parameters by itself __audit_socketcall is an extern function. better to check its parameters by itself. also can return error code, when fail (find invalid parameters). also use macro instead of real hard code number also give related comments for it. Signed-off-by: Chen Gang <gang.chen@asianux.com> [eparis: fix the return value when !CONFIG_AUDIT] Signed-off-by: Eric Paris <eparis@redhat.com>
2013-02-26Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile (part one) from Al Viro: "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent locking violations, etc. The most visible changes here are death of FS_REVAL_DOT (replaced with "has ->d_weak_revalidate()") and a new helper getting from struct file to inode. Some bits of preparation to xattr method interface changes. Misc patches by various people sent this cycle *and* ocfs2 fixes from several cycles ago that should've been upstream right then. PS: the next vfs pile will be xattr stuff." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits) saner proc_get_inode() calling conventions proc: avoid extra pde_put() in proc_fill_super() fs: change return values from -EACCES to -EPERM fs/exec.c: make bprm_mm_init() static ocfs2/dlm: use GFP_ATOMIC inside a spin_lock ocfs2: fix possible use-after-free with AIO ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero target: writev() on single-element vector is pointless export kernel_write(), convert open-coded instances fs: encode_fh: return FILEID_INVALID if invalid fid_type kill f_vfsmnt vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op nfsd: handle vfs_getattr errors in acl protocol switch vfs_getattr() to struct path default SET_PERSONALITY() in linux/elf.h ceph: prepopulate inodes only when request is aborted d_hash_and_lookup(): export, switch open-coded instances 9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate() 9p: split dropping the acls from v9fs_set_create_acl() ...
2013-02-26get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zeroAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-22fs: Preserve error code in get_empty_filp(), part 2Anatol Pomozov
Allocating a file structure in function get_empty_filp() might fail because of several reasons: - not enough memory for file structures - operation is not allowed - user is over its limit Currently the function returns NULL in all cases and we loose the exact reason of the error. All callers of get_empty_filp() assume that the function can fail with ENFILE only. Return error through pointer. Change all callers to preserve this error code. [AV: cleaned up a bit, carved the get_empty_filp() part out into a separate commit (things remaining here deal with alloc_file()), removed pipe(2) behaviour change] Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-11ethtool: fix sparse warningStephen Hemminger
Fixes sparse complaints about dropping __user in casts. warning: cast removes address space of expression Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Acked-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-31wanrouter: completely decouple obsolete code from kernel.Paul Gortmaker
The original suggestion to delete wanrouter started earlier with the mainline commit f0d1b3c2bcc5de8a17af5f2274f7fcde8292b5fc ("net/wanrouter: Deprecate and schedule for removal") in May 2012. More importantly, Dan Carpenter found[1] that the driver had a fundamental breakage introduced back in 2008, with commit 7be6065b39c3 ("netdevice wanrouter: Convert directly reference of netdev->priv"). So we know with certainty that the code hasn't been used by anyone willing to at least take the effort to send an e-mail report of breakage for at least 4 years. This commit does a decouple of the wanrouter subsystem, by going after the Makefile/Kconfig and similar files, so that these mainline files that we are keeping do not have the big wanrouter file/driver deletion commit tied into their history. Once this commit is in place, we then can remove the obsolete cyclomx drivers and similar that have a dependency on CONFIG_WAN_ROUTER_DRIVERS. [1] http://www.spinics.net/lists/netdev/msg218670.html Originally-by: Joe Perches <joe@perches.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2012-10-26cgroup: net_cls: Rework update socket logicDaniel Wagner
The cgroup logic part of net_cls is very similar as the one in net_prio. Let's stream line the net_cls logic with the net_prio one. The net_prio update logic was changed by following commit (note there were some changes necessary later on) commit 406a3c638ce8b17d9704052c07955490f732c2b8 Author: John Fastabend <john.r.fastabend@intel.com> Date: Fri Jul 20 10:39:25 2012 +0000 net: netprio_cgroup: rework update socket logic Instead of updating the sk_cgrp_prioidx struct field on every send this only updates the field when a task is moved via cgroup infrastructure. This allows sockets that may be used by a kernel worker thread to be managed. For example in the iscsi case today a user can put iscsid in a netprio cgroup and control traffic will be sent with the correct sk_cgrp_prioidx value set but as soon as data is sent the kernel worker thread isssues a send and sk_cgrp_prioidx is updated with the kernel worker threads value which is the default case. It seems more correct to only update the field when the user explicitly sets it via control group infrastructure. This allows the users to manage sockets that may be used with other threads. Since classid is now updated when the task is moved between the cgroups, we don't have to call sock_update_classid() from various places to ensure we always using the latest classid value. [v2: Use iterate_fd() instead of open coding] Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Cc: Li Zefan <lizefan@huawei.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Joe Perches <joe@perches.com> Cc: John Fastabend <john.r.fastabend@intel.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Stanislav Kinsbursky <skinsbursky@parallels.com> Cc: Tejun Heo <tj@kernel.org> Cc: <netdev@vger.kernel.org> Cc: <cgroups@vger.kernel.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-26cgroup: net_cls: Pass in task to sock_update_classid()Daniel Wagner
sock_update_classid() assumes that the update operation always are applied on the current task. sock_update_classid() needs to know on which tasks to work on in order to be able to migrate task between cgroups using the struct cgroup_subsys attach() callback. Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Glauber Costa <glommer@parallels.com> Cc: Joe Perches <joe@perches.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Stanislav Kinsbursky <skinsbursky@parallels.com> Cc: Tejun Heo <tj@kernel.org> Cc: <netdev@vger.kernel.org> Cc: <cgroups@vger.kernel.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-02Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs update from Al Viro: - big one - consolidation of descriptor-related logics; almost all of that is moved to fs/file.c (BTW, I'm seriously tempted to rename the result to fd.c. As it is, we have a situation when file_table.c is about handling of struct file and file.c is about handling of descriptor tables; the reasons are historical - file_table.c used to be about a static array of struct file we used to have way back). A lot of stray ends got cleaned up and converted to saner primitives, disgusting mess in android/binder.c is still disgusting, but at least doesn't poke so much in descriptor table guts anymore. A bunch of relatively minor races got fixed in process, plus an ext4 struct file leak. - related thing - fget_light() partially unuglified; see fdget() in there (and yes, it generates the code as good as we used to have). - also related - bits of Cyrill's procfs stuff that got entangled into that work; _not_ all of it, just the initial move to fs/proc/fd.c and switch of fdinfo to seq_file. - Alex's fs/coredump.c spiltoff - the same story, had been easier to take that commit than mess with conflicts. The rest is a separate pile, this was just a mechanical code movement. - a few misc patches all over the place. Not all for this cycle, there'll be more (and quite a few currently sit in akpm's tree)." Fix up trivial conflicts in the android binder driver, and some fairly simple conflicts due to two different changes to the sock_alloc_file() interface ("take descriptor handling from sock_alloc_file() to callers" vs "net: Providing protocol type via system.sockprotoname xattr of /proc/PID/fd entries" adding a dentry name to the socket) * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits) MAX_LFS_FILESIZE should be a loff_t compat: fs: Generic compat_sys_sendfile implementation fs: push rcu_barrier() from deactivate_locked_super() to filesystems btrfs: reada_extent doesn't need kref for refcount coredump: move core dump functionality into its own file coredump: prevent double-free on an error path in core dumper usb/gadget: fix misannotations fcntl: fix misannotations ceph: don't abuse d_delete() on failure exits hypfs: ->d_parent is never NULL or negative vfs: delete surplus inode NULL check switch simple cases of fget_light to fdget new helpers: fdget()/fdput() switch o2hb_region_dev_write() to fget_light() proc_map_files_readdir(): don't bother with grabbing files make get_file() return its argument vhost_set_vring(): turn pollstart/pollstop into bool switch prctl_set_mm_exe_file() to fget_light() switch xfs_find_handle() to fget_light() switch xfs_swapext() to fget_light() ...
2012-09-27net: remove sk_init() helperEric Dumazet
It seems sk_init() has no value today and even does strange things : # grep . /proc/sys/net/core/?mem_* /proc/sys/net/core/rmem_default:212992 /proc/sys/net/core/rmem_max:131071 /proc/sys/net/core/wmem_default:212992 /proc/sys/net/core/wmem_max:131071 We can remove it completely. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Shan Wei <davidshan@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-26unexport sock_map_fd(), switch to sock_alloc_file()Al Viro
Both modular callers of sock_map_fd() had been buggy; sctp one leaks descriptor and file if copy_to_user() fails, 9p one shouldn't be exposing file in the descriptor table at all. Switch both to sock_alloc_file(), export it, unexport sock_map_fd() and make it static. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26take descriptor handling from sock_alloc_file() to callersAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: net/netfilter/nfnetlink_log.c net/netfilter/xt_LOG.c Rather easy conflict resolution, the 'net' tree had bug fixes to make sure we checked if a socket is a time-wait one or not and elide the logging code if so. Whereas on the 'net-next' side we are calculating the UID and GID from the creds using different interfaces due to the user namespace changes from Eric Biederman. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-05Fix order of arguments to compat_put_time[spec|val]Mikulas Patocka
Commit 644595f89620 ("compat: Handle COMPAT_USE_64BIT_TIME in net/socket.c") introduced a bug where the helper functions to take either a 64-bit or compat time[spec|val] got the arguments in the wrong order, passing the kernel stack pointer off as a user pointer (and vice versa). Because of the user address range check, that in turn then causes an EFAULT due to the user pointer range checking failing for the kernel address. Incorrectly resuling in a failed system call for 32-bit processes with a 64-bit kernel. On odder architectures like HP-PA (with separate user/kernel address spaces), it can be used read kernel memory. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-04net: Providing protocol type via system.sockprotoname xattr of /proc/PID/fd ↵Masatake YAMATO
entries lsof reports some of socket descriptors as "can't identify protocol" like: [yamato@localhost]/tmp% sudo lsof | grep dbus | grep iden dbus-daem 652 dbus 6u sock ... 17812 can't identify protocol dbus-daem 652 dbus 34u sock ... 24689 can't identify protocol dbus-daem 652 dbus 42u sock ... 24739 can't identify protocol dbus-daem 652 dbus 48u sock ... 22329 can't identify protocol ... lsof cannot resolve the protocol used in a socket because procfs doesn't provide the map between inode number on sockfs and protocol type of the socket. For improving the situation this patch adds an extended attribute named 'system.sockprotoname' in which the protocol name for /proc/PID/fd/SOCKET is stored. So lsof can know the protocol for a given /proc/PID/fd/SOCKET with getxattr system call. A few weeks ago I submitted a patch for the same purpose. The patch was introduced /proc/net/sockfs which enumerates inodes and protocols of all sockets alive on a system. However, it was rejected because (1) a global lock was needed, and (2) the layout of struct socket was changed with the patch. This patch doesn't use any global lock; and doesn't change the layout of any structs. In this patch, a protocol name is stored to dentry->d_name of sockfs when new socket is associated with a file descriptor. Before this patch dentry->d_name was not used; it was just filled with empty string. lsof may use an extended attribute named 'system.sockprotoname' to retrieve the value of dentry->d_name. It is nice if we can see the protocol name with ls -l /proc/PID/fd. However, "socket:[#INODE]", the name format returned from sockfs_dname() was already defined. To keep the compatibility between kernel and user land, the extended attribute is used to prepare the value of dentry->d_name. Signed-off-by: Masatake YAMATO <yamato@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-15net: fix info leak in compat dev_ifconf()Mathias Krause
The implementation of dev_ifconf() for the compat ioctl interface uses an intermediate ifc structure allocated in userland for the duration of the syscall. Though, it fails to initialize the padding bytes inserted for alignment and that for leaks four bytes of kernel stack. Add an explicit memset(0) before filling the structure to avoid the info leak. Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-22net: netprio_cgroup: rework update socket logicJohn Fastabend
Instead of updating the sk_cgrp_prioidx struct field on every send this only updates the field when a task is moved via cgroup infrastructure. This allows sockets that may be used by a kernel worker thread to be managed. For example in the iscsi case today a user can put iscsid in a netprio cgroup and control traffic will be sent with the correct sk_cgrp_prioidx value set but as soon as data is sent the kernel worker thread isssues a send and sk_cgrp_prioidx is updated with the kernel worker threads value which is the default case. It seems more correct to only update the field when the user explicitly sets it via control group infrastructure. This allows the users to manage sockets that may be used with other threads. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20tun: fix a crash bug and a memory leakMikulas Patocka
This patch fixes a crash tun_chr_close -> netdev_run_todo -> tun_free_netdev -> sk_release_kernel -> sock_release -> iput(SOCK_INODE(sock)) introduced by commit 1ab5ecb90cb6a3df1476e052f76a6e8f6511cb3d The problem is that this socket is embedded in struct tun_struct, it has no inode, iput is called on invalid inode, which modifies invalid memory and optionally causes a crash. sock_release also decrements sockets_in_use, this causes a bug that "sockets: used" field in /proc/*/net/sockstat keeps on decreasing when creating and closing tun devices. This patch introduces a flag SOCK_EXTERNALLY_ALLOCATED that instructs sock_release to not free the inode and not decrement sockets_in_use, fixing both memory corruption and sockets_in_use underflow. It should be backported to 3.3 an 3.4 stabke. Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> Cc: stable@kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-22Merge branch 'for-3.5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull percpu updates from Tejun Heo: "Contains Alex Shi's three patches to remove percpu_xxx() which overlap with this_cpu_xxx(). There shouldn't be any functional change." * 'for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: remove percpu_xxx() functions x86: replace percpu_xxx funcs with this_cpu_xxx net: replace percpu_xxx funcs with this_cpu_xxx or __this_cpu_xxx
2012-05-15net: Convert net_ratelimit uses to net_<level>_ratelimitedJoe Perches
Standardize the net core ratelimited logging functions. Coalesce formats, align arguments. Change a printk then vprintk sequence to use printf extension %pV. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-14net: replace percpu_xxx funcs with this_cpu_xxx or __this_cpu_xxxAlex Shi
percpu_xxx funcs are duplicated with this_cpu_xxx funcs, so replace them for further code clean up. And in preempt safe scenario, __this_cpu_xxx funcs may has a bit better performance since __this_cpu_xxx has no redundant preempt_enable/preempt_disable on some architectures. Signed-off-by: Alex Shi <alex.shi@intel.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-21net: change big iov allocationsEric Dumazet
iov of more than 8 entries are allocated in sendmsg()/recvmsg() through sock_kmalloc() As these allocations are temporary only and small enough, it makes sense to use plain kmalloc() and avoid sk_omem_alloc atomic overhead. Slightly changed fast path to be even faster. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Mike Waychison <mikew@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-20net sysctl: Initialize the network sysctls sooner to avoid problems.Eric W. Biederman
If the netfilter code is modified to use register_net_sysctl_table the kernel fails to boot because the per net sysctl infrasturce is not setup soon enough. So to avoid races call net_sysctl_init from sock_init(). Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-15net: cleanup unsigned to unsigned intEric Dumazet
Use of "unsigned int" is preferred to bare "unsigned" in net tree. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-05tcp: tcp_sendpages() should call tcp_push() onceEric Dumazet
commit 2f533844242 (tcp: allow splice() to build full TSO packets) added a regression for splice() calls using SPLICE_F_MORE. We need to call tcp_flush() at the end of the last page processed in tcp_sendpages(), or else transmits can be deferred and future sends stall. Add a new internal flag, MSG_SENDPAGE_NOTLAST, acting like MSG_MORE, but with different semantic. For all sendpage() providers, its a transparent change. Only sock_sendpage() and tcp_sendpages() can differentiate the two different flags provided by pipe_to_sendpage() Reported-by: Tom Herbert <therbert@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: H.K. Jerry Chu <hkchu@google.com> Cc: Maciej Żenczykowski <maze@google.com> Cc: Mahesh Bandewar <maheshb@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: Eric Dumazet <eric.dumazet@gmail>com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-29Merge branch 'x86-x32-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x32 support for x86-64 from Ingo Molnar: "This tree introduces the X32 binary format and execution mode for x86: 32-bit data space binaries using 64-bit instructions and 64-bit kernel syscalls. This allows applications whose working set fits into a 32 bits address space to make use of 64-bit instructions while using a 32-bit address space with shorter pointers, more compressed data structures, etc." Fix up trivial context conflicts in arch/x86/{Kconfig,vdso/vma.c} * 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits) x32: Fix alignment fail in struct compat_siginfo x32: Fix stupid ia32/x32 inversion in the siginfo format x32: Add ptrace for x32 x32: Switch to a 64-bit clock_t x32: Provide separate is_ia32_task() and is_x32_task() predicates x86, mtrr: Use explicit sizing and padding for the 64-bit ioctls x86/x32: Fix the binutils auto-detect x32: Warn and disable rather than error if binutils too old x32: Only clear TIF_X32 flag once x32: Make sure TS_COMPAT is cleared for x32 tasks fs: Remove missed ->fds_bits from cessation use of fd_set structs internally fs: Fix close_on_exec pointer in alloc_fdtable x32: Drop non-__vdso weak symbols from the x32 VDSO x32: Fix coding style violations in the x32 VDSO code x32: Add x32 VDSO support x32: Allow x32 to be configured x32: If configured, add x32 system calls to system call tables x32: Handle process creation x32: Signal-related system calls x86: Add #ifdef CONFIG_COMPAT to <asm/sys_ia32.h> ...
2012-03-11net: get rid of some pointless casts to sockaddrMaciej Żenczykowski
The following 4 functions: move_addr_to_kernel move_addr_to_user verify_iovec verify_compat_iovec are always effectively called with a sockaddr_storage. Make this explicit by changing their signature. This removes a large number of casts from sockaddr_storage to sockaddr. Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>