author     Linus Torvalds <torvalds@ppc970.osdl.org>  2005-04-16 15:20:36 -0700
committer  Linus Torvalds <torvalds@ppc970.osdl.org>  2005-04-16 15:20:36 -0700
commit     1da177e4c3f41524e886b7f1b8a0c1fc7321cac2
tree       0bba044c4ce775e45a88a51686b5d9f90697ea9d /Documentation
Linux-2.6.12-rc2 (tag: v2.6.12-rc2)
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.
Let it rip!
Diffstat (limited to 'Documentation')
722 files changed, 177485 insertions, 0 deletions
diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX new file mode 100644 index 000000000000..72dc90f8f4a7 --- /dev/null +++ b/Documentation/00-INDEX @@ -0,0 +1,294 @@ + +This is a brief list of all the files in ./linux/Documentation and what +they contain. If you add a documentation file, please list it here in +alphabetical order as well, or risk being hunted down like a rabid dog. +Please try and keep the descriptions small enough to fit on one line. + Thanks -- Paul G. + +Following translations are available on the WWW: + + - Japanese, maintained by the JF Project (JF@linux.or.jp), at + http://www.linux.or.jp/JF/ + +00-INDEX + - this file. +BK-usage/ + - directory with info on BitKeeper. +BUG-HUNTING + - brute force method of doing binary search of patches to find bug. +Changes + - list of changes that break older software packages. +CodingStyle + - how the boss likes the C code in the kernel to look. +DMA-API.txt + - DMA API, pci_ API & extensions for non-consistent memory machines. +DMA-mapping.txt + - info for PCI drivers using DMA portably across all platforms. +DocBook/ + - directory with DocBook templates etc. for kernel documentation. +IO-mapping.txt + - how to access I/O mapped memory from within device drivers. +IPMI.txt + - info on Linux Intelligent Platform Management Interface (IPMI) Driver. +IRQ-affinity.txt + - how to select which CPU(s) handle which interrupt events on SMP. +ManagementStyle + - how to (attempt to) manage kernel hackers. +MSI-HOWTO.txt + - the Message Signaled Interrupts (MSI) Driver Guide HOWTO and FAQ. +RCU/ + - directory with info on RCU (read-copy update). +README.DAC960 + - info on Mylex DAC960/DAC1100 PCI RAID Controller Driver for Linux. +SAK.txt + - info on Secure Attention Keys. +SubmittingDrivers + - procedure to get a new driver source included into the kernel tree. +SubmittingPatches + - procedure to get a source patch included into the kernel tree. +VGA-softcursor.txt + - how to change your VGA cursor from a blinking underscore. +arm/ + - directory with info about Linux on the ARM architecture. +basic_profiling.txt + - basic instructions for those who wants to profile Linux kernel. +binfmt_misc.txt + - info on the kernel support for extra binary formats. +block/ + - info on the Block I/O (BIO) layer. +cachetlb.txt + - describes the cache/TLB flushing interfaces Linux uses. +cciss.txt + - info, major/minor #'s for Compaq's SMART Array Controllers. +cdrom/ + - directory with information on the CD-ROM drivers that Linux has. +cli-sti-removal.txt + - cli()/sti() removal guide. +computone.txt + - info on Computone Intelliport II/Plus Multiport Serial Driver. +cpqarray.txt + - info on using Compaq's SMART2 Intelligent Disk Array Controllers. +cpu-freq/ + - info on CPU frequency and voltage scaling. +cris/ + - directory with info about Linux on CRIS architecture. +crypto/ + - directory with info on the Crypto API. +debugging-modules.txt + - some notes on debugging modules after Linux 2.6.3. +device-mapper/ + - directory with info on Device Mapper. +devices.txt + - plain ASCII listing of all the nodes in /dev/ with major minor #'s. +digiepca.txt + - info on Digi Intl. {PC,PCI,EISA}Xx and Xem series cards. +dnotify.txt + - info about directory notification in Linux. +driver-model/ + - directory with info about Linux driver model. +dvb/ + - info on Linux Digital Video Broadcast (DVB) subsystem. +early-userspace/ + - info about initramfs, klibc, and userspace early during boot. +eisa.txt + - info on EISA bus support. 
+exception.txt + - how Linux v2.2 handles exceptions without verify_area etc. +fb/ + - directory with info on the frame buffer graphics abstraction layer. +filesystems/ + - directory with info on the various filesystems that Linux supports. +firmware_class/ + - request_firmware() hotplug interface info. +floppy.txt + - notes and driver options for the floppy disk driver. +ftape.txt + - notes about the floppy tape device driver. +hayes-esp.txt + - info on using the Hayes ESP serial driver. +highuid.txt + - notes on the change from 16 bit to 32 bit user/group IDs. +hpet.txt + - High Precision Event Timer Driver for Linux. +hw_random.txt + - info on Linux support for random number generator in i8xx chipsets. +i2c/ + - directory with info about the I2C bus/protocol (2 wire, kHz speed). +i2o/ + - directory with info about the Linux I2O subsystem. +i386/ + - directory with info about Linux on Intel 32 bit architecture. +ia64/ + - directory with info about Linux on Intel 64 bit architecture. +ide.txt + - important info for users of ATA devices (IDE/EIDE disks and CD-ROMS). +initrd.txt + - how to use the RAM disk as an initial/temporary root filesystem. +input/ + - info on Linux input device support. +io_ordering.txt + - info on ordering I/O writes to memory-mapped addresses. +ioctl-number.txt + - how to implement and register device/driver ioctl calls. +iostats.txt + - info on I/O statistics Linux kernel provides. +isapnp.txt + - info on Linux ISA Plug & Play support. +isdn/ + - directory with info on the Linux ISDN support, and supported cards. +java.txt + - info on the in-kernel binary support for Java(tm). +kbuild/ + - directory with info about the kernel build process. +kernel-doc-nano-HOWTO.txt + - mini HowTo on generation and location of kernel documentation files. +kernel-docs.txt + - listing of various WWW + books that document kernel internals. +kernel-parameters.txt + - summary listing of command line / boot prompt args for the kernel. +kobject.txt + - info of the kobject infrastructure of the Linux kernel. +laptop-mode.txt + - How to conserve battery power using laptop-mode. +ldm.txt + - a brief description of LDM (Windows Dynamic Disks). +locks.txt + - info on file locking implementations, flock() vs. fcntl(), etc. +logo.gif + - Full colour GIF image of Linux logo (penguin). +logo.txt + - Info on creator of above logo & site to get additional images from. +m68k/ + - directory with info about Linux on Motorola 68k architecture. +magic-number.txt + - list of magic numbers used to mark/protect kernel data structures. +mandatory.txt + - info on the Linux implementation of Sys V mandatory file locking. +mca.txt + - info on supporting Micro Channel Architecture (e.g. PS/2) systems. +md.txt + - info on boot arguments for the multiple devices driver. +memory.txt + - info on typical Linux memory problems. +mips/ + - directory with info about Linux on MIPS architecture. +mono.txt + - how to execute Mono-based .NET binaries with the help of BINFMT_MISC. +moxa-smartio + - info on installing/using Moxa multiport serial driver. +mtrr.txt + - how to use PPro Memory Type Range Registers to increase performance. +nbd.txt + - info on a TCP implementation of a network block device. +networking/ + - directory with info on various aspects of networking with Linux. +nfsroot.txt + - short guide on setting up a diskless box with NFS root filesystem. +nmi_watchdog.txt + - info on NMI watchdog for SMP systems. +numastat.txt + - info on how to read Numa policy hit/miss statistics in sysfs. 
+oops-tracing.txt + - how to decode those nasty internal kernel error dump messages. +paride.txt + - information about the parallel port IDE subsystem. +parisc/ + - directory with info on using Linux on PA-RISC architecture. +parport.txt + - how to use the parallel-port driver. +parport-lowlevel.txt + - description and usage of the low level parallel port functions. +pci.txt + - info on the PCI subsystem for device driver authors. +pm.txt + - info on Linux power management support. +pnp.txt + - Linux Plug and Play documentation. +power/ + - directory with info on Linux PCI power management. +powerpc/ + - directory with info on using Linux with the PowerPC. +preempt-locking.txt + - info on locking under a preemptive kernel. +ramdisk.txt + - short guide on how to set up and use the RAM disk. +riscom8.txt + - notes on using the RISCom/8 multi-port serial driver. +rocket.txt + - info on the Comtrol RocketPort multiport serial driver. +rpc-cache.txt + - introduction to the caching mechanisms in the sunrpc layer. +rtc.txt + - notes on how to use the Real Time Clock (aka CMOS clock) driver. +s390/ + - directory with info on using Linux on the IBM S390. +sched-coding.txt + - reference for various scheduler-related methods in the O(1) scheduler. +sched-design.txt + - goals, design and implementation of the Linux O(1) scheduler. +sched-domains.txt + - information on scheduling domains. +sched-stats.txt + - information on schedstats (Linux Scheduler Statistics). +scsi/ + - directory with info on Linux scsi support. +serial/ + - directory with info on the low level serial API. +serial-console.txt + - how to set up Linux with a serial line console as the default. +sgi-visws.txt + - short blurb on the SGI Visual Workstations. +sh/ + - directory with info on porting Linux to a new architecture. +smart-config.txt + - description of the Smart Config makefile feature. +smp.txt + - a few notes on symmetric multi-processing. +sonypi.txt + - info on Linux Sony Programmable I/O Device support. +sound/ + - directory with info on sound card support. +sparc/ + - directory with info on using Linux on Sparc architecture. +specialix.txt + - info on hardware/driver for specialix IO8+ multiport serial card. +spinlocks.txt + - info on using spinlocks to provide exclusive access in kernel. +stallion.txt + - info on using the Stallion multiport serial driver. +svga.txt + - short guide on selecting video modes at boot via VGA BIOS. +sx.txt + - info on the Specialix SX/SI multiport serial driver. +sysctl/ + - directory with info on the /proc/sys/* files. +sysrq.txt + - info on the magic SysRq key. +telephony/ + - directory with info on telephony (e.g. voice over IP) support. +time_interpolators.txt + - info on time interpolators. +tipar.txt + - information about Parallel link cable for Texas Instruments handhelds. +tty.txt + - guide to the locking policies of the tty layer. +unicode.txt + - info on the Unicode character/font mapping used in Linux. +uml/ + - directory with infomation about User Mode Linux. +usb/ + - directory with info regarding the Universal Serial Bus. +video4linux/ + - directory with info regarding video/TV/radio cards and linux. +vm/ + - directory with info on the Linux vm code. +voyager.txt + - guide to running Linux on the Voyager architecture. +watchdog/ + - how to auto-reboot Linux if it has "fallen and can't get up". ;-) +x86_64/ + - directory with info on Linux support for AMD x86-64 (Hammer) machines. +xterm-linux.xpm + - XPM image of penguin logo (see logo.txt) sitting on an xterm. 
+zorro.txt + - info on writing drivers for Zorro bus devices found on Amigas. diff --git a/Documentation/BK-usage/00-INDEX b/Documentation/BK-usage/00-INDEX new file mode 100644 index 000000000000..82768784ea52 --- /dev/null +++ b/Documentation/BK-usage/00-INDEX @@ -0,0 +1,51 @@ +bk-kernel-howto.txt: Description of kernel workflow under BitKeeper + +bk-make-sum: Create summary of changesets in one repository and not +another, typically in preparation to be sent to an upstream maintainer. +Typical usage: + cd my-updated-repo + bk-make-sum ~/repo/original-repo + mv /tmp/linus.txt ../original-repo.txt + +bksend: Create readable text output containing summary of changes, GNU +patch of the changes, and BK metadata of changes (as needed for proper +importing into BitKeeper by an upstream maintainer). This output is +suitable for emailing BitKeeper changes. The recipient of this output +may pipe it directly to 'bk receive'. + +bz64wrap: helper script. Uncompressed input is piped to this script, +which compresses its input, and then outputs the uu-/base64-encoded +version of the compressed input. + +cpcset: Copy changeset between unrelated repositories. +Attempts to preserve changeset user, user address, description, in +addition to the changeset (the patch) itself. +Typical usage: + cd my-updated-repo + bk changes # looking for a changeset... + cpcset 1.1511 . ../another-repo + +csets-to-patches: Produces a delta of two BK repositories, in the form +of individual files, each containing a single cset as a GNU patch. +Output is several files, each with the filename "/tmp/rev-$REV.patch" +Typical usage: + cd my-updated-repo + bk changes -L ~/repo/original-repo 2>&1 | \ + perl csets-to-patches + +cset-to-linus: Produces a delta of two BK repositories, in the form of +changeset descriptions, with 'diffstat' output created for each +individual changset. +Typical usage: + cd my-updated-repo + bk changes -L ~/repo/original-repo 2>&1 | \ + perl cset-to-linus > summary.txt + +gcapatch: Generates patch containing changes in local repository. +Typical usage: + cd my-updated-repo + gcapatch > foo.patch + +unbz64wrap: Reverse an encoded, compressed data stream created by +bz64wrap into an uncompressed, typically text/plain output. + diff --git a/Documentation/BK-usage/bk-kernel-howto.txt b/Documentation/BK-usage/bk-kernel-howto.txt new file mode 100644 index 000000000000..b7b9075d2910 --- /dev/null +++ b/Documentation/BK-usage/bk-kernel-howto.txt @@ -0,0 +1,283 @@ + + Doing the BK Thing, Penguin-Style + + + + +This set of notes is intended mainly for kernel developers, occasional +or full-time, but sysadmins and power users may find parts of it useful +as well. It assumes at least a basic familiarity with CVS, both at a +user level (use on the cmd line) and at a higher level (client-server model). +Due to the author's background, an operation may be described in terms +of CVS, or in terms of how that operation differs from CVS. + +This is -not- intended to be BitKeeper documentation. Always run +"bk help <command>" or in X "bk helptool <command>" for reference +documentation. + + +BitKeeper Concepts +------------------ + +In the true nature of the Internet itself, BitKeeper is a distributed +system. When applied to revision control, this means doing away with +client-server, and changing to a parent-child model... essentially +peer-to-peer. On the developer's end, this also represents a +fundamental disruption in the standard workflow of changes, commits, +and merges. 
You will need to take a few minutes to think about +how to best work under BitKeeper, and re-optimize things a bit. +In some sense it is a bit radical, because it might described as +tossing changes out into a maelstrom and having them magically +land at the right destination... but I'm getting ahead of myself. + +Let's start with this progression: +Each BitKeeper source tree on disk is a repository unto itself. +Each repository has a parent (except the root/original, of course). +Each repository contains a set of a changesets ("csets"). +Each cset is one or more changed files, bundled together. + +Each tree is a repository, so all changes are checked into the local +tree. When a change is checked in, all modified files are grouped +into a logical unit, the changeset. Internally, BK links these +changesets in a tree, representing various converging and diverging +lines of development. These changesets are the bread and butter of +the BK system. + +After the concept of changesets, the next thing you need to get used +to is having multiple copies of source trees lying around. This -really- +takes some getting used to, for some people. Separate source trees +are the means in BitKeeper by which you delineate parallel lines +of development, both minor and major. What would be branches in +CVS become separate source trees, or "clones" in BitKeeper [heh, +or Star Wars] terminology. + +Clones and changesets are the tools from which most of the power of +BitKeeper is derived. As mentioned earlier, each clone has a parent, +the tree used as the source when the new clone was created. In a +CVS-like setup, the parent would be a remote server on the Internet, +and the child is your local clone of that tree. + +Once you have established a common baseline between two source trees -- +a common parent -- then you can merge changesets between those two +trees with ease. Merging changes into a tree is called a "pull", and +is analagous to 'cvs update'. A pull downloads all the changesets in +the remote tree you do not have, and merges them. Sending changes in +one tree to another tree is called a "push". Push sends all changes +in the local tree the remote does not yet have, and merges them. + +From these concepts come some initial command examples: + +1) bk clone -q http://linux.bkbits.net/linux-2.5 linus-2.5 +Download a 2.5 stock kernel tree, naming it "linus-2.5" in the local dir. +The "-q" disables listing every single file as it is downloaded. + +2) bk clone -ql linus-2.5 alpha-2.5 +Create a separate source tree for the Alpha AXP architecture. +The "-l" uses hard links instead of copying data, since both trees are +on the local disk. You can also replace the above with "bk lclone -q ..." + +You only clone a tree -once-. After cloning the tree lives a long time +on disk, being updating by pushes and pulls. + +3) cd alpha-2.5 ; bk pull http://gkernel.bkbits.net/alpha-2.5 +Download changes in "alpha-2.5" repository which are not present +in the local repository, and merge them into the source tree. + +4) bk -r co -q +Because every tree is a repository, files must be checked out before +they will be in their standard places in the source tree. + +5) bk vi fs/inode.c # example change... 
+ bk citool # checkin, using X tool + bk push bk://gkernel@bkbits.net/alpha-2.5 # upload change +Typical example of a BK sequence that would replace the analagous CVS +situation, + vi fs/inode.c + cvs commit + +As this is just supposed to be a quick BK intro, for more in-depth +tutorials, live working demos, and docs, see http://www.bitkeeper.com/ + + + +BK and Kernel Development Workflow +---------------------------------- +Currently the latest 2.5 tree is available via "bk clone $URL" +and "bk pull $URL" at http://linux.bkbits.net/linux-2.5 +This should change in a few weeks to a kernel.org URL. + + +A big part of using BitKeeper is organizing the various trees you have +on your local disk, and organizing the flow of changes among those +trees, and remote trees. If one were to graph the relationships between +a desired BK setup, you are likely to see a few-many-few graph, like +this: + + linux-2.5 + | + merge-to-linus-2.5 + / | | + / | | + vm-hacks bugfixes filesys personal-hacks + \ | | / + \ | | / + \ | | / + testing-and-validation + +Since a "bk push" sends all changes not in the target tree, and +since a "bk pull" receives all changes not in the source tree, you want +to make sure you are only pushing specific changes to the desired tree, +not all changes from "peer parent" trees. For example, pushing a change +from the testing-and-validation tree would probably be a bad idea, +because it will push all changes from vm-hacks, bugfixes, filesys, and +personal-hacks trees into the target tree. + +One would typically work on only one "theme" at a time, either +vm-hacks or bugfixes or filesys, keeping those changes isolated in +their own tree during development, and only merge the isolated with +other changes when going upstream (to Linus or other maintainers) or +downstream (to your "union" trees, like testing-and-validation above). + +It should be noted that some of this separation is not just recommended +practice, it's actually [for now] -enforced- by BitKeeper. BitKeeper +requires that changesets maintain a certain order, which is the reason +that "bk push" sends all local changesets the remote doesn't have. This +separation may look like a lot of wasted disk space at first, but it +helps when two unrelated changes may "pollute" the same area of code, or +don't follow the same pace of development, or any other of the standard +reasons why one creates a development branch. + +Small development branches (clones) will appear and disappear: + + -------- A --------- B --------- C --------- D ------- + \ / + -----short-term devel branch----- + +While long-term branches will parallel a tree (or trees), with period +merge points. In this first example, we pull from a tree (pulls, +"\") periodically, such as what occurs when tracking changes in a +vendor tree, never pushing changes back up the line: + + -------- A --------- B --------- C --------- D ------- + \ \ \ + ----long-term devel branch----------------- + +And then a more common case in Linux kernel development, a long term +branch with periodic merges back into the tree (pushes, "/"): + + -------- A --------- B --------- C --------- D ------- + \ \ / \ + ----long-term devel branch----------------- + + + + + +Submitting Changes to Linus +--------------------------- +There's a bit of an art, or style, of submitting changes to Linus. +Since Linus's tree is now (you might say) fully integrated into the +distributed BitKeeper system, there are several prerequisites to +properly submitting a BitKeeper change. 
All these prereq's are just +general cleanliness of BK usage, so as people become experts at BK, feel +free to optimize this process further (assuming Linus agrees, of +course). + + + +0) Make sure your tree was originally cloned from the linux-2.5 tree +created by Linus. If your tree does not have this as its ancestor, it +is impossible to reliably exchange changesets. + + + +1) Pay attention to your commit text. The commit message that +accompanies each changeset you submit will live on forever in history, +and is used by Linus to accurately summarize the changes in each +pre-patch. Remember that there is no context, so + "fix for new scheduler changes" +would be too vague, but + "fix mips64 arch for new scheduler switch_to(), TIF_xxx semantics" +would be much better. + +You can and should use the command "bk comment -C<rev>" to update the +commit text, and improve it after the fact. This is very useful for +development: poor, quick descriptions during development, which get +cleaned up using "bk comment" before issuing the "bk push" to submit the +changes. + + + +2) Include an Internet-available URL for Linus to pull from, such as + + Pull from: http://gkernel.bkbits.net/net-drivers-2.5 + + + +3) Include a summary and "diffstat -p1" of each changeset that will be +downloaded, when Linus issues a "bk pull". The author auto-generates +these summaries using "bk changes -L <parent>", to obtain a listing +of all the pending-to-send changesets, and their commit messages. + +It is important to show Linus what he will be downloading when he issues +a "bk pull", to reduce the time required to sift the changes once they +are downloaded to Linus's local machine. + +IMPORTANT NOTE: One of the features of BK is that your repository does +not have to be up to date, in order for Linus to receive your changes. +It is considered a courtesy to keep your repository fairly recent, to +lessen any potential merge work Linus may need to do. + + +4) Split up your changes. Each maintainer<->Linus situation is likely +to be slightly different here, so take this just as general advice. The +author splits up changes according to "themes" when merging with Linus. +Simultaneous pushes from local development go to special trees which +exist solely to house changes "queued" for Linus. Example of the trees: + + net-drivers-2.5 -- on-going net driver maintenance + vm-2.5 -- VM-related changes + fs-2.5 -- filesystem-related changes + +Linus then has much more freedom for pulling changes. He could (for +example) issue a "bk pull" on vm-2.5 and fs-2.5 trees, to merge their +changes, but hold off net-drivers-2.5 because of a change that needs +more discussion. + +Other maintainers may find that a single linus-pull-from tree is +adequate for passing BK changesets to him. + + + +Frequently Answered Questions +----------------------------- +1) How do I change the e-mail address shown in the changelog? +A. When you run "bk citool" or "bk commit", set environment + variables BK_USER and BK_HOST to the desired username + and host/domain name. + + +2) How do I use tags / get a diff between two kernel versions? +A. Pass the tags Linus uses to 'bk export'. + +ChangeSets are in a forward-progressing order, so it's pretty easy +to get a snapshot starting and ending at any two points in time. 
+Linus puts tags on each release and pre-release, so you could use +these two examples: + + bk export -tpatch -hdu -rv2.5.4,v2.5.5 | less + # creates patch-2.5.5 essentially + bk export -tpatch -du -rv2.5.5-pre1,v2.5.5 | less + # changes from pre1 to final + +A tag is just an alias for a specific changeset... and since changesets +are ordered, a tag is thus a marker for a specific point in time (or +specific state of the tree). + + +3) Is there an easy way to generate One Big Patch versus mainline, + for my long-lived kernel branch? +A. Yes. This requires BK 3.x, though. + + bk export -tpatch -r`bk repogca bk://linux.bkbits.net/linux-2.5`,+ + diff --git a/Documentation/BK-usage/bk-make-sum b/Documentation/BK-usage/bk-make-sum new file mode 100755 index 000000000000..58ca46a0fcc6 --- /dev/null +++ b/Documentation/BK-usage/bk-make-sum @@ -0,0 +1,34 @@ +#!/bin/sh -e +# DIR=$HOME/BK/axp-2.5 +# cd $DIR + +LINUS_REPO=$1 +DIRBASE=`basename $PWD` + +{ +cat <<EOT +Please do a + + bk pull bk://gkernel.bkbits.net/$DIRBASE + +This will update the following files: + +EOT + +bk export -tpatch -hdu -r`bk repogca $LINUS_REPO`,+ | diffstat -p1 2>/dev/null + +cat <<EOT + +through these ChangeSets: + +EOT + +bk changes -L -d'$unless(:MERGE:){ChangeSet|:CSETREV:\n}' $LINUS_REPO | +bk -R prs -h -d'$unless(:MERGE:){<:P:@:HOST:> (:D: :I:)\n$each(:C:){ (:C:)\n}\n}' - + +} > /tmp/linus.txt + +cat <<EOT +Mail text in /tmp/linus.txt; please check and send using your favourite +mailer. +EOT diff --git a/Documentation/BK-usage/bksend b/Documentation/BK-usage/bksend new file mode 100755 index 000000000000..836ca943694f --- /dev/null +++ b/Documentation/BK-usage/bksend @@ -0,0 +1,36 @@ +#!/bin/sh +# A script to format BK changeset output in a manner that is easy to read. +# Andreas Dilger <adilger@turbolabs.com> 13/02/2002 +# +# Add diffstat output after Changelog <adilger@turbolabs.com> 21/02/2002 + +PROG=bksend + +usage() { + echo "usage: $PROG -r<rev>" + echo -e "\twhere <rev> is of the form '1.23', '1.23..', '1.23..1.27'," + echo -e "\tor '+' to indicate the most recent revision" + + exit 1 +} + +case $1 in +-r) REV=$2; shift ;; +-r*) REV=`echo $1 | sed 's/^-r//'` ;; +*) echo "$PROG: no revision given, you probably don't want that";; +esac + +[ -z "$REV" ] && usage + +echo "You can import this changeset into BK by piping this whole message to:" +echo "'| bk receive [path to repository]' or apply the patch as usual." 
+ +SEP="\n===================================================================\n\n" +echo -e $SEP +env PAGER=/bin/cat bk changes -r$REV +echo +bk export -tpatch -du -h -r$REV | diffstat +echo; echo +bk export -tpatch -du -h -r$REV +echo -e $SEP +bk send -wgzip_uu -r$REV - diff --git a/Documentation/BK-usage/bz64wrap b/Documentation/BK-usage/bz64wrap new file mode 100755 index 000000000000..be780876849f --- /dev/null +++ b/Documentation/BK-usage/bz64wrap @@ -0,0 +1,41 @@ +#!/bin/sh + +# bz64wrap - the sending side of a bzip2 | base64 stream +# Andreas Dilger <adilger@clusterfs.com> Jan 2002 + + +PATH=$PATH:/usr/bin:/usr/local/bin:/usr/freeware/bin + +# A program to generate base64 encoding on stdout +BASE64_ENCODE="uuencode -m /dev/stdout" +BASE64_BEGIN= +BASE64_END= + +BZIP=NO +BASE64=NO + +# Test if we have the bzip program installed +bzip2 -c /dev/null > /dev/null 2>&1 && BZIP=YES + +# Test if uuencode can handle the -m (MIME) encoding option +$BASE64_ENCODE < /dev/null > /dev/null 2>&1 && BASE64=YES + +if [ $BASE64 = NO ]; then + BASE64_ENCODE=mimencode + BASE64_BEGIN="begin-base64 644 -" + BASE64_END="====" + + $BASE64_ENCODE < /dev/null > /dev/null 2>&1 && BASE64=YES +fi + +if [ $BZIP = NO -o $BASE64 = NO ]; then + echo "$0: can't use bz64 encoding: bzip2=$BZIP, $BASE64_ENCODE=$BASE64" + exit 1 +fi + +# Sadly, mimencode does not appear to have good "begin" and "end" markers +# like uuencode does, and it is picky about getting the right start/end of +# the base64 stream, so we handle this internally. +echo "$BASE64_BEGIN" +bzip2 -9 | $BASE64_ENCODE +echo "$BASE64_END" diff --git a/Documentation/BK-usage/cpcset b/Documentation/BK-usage/cpcset new file mode 100755 index 000000000000..b8faca97dab9 --- /dev/null +++ b/Documentation/BK-usage/cpcset @@ -0,0 +1,36 @@ +#!/bin/sh +# +# Purpose: Copy changeset patch and description from one +# repository to another, unrelated one. +# +# usage: cpcset [revision] [from-repository] [to-repository] +# + +REV=$1 +FROM=$2 +TO=$3 +TMPF=/tmp/cpcset.$$ + +rm -f $TMPF* + +CWD_SAVE=`pwd` +cd $FROM +bk changes -r$REV | \ + grep -v '^ChangeSet' | \ + sed -e 's/^ //g' > $TMPF.log + +USERHOST=`bk changes -r$REV | grep '^ChangeSet' | awk '{print $4}'` +export BK_USER=`echo $USERHOST | awk '-F@' '{print $1}'` +export BK_HOST=`echo $USERHOST | awk '-F@' '{print $2}'` + +bk export -tpatch -hdu -r$REV > $TMPF.patch && \ +cd $CWD_SAVE && \ +cd $TO && \ +bk import -tpatch -CFR -y"`cat $TMPF.log`" $TMPF.patch . && \ +bk commit -y"`cat $TMPF.log`" + +rm -f $TMPF* + +echo changeset $REV copied. +echo "" + diff --git a/Documentation/BK-usage/cset-to-linus b/Documentation/BK-usage/cset-to-linus new file mode 100755 index 000000000000..d28a96f8c618 --- /dev/null +++ b/Documentation/BK-usage/cset-to-linus @@ -0,0 +1,49 @@ +#!/usr/bin/perl -w + +use strict; + +my ($lhs, $rev, $tmp, $rhs, $s); +my @cset_text = (); +my @pipe_text = (); +my $have_cset = 0; + +while (<>) { + next if /^---/; + + if (($lhs, $tmp, $rhs) = (/^(ChangeSet\@)([^,]+)(, .*)$/)) { + &cset_rev if ($have_cset); + + $rev = $tmp; + $have_cset = 1; + + push(@cset_text, $_); + } + + elsif ($have_cset) { + push(@cset_text, $_); + } +} +&cset_rev if ($have_cset); +exit(0); + + +sub cset_rev { + my $empty_cset = 0; + + open PIPE, "bk export -tpatch -hdu -r $rev | diffstat -p1 2>/dev/null |" or die; + while ($s = <PIPE>) { + $empty_cset = 1 if ($s =~ /0 files changed/); + push(@pipe_text, $s); + } + close(PIPE); + + if (! 
$empty_cset) { + print @cset_text; + print @pipe_text; + print "\n\n"; + } + + @pipe_text = (); + @cset_text = (); +} + diff --git a/Documentation/BK-usage/csets-to-patches b/Documentation/BK-usage/csets-to-patches new file mode 100755 index 000000000000..e2b81c35883f --- /dev/null +++ b/Documentation/BK-usage/csets-to-patches @@ -0,0 +1,44 @@ +#!/usr/bin/perl -w + +use strict; + +my ($lhs, $rev, $tmp, $rhs, $s); +my @cset_text = (); +my @pipe_text = (); +my $have_cset = 0; + +while (<>) { + next if /^---/; + + if (($lhs, $tmp, $rhs) = (/^(ChangeSet\@)([^,]+)(, .*)$/)) { + &cset_rev if ($have_cset); + + $rev = $tmp; + $have_cset = 1; + + push(@cset_text, $_); + } + + elsif ($have_cset) { + push(@cset_text, $_); + } +} +&cset_rev if ($have_cset); +exit(0); + + +sub cset_rev { + my $empty_cset = 0; + + system("bk export -tpatch -du -r $rev > /tmp/rev-$rev.patch"); + + if (! $empty_cset) { + print @cset_text; + print @pipe_text; + print "\n\n"; + } + + @pipe_text = (); + @cset_text = (); +} + diff --git a/Documentation/BK-usage/gcapatch b/Documentation/BK-usage/gcapatch new file mode 100755 index 000000000000..aaeb17dc7c7f --- /dev/null +++ b/Documentation/BK-usage/gcapatch @@ -0,0 +1,8 @@ +#!/bin/sh +# +# Purpose: Generate GNU diff of local changes versus canonical top-of-tree +# +# Usage: gcapatch > foo.patch +# + +bk export -tpatch -hdu -r`bk repogca bk://linux.bkbits.net/linux-2.5`,+ diff --git a/Documentation/BK-usage/unbz64wrap b/Documentation/BK-usage/unbz64wrap new file mode 100755 index 000000000000..4fc3e73e9a81 --- /dev/null +++ b/Documentation/BK-usage/unbz64wrap @@ -0,0 +1,25 @@ +#!/bin/sh + +# unbz64wrap - the receiving side of a bzip2 | base64 stream +# Andreas Dilger <adilger@clusterfs.com> Jan 2002 + +# Sadly, mimencode does not appear to have good "begin" and "end" markers +# like uuencode does, and it is picky about getting the right start/end of +# the base64 stream, so we handle this explicitly here. + +PATH=$PATH:/usr/bin:/usr/local/bin:/usr/freeware/bin + +if mimencode -u < /dev/null > /dev/null 2>&1 ; then + SHOW= + while read LINE; do + case $LINE in + begin-base64*) SHOW=YES ;; + ====) SHOW= ;; + *) [ "$SHOW" ] && echo "$LINE" ;; + esac + done | mimencode -u | bunzip2 + exit $? +else + cat - | uudecode -o /dev/stdout | bunzip2 + exit $? +fi diff --git a/Documentation/BUG-HUNTING b/Documentation/BUG-HUNTING new file mode 100644 index 000000000000..ca29242dbc38 --- /dev/null +++ b/Documentation/BUG-HUNTING @@ -0,0 +1,92 @@ +[Sat Mar 2 10:32:33 PST 1996 KERNEL_BUG-HOWTO lm@sgi.com (Larry McVoy)] + +This is how to track down a bug if you know nothing about kernel hacking. +It's a brute force approach but it works pretty well. + +You need: + + . A reproducible bug - it has to happen predictably (sorry) + . All the kernel tar files from a revision that worked to the + revision that doesn't + +You will then do: + + . Rebuild a revision that you believe works, install, and verify that. + . Do a binary search over the kernels to figure out which one + introduced the bug. I.e., suppose 1.3.28 didn't have the bug, but + you know that 1.3.69 does. Pick a kernel in the middle and build + that, like 1.3.50. Build & test; if it works, pick the mid point + between .50 and .69, else the mid point between .28 and .50. + . You'll narrow it down to the kernel that introduced the bug. You + can probably do better than this but it gets tricky. + + . Narrow it down to a subdirectory + + - Copy kernel that works into "test". Let's say that 3.62 works, + but 3.63 doesn't. 
So you diff -r those two kernels and come + up with a list of directories that changed. For each of those + directories: + + Copy the non-working directory next to the working directory + as "dir.63". + One directory at time, try moving the working directory to + "dir.62" and mv dir.63 dir"time, try + + mv dir dir.62 + mv dir.63 dir + find dir -name '*.[oa]' -print | xargs rm -f + + And then rebuild and retest. Assuming that all related + changes were contained in the sub directory, this should + isolate the change to a directory. + + Problems: changes in header files may have occurred; I've + found in my case that they were self explanatory - you may + or may not want to give up when that happens. + + . Narrow it down to a file + + - You can apply the same technique to each file in the directory, + hoping that the changes in that file are self contained. + + . Narrow it down to a routine + + - You can take the old file and the new file and manually create + a merged file that has + + #ifdef VER62 + routine() + { + ... + } + #else + routine() + { + ... + } + #endif + + And then walk through that file, one routine at a time and + prefix it with + + #define VER62 + /* both routines here */ + #undef VER62 + + Then recompile, retest, move the ifdefs until you find the one + that makes the difference. + +Finally, you take all the info that you have, kernel revisions, bug +description, the extent to which you have narrowed it down, and pass +that off to whomever you believe is the maintainer of that section. +A post to linux.dev.kernel isn't such a bad idea if you've done some +work to narrow it down. + +If you get it down to a routine, you'll probably get a fix in 24 hours. + +My apologies to Linus and the other kernel hackers for describing this +brute force approach, it's hardly what a kernel hacker would do. However, +it does work and it lets non-hackers help fix bugs. And it is cool +because Linux snapshots will let you do this - something that you can't +do with vendor supplied releases. + diff --git a/Documentation/Changes b/Documentation/Changes new file mode 100644 index 000000000000..caa6a5529b6b --- /dev/null +++ b/Documentation/Changes @@ -0,0 +1,410 @@ +Intro +===== + +This document is designed to provide a list of the minimum levels of +software necessary to run the 2.6 kernels, as well as provide brief +instructions regarding any other "Gotchas" users may encounter when +trying life on the Bleeding Edge. If upgrading from a pre-2.4.x +kernel, please consult the Changes file included with 2.4.x kernels for +additional information; most of that information will not be repeated +here. Basically, this document assumes that your system is already +functional and running at least 2.4.x kernels. + +This document is originally based on my "Changes" file for 2.0.x kernels +and therefore owes credit to the same people as that file (Jared Mauch, +Axel Boldt, Alessandro Sigala, and countless other users all over the +'net). + +The latest revision of this document, in various formats, can always +be found at <http://cyberbuzz.gatech.edu/kaboom/linux/Changes-2.4/>. + +Feel free to translate this document. If you do so, please send me a +URL to your translation for inclusion in future revisions of this +document. + +Smotrite file <http://oblom.rnc.ru/linux/kernel/Changes.ru>, yavlyaushisya +russkim perevodom dannogo documenta. + +Visite <http://www2.adi.uam.es/~ender/tecnico/> para obtener la traducción +al español de este documento en varios formatos. 
+ +Eine deutsche Version dieser Datei finden Sie unter +<http://www.stefan-winter.de/Changes-2.4.0.txt>. + +Last updated: October 29th, 2002 + +Chris Ricker (kaboom@gatech.edu or chris.ricker@genetics.utah.edu). + +Current Minimal Requirements +============================ + +Upgrade to at *least* these software revisions before thinking you've +encountered a bug! If you're unsure what version you're currently +running, the suggested command should tell you. + +Again, keep in mind that this list assumes you are already +functionally running a Linux 2.4 kernel. Also, not all tools are +necessary on all systems; obviously, if you don't have any PCMCIA (PC +Card) hardware, for example, you probably needn't concern yourself +with pcmcia-cs. + +o Gnu C 2.95.3 # gcc --version +o Gnu make 3.79.1 # make --version +o binutils 2.12 # ld -v +o util-linux 2.10o # fdformat --version +o module-init-tools 0.9.10 # depmod -V +o e2fsprogs 1.29 # tune2fs +o jfsutils 1.1.3 # fsck.jfs -V +o reiserfsprogs 3.6.3 # reiserfsck -V 2>&1|grep reiserfsprogs +o xfsprogs 2.6.0 # xfs_db -V +o pcmcia-cs 3.1.21 # cardmgr -V +o quota-tools 3.09 # quota -V +o PPP 2.4.0 # pppd --version +o isdn4k-utils 3.1pre1 # isdnctrl 2>&1|grep version +o nfs-utils 1.0.5 # showmount --version +o procps 3.2.0 # ps --version +o oprofile 0.5.3 # oprofiled --version + +Kernel compilation +================== + +GCC +--- + +The gcc version requirements may vary depending on the type of CPU in your +computer. The next paragraph applies to users of x86 CPUs, but not +necessarily to users of other CPUs. Users of other CPUs should obtain +information about their gcc version requirements from another source. + +The recommended compiler for the kernel is gcc 2.95.x (x >= 3), and it +should be used when you need absolute stability. You may use gcc 3.0.x +instead if you wish, although it may cause problems. Later versions of gcc +have not received much testing for Linux kernel compilation, and there are +almost certainly bugs (mainly, but not exclusively, in the kernel) that +will need to be fixed in order to use these compilers. In any case, using +pgcc instead of plain gcc is just asking for trouble. + +The Red Hat gcc 2.96 compiler subtree can also be used to build this tree. +You should ensure you use gcc-2.96-74 or later. gcc-2.96-54 will not build +the kernel correctly. + +In addition, please pay attention to compiler optimization. Anything +greater than -O2 may not be wise. Similarly, if you choose to use gcc-2.95.x +or derivatives, be sure not to use -fstrict-aliasing (which, depending on +your version of gcc 2.95.x, may necessitate using -fno-strict-aliasing). + +Make +---- + +You will need Gnu make 3.79.1 or later to build the kernel. + +Binutils +-------- + +Linux on IA-32 has recently switched from using as86 to using gas for +assembling the 16-bit boot code, removing the need for as86 to compile +your kernel. This change does, however, mean that you need a recent +release of binutils. + +System utilities +================ + +Architectural changes +--------------------- + +DevFS has been obsoleted in favour of udev +(http://www.kernel.org/pub/linux/utils/kernel/hotplug/) + +32-bit UID support is now in place. Have fun! + +Linux documentation for functions is transitioning to inline +documentation via specially-formatted comments near their +definitions in the source. 
These comments can be combined with the +SGML templates in the Documentation/DocBook directory to make DocBook +files, which can then be converted by DocBook stylesheets to PostScript, +HTML, PDF files, and several other formats. In order to convert from +DocBook format to a format of your choice, you'll need to install Jade as +well as the desired DocBook stylesheets. + +Util-linux +---------- + +New versions of util-linux provide *fdisk support for larger disks, +support new options to mount, recognize more supported partition +types, have a fdformat which works with 2.4 kernels, and similar goodies. +You'll probably want to upgrade. + +Ksymoops +-------- + +If the unthinkable happens and your kernel oopses, you'll need a 2.4 +version of ksymoops to decode the report; see REPORTING-BUGS in the +root of the Linux source for more information. + +Module-Init-Tools +----------------- + +A new module loader is now in the kernel that requires module-init-tools +to use. It is backward compatible with the 2.4.x series kernels. + +Mkinitrd +-------- + +These changes to the /lib/modules file tree layout also require that +mkinitrd be upgraded. + +E2fsprogs +--------- + +The latest version of e2fsprogs fixes several bugs in fsck and +debugfs. Obviously, it's a good idea to upgrade. + +JFSutils +-------- + +The jfsutils package contains the utilities for the file system. +The following utilities are available: +o fsck.jfs - initiate replay of the transaction log, and check + and repair a JFS formatted partition. +o mkfs.jfs - create a JFS formatted partition. +o other file system utilities are also available in this package. + +Reiserfsprogs +------------- + +The reiserfsprogs package should be used for reiserfs-3.6.x +(Linux kernels 2.4.x). It is a combined package and contains working +versions of mkreiserfs, resize_reiserfs, debugreiserfs and +reiserfsck. These utils work on both i386 and alpha platforms. + +Xfsprogs +-------- + +The latest version of xfsprogs contains mkfs.xfs, xfs_db, and the +xfs_repair utilities, among others, for the XFS filesystem. It is +architecture independent and any version from 2.0.0 onward should +work correctly with this version of the XFS kernel code (2.6.0 or +later is recommended, due to some significant improvements). + + +Pcmcia-cs +--------- + +PCMCIA (PC Card) support is now partially implemented in the main +kernel source. Pay attention when you recompile your kernel ;-). +Also, be sure to upgrade to the latest pcmcia-cs release. + +Quota-tools +----------- + +Support for 32 bit uid's and gid's is required if you want to use +the newer version 2 quota format. Quota-tools version 3.07 and +newer has this support. Use the recommended version or newer +from the table above. + +Intel IA32 microcode +-------------------- + +A driver has been added to allow updating of Intel IA32 microcode, +accessible as both a devfs regular file and as a normal (misc) +character device. If you are not using devfs you may need to: + +mkdir /dev/cpu +mknod /dev/cpu/microcode c 10 184 +chmod 0644 /dev/cpu/microcode + +as root before you can use this. You'll probably also want to +get the user-space microcode_ctl utility to use with this. + +Powertweak +---------- + +If you are running v0.1.17 or earlier, you should upgrade to +version v0.99.0 or higher. Running old versions may cause problems +with programs using shared memory. + +udev +---- +udev is a userspace application for populating /dev dynamically with +only entries for devices actually present. udev replaces devfs. 
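Referring back to the Intel IA32 microcode note above: the device-node commands it lists can be wrapped in a small guard so they only run when nothing (such as devfs or udev) has already created the node. This is just a sketch of the commands already given, not an additional requirement.

  #!/bin/sh
  # Sketch: create the microcode device node only if it is missing
  # (systems where devfs or udev manages /dev normally do not need this).
  if [ ! -c /dev/cpu/microcode ]; then
      mkdir -p /dev/cpu
      mknod /dev/cpu/microcode c 10 184
      chmod 0644 /dev/cpu/microcode
  fi
  # The user-space microcode_ctl utility mentioned above can then load the update.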
+ +Networking +========== + +General changes +--------------- + +If you have advanced network configuration needs, you should probably +consider using the network tools from ip-route2. + +Packet Filter / NAT +------------------- +The packet filtering and NAT code uses the same tools like the previous 2.4.x +kernel series (iptables). It still includes backwards-compatibility modules +for 2.2.x-style ipchains and 2.0.x-style ipfwadm. + +PPP +--- + +The PPP driver has been restructured to support multilink and to +enable it to operate over diverse media layers. If you use PPP, +upgrade pppd to at least 2.4.0. + +If you are not using devfs, you must have the device file /dev/ppp +which can be made by: + +mknod /dev/ppp c 108 0 + +as root. + +If you use devfsd and build ppp support as modules, you will need +the following in your /etc/devfsd.conf file: + +LOOKUP PPP MODLOAD + +Isdn4k-utils +------------ + +Due to changes in the length of the phone number field, isdn4k-utils +needs to be recompiled or (preferably) upgraded. + +NFS-utils +--------- + +In 2.4 and earlier kernels, the nfs server needed to know about any +client that expected to be able to access files via NFS. This +information would be given to the kernel by "mountd" when the client +mounted the filesystem, or by "exportfs" at system startup. exportfs +would take information about active clients from /var/lib/nfs/rmtab. + +This approach is quite fragile as it depends on rmtab being correct +which is not always easy, particularly when trying to implement +fail-over. Even when the system is working well, rmtab suffers from +getting lots of old entries that never get removed. + +With 2.6 we have the option of having the kernel tell mountd when it +gets a request from an unknown host, and mountd can give appropriate +export information to the kernel. This removes the dependency on +rmtab and means that the kernel only needs to know about currently +active clients. + +To enable this new functionality, you need to: + + mount -t nfsd nfsd /proc/fs/nfs + +before running exportfs or mountd. It is recommended that all NFS +services be protected from the internet-at-large by a firewall where +that is possible. 
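To make the start-up ordering described above concrete, here is a minimal sketch of bringing up an NFS server under the 2.6 scheme. The daemon names, thread count, and the assumption that portmap is already running reflect a typical nfs-utils installation rather than anything mandated here; your distribution's init scripts will differ.

  #!/bin/sh
  # Sketch of 2.6 NFS server start-up ordering (illustrative only).
  # Assumes portmap is already running and /etc/exports is populated.

  # Give the kernel its upcall channel to mountd before exporting anything.
  mount -t nfsd nfsd /proc/fs/nfs

  # Load the export table into the kernel.
  exportfs -a

  # Start the kernel nfsd threads and the user-space mount daemon.
  rpc.nfsd 8
  rpc.mountd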
+ +Getting updated software +======================== + +Kernel compilation +****************** + +gcc 2.95.3 +---------- +o <ftp://ftp.gnu.org/gnu/gcc/gcc-2.95.3.tar.gz> + +Make +---- +o <ftp://ftp.gnu.org/gnu/make/> + +Binutils +-------- +o <ftp://ftp.kernel.org/pub/linux/devel/binutils/> + +System utilities +**************** + +Util-linux +---------- +o <ftp://ftp.kernel.org/pub/linux/utils/util-linux/> + +Ksymoops +-------- +o <ftp://ftp.kernel.org/pub/linux/utils/kernel/ksymoops/v2.4/> + +Module-Init-Tools +----------------- +o <ftp://ftp.kernel.org/pub/linux/kernel/people/rusty/modules/> + +Mkinitrd +-------- +o <ftp://rawhide.redhat.com/pub/rawhide/SRPMS/SRPMS/> + +E2fsprogs +--------- +o <http://prdownloads.sourceforge.net/e2fsprogs/e2fsprogs-1.29.tar.gz> + +JFSutils +-------- +o <http://jfs.sourceforge.net/> + +Reiserfsprogs +------------- +o <http://www.namesys.com/pub/reiserfsprogs/reiserfsprogs-3.6.3.tar.gz> + +Xfsprogs +-------- +o <ftp://oss.sgi.com/projects/xfs/download/> + +Pcmcia-cs +--------- +o <ftp://pcmcia-cs.sourceforge.net/pub/pcmcia-cs/pcmcia-cs-3.1.21.tar.gz> + +Quota-tools +---------- +o <http://sourceforge.net/projects/linuxquota/> + +Jade +---- +o <ftp://ftp.jclark.com/pub/jade/jade-1.2.1.tar.gz> + +DocBook Stylesheets +------------------- +o <http://nwalsh.com/docbook/dsssl/> + +Intel P6 microcode +------------------ +o <http://www.urbanmyth.org/microcode/> + +Powertweak +---------- +o <http://powertweak.sourceforge.net/> + +udev +---- +o <http://www.kernel.org/pub/linux/utils/kernel/hotplug/udev.html> + +Networking +********** + +PPP +--- +o <ftp://ftp.samba.org/pub/ppp/ppp-2.4.0.tar.gz> + +Isdn4k-utils +------------ +o <ftp://ftp.isdn4linux.de/pub/isdn4linux/utils/isdn4k-utils.v3.1pre1.tar.gz> + +NFS-utils +--------- +o <http://sourceforge.net/project/showfiles.php?group_id=14> + +Iptables +-------- +o <http://www.iptables.org/downloads.html> + +Ip-route2 +--------- +o <ftp://ftp.tux.org/pub/net/ip-routing/iproute2-2.2.4-now-ss991023.tar.gz> + +OProfile +-------- +o <http://oprofile.sf.net/download/> + +NFS-Utils +--------- +o <http://nfs.sourceforge.net/> + diff --git a/Documentation/CodingStyle b/Documentation/CodingStyle new file mode 100644 index 000000000000..f25b3953f513 --- /dev/null +++ b/Documentation/CodingStyle @@ -0,0 +1,431 @@ + + Linux kernel coding style + +This is a short document describing the preferred coding style for the +linux kernel. Coding style is very personal, and I won't _force_ my +views on anybody, but this is what goes for anything that I have to be +able to maintain, and I'd prefer it for most other things too. Please +at least consider the points made here. + +First off, I'd suggest printing out a copy of the GNU coding standards, +and NOT read it. Burn them, it's a great symbolic gesture. + +Anyway, here goes: + + + Chapter 1: Indentation + +Tabs are 8 characters, and thus indentations are also 8 characters. +There are heretic movements that try to make indentations 4 (or even 2!) +characters deep, and that is akin to trying to define the value of PI to +be 3. + +Rationale: The whole idea behind indentation is to clearly define where +a block of control starts and ends. Especially when you've been looking +at your screen for 20 straight hours, you'll find it a lot easier to see +how the indentation works if you have large indentations. + +Now, some people will claim that having 8-character indentations makes +the code move too far to the right, and makes it hard to read on a +80-character terminal screen. 
The answer to that is that if you need +more than 3 levels of indentation, you're screwed anyway, and should fix +your program. + +In short, 8-char indents make things easier to read, and have the added +benefit of warning you when you're nesting your functions too deep. +Heed that warning. + +Don't put multiple statements on a single line unless you have +something to hide: + + if (condition) do_this; + do_something_everytime; + +Outside of comments, documentation and except in Kconfig, spaces are never +used for indentation, and the above example is deliberately broken. + +Get a decent editor and don't leave whitespace at the end of lines. + + + Chapter 2: Breaking long lines and strings + +Coding style is all about readability and maintainability using commonly +available tools. + +The limit on the length of lines is 80 columns and this is a hard limit. + +Statements longer than 80 columns will be broken into sensible chunks. +Descendants are always substantially shorter than the parent and are placed +substantially to the right. The same applies to function headers with a long +argument list. Long strings are as well broken into shorter strings. + +void fun(int a, int b, int c) +{ + if (condition) + printk(KERN_WARNING "Warning this is a long printk with " + "3 parameters a: %u b: %u " + "c: %u \n", a, b, c); + else + next_statement; +} + + Chapter 3: Placing Braces + +The other issue that always comes up in C styling is the placement of +braces. Unlike the indent size, there are few technical reasons to +choose one placement strategy over the other, but the preferred way, as +shown to us by the prophets Kernighan and Ritchie, is to put the opening +brace last on the line, and put the closing brace first, thusly: + + if (x is true) { + we do y + } + +However, there is one special case, namely functions: they have the +opening brace at the beginning of the next line, thus: + + int function(int x) + { + body of function + } + +Heretic people all over the world have claimed that this inconsistency +is ... well ... inconsistent, but all right-thinking people know that +(a) K&R are _right_ and (b) K&R are right. Besides, functions are +special anyway (you can't nest them in C). + +Note that the closing brace is empty on a line of its own, _except_ in +the cases where it is followed by a continuation of the same statement, +ie a "while" in a do-statement or an "else" in an if-statement, like +this: + + do { + body of do-loop + } while (condition); + +and + + if (x == y) { + .. + } else if (x > y) { + ... + } else { + .... + } + +Rationale: K&R. + +Also, note that this brace-placement also minimizes the number of empty +(or almost empty) lines, without any loss of readability. Thus, as the +supply of new-lines on your screen is not a renewable resource (think +25-line terminal screens here), you have more empty lines to put +comments on. + + + Chapter 4: Naming + +C is a Spartan language, and so should your naming be. Unlike Modula-2 +and Pascal programmers, C programmers do not use cute names like +ThisVariableIsATemporaryCounter. A C programmer would call that +variable "tmp", which is much easier to write, and not the least more +difficult to understand. + +HOWEVER, while mixed-case names are frowned upon, descriptive names for +global variables are a must. To call a global function "foo" is a +shooting offense. + +GLOBAL variables (to be used only if you _really_ need them) need to +have descriptive names, as do global functions. 
If you have a function +that counts the number of active users, you should call that +"count_active_users()" or similar, you should _not_ call it "cntusr()". + +Encoding the type of a function into the name (so-called Hungarian +notation) is brain damaged - the compiler knows the types anyway and can +check those, and it only confuses the programmer. No wonder MicroSoft +makes buggy programs. + +LOCAL variable names should be short, and to the point. If you have +some random integer loop counter, it should probably be called "i". +Calling it "loop_counter" is non-productive, if there is no chance of it +being mis-understood. Similarly, "tmp" can be just about any type of +variable that is used to hold a temporary value. + +If you are afraid to mix up your local variable names, you have another +problem, which is called the function-growth-hormone-imbalance syndrome. +See next chapter. + + + Chapter 5: Functions + +Functions should be short and sweet, and do just one thing. They should +fit on one or two screenfuls of text (the ISO/ANSI screen size is 80x24, +as we all know), and do one thing and do that well. + +The maximum length of a function is inversely proportional to the +complexity and indentation level of that function. So, if you have a +conceptually simple function that is just one long (but simple) +case-statement, where you have to do lots of small things for a lot of +different cases, it's OK to have a longer function. + +However, if you have a complex function, and you suspect that a +less-than-gifted first-year high-school student might not even +understand what the function is all about, you should adhere to the +maximum limits all the more closely. Use helper functions with +descriptive names (you can ask the compiler to in-line them if you think +it's performance-critical, and it will probably do a better job of it +than you would have done). + +Another measure of the function is the number of local variables. They +shouldn't exceed 5-10, or you're doing something wrong. Re-think the +function, and split it into smaller pieces. A human brain can +generally easily keep track of about 7 different things, anything more +and it gets confused. You know you're brilliant, but maybe you'd like +to understand what you did 2 weeks from now. + + + Chapter 6: Centralized exiting of functions + +Albeit deprecated by some people, the equivalent of the goto statement is +used frequently by compilers in form of the unconditional jump instruction. + +The goto statement comes in handy when a function exits from multiple +locations and some common work such as cleanup has to be done. + +The rationale is: + +- unconditional statements are easier to understand and follow +- nesting is reduced +- errors by not updating individual exit points when making + modifications are prevented +- saves the compiler work to optimize redundant code away ;) + +int fun(int ) +{ + int result = 0; + char *buffer = kmalloc(SIZE); + + if (buffer == NULL) + return -ENOMEM; + + if (condition1) { + while (loop1) { + ... + } + result = 1; + goto out; + } + ... +out: + kfree(buffer); + return result; +} + + Chapter 7: Commenting + +Comments are good, but there is also a danger of over-commenting. NEVER +try to explain HOW your code works in a comment: it's much better to +write the code so that the _working_ is obvious, and it's a waste of +time to explain badly written code. + +Generally, you want your comments to tell WHAT your code does, not HOW. 
+Also, try to avoid putting comments inside a function body: if the +function is so complex that you need to separately comment parts of it, +you should probably go back to chapter 5 for a while. You can make +small comments to note or warn about something particularly clever (or +ugly), but try to avoid excess. Instead, put the comments at the head +of the function, telling people what it does, and possibly WHY it does +it. + + + Chapter 8: You've made a mess of it + +That's OK, we all do. You've probably been told by your long-time Unix +user helper that "GNU emacs" automatically formats the C sources for +you, and you've noticed that yes, it does do that, but the defaults it +uses are less than desirable (in fact, they are worse than random +typing - an infinite number of monkeys typing into GNU emacs would never +make a good program). + +So, you can either get rid of GNU emacs, or change it to use saner +values. To do the latter, you can stick the following in your .emacs file: + +(defun linux-c-mode () + "C mode with adjusted defaults for use with the Linux kernel." + (interactive) + (c-mode) + (c-set-style "K&R") + (setq tab-width 8) + (setq indent-tabs-mode t) + (setq c-basic-offset 8)) + +This will define the M-x linux-c-mode command. When hacking on a +module, if you put the string -*- linux-c -*- somewhere on the first +two lines, this mode will be automatically invoked. Also, you may want +to add + +(setq auto-mode-alist (cons '("/usr/src/linux.*/.*\\.[ch]$" . linux-c-mode) + auto-mode-alist)) + +to your .emacs file if you want to have linux-c-mode switched on +automagically when you edit source files under /usr/src/linux. + +But even if you fail in getting emacs to do sane formatting, not +everything is lost: use "indent". + +Now, again, GNU indent has the same brain-dead settings that GNU emacs +has, which is why you need to give it a few command line options. +However, that's not too bad, because even the makers of GNU indent +recognize the authority of K&R (the GNU people aren't evil, they are +just severely misguided in this matter), so you just give indent the +options "-kr -i8" (stands for "K&R, 8 character indents"), or use +"scripts/Lindent", which indents in the latest style. + +"indent" has a lot of options, and especially when it comes to comment +re-formatting you may want to take a look at the man page. But +remember: "indent" is not a fix for bad programming. + + + Chapter 9: Configuration-files + +For configuration options (arch/xxx/Kconfig, and all the Kconfig files), +somewhat different indentation is used. + +Help text is indented with 2 spaces. + +if CONFIG_EXPERIMENTAL + tristate CONFIG_BOOM + default n + help + Apply nitroglycerine inside the keyboard (DANGEROUS) + bool CONFIG_CHEER + depends on CONFIG_BOOM + default y + help + Output nice messages when you explode +endif + +Generally, CONFIG_EXPERIMENTAL should surround all options not considered +stable. All options that are known to trash data (experimental write- +support for file-systems, for instance) should be denoted (DANGEROUS), other +experimental options should be denoted (EXPERIMENTAL). + + + Chapter 10: Data structures + +Data structures that have visibility outside the single-threaded +environment they are created and destroyed in should always have +reference counts. In the kernel, garbage collection doesn't exist (and +outside the kernel garbage collection is slow and inefficient), which +means that you absolutely _have_ to reference count all your uses. 
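In code, this usually amounts to nothing more than an atomic counter embedded in the structure, plus get/put helpers. A minimal sketch, with invented names (struct foo, get_foo, put_foo) and the lookup and locking details left out:

        struct foo {
                atomic_t refcnt;
                /* ... the actual payload ... */
        };

        static inline void get_foo(struct foo *f)
        {
                atomic_inc(&f->refcnt);
        }

        static inline void put_foo(struct foo *f)
        {
                if (atomic_dec_and_test(&f->refcnt))
                        kfree(f);
        }

Taking a reference (get_foo) before using the object and dropping it (put_foo) when done is what lets the last user, whoever that turns out to be, free it safely.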
+ +Reference counting means that you can avoid locking, and allows multiple +users to have access to the data structure in parallel - and not having +to worry about the structure suddenly going away from under them just +because they slept or did something else for a while. + +Note that locking is _not_ a replacement for reference counting. +Locking is used to keep data structures coherent, while reference +counting is a memory management technique. Usually both are needed, and +they are not to be confused with each other. + +Many data structures can indeed have two levels of reference counting, +when there are users of different "classes". The subclass count counts +the number of subclass users, and decrements the global count just once +when the subclass count goes to zero. + +Examples of this kind of "multi-level-reference-counting" can be found in +memory management ("struct mm_struct": mm_users and mm_count), and in +filesystem code ("struct super_block": s_count and s_active). + +Remember: if another thread can find your data structure, and you don't +have a reference count on it, you almost certainly have a bug. + + + Chapter 11: Macros, Enums, Inline functions and RTL + +Names of macros defining constants and labels in enums are capitalized. + +#define CONSTANT 0x12345 + +Enums are preferred when defining several related constants. + +CAPITALIZED macro names are appreciated but macros resembling functions +may be named in lower case. + +Generally, inline functions are preferable to macros resembling functions. + +Macros with multiple statements should be enclosed in a do - while block: + +#define macrofun(a, b, c) \ + do { \ + if (a == 5) \ + do_this(b, c); \ + } while (0) + +Things to avoid when using macros: + +1) macros that affect control flow: + +#define FOO(x) \ + do { \ + if (blah(x) < 0) \ + return -EBUGGERED; \ + } while(0) + +is a _very_ bad idea. It looks like a function call but exits the "calling" +function; don't break the internal parsers of those who will read the code. + +2) macros that depend on having a local variable with a magic name: + +#define FOO(val) bar(index, val) + +might look like a good thing, but it's confusing as hell when one reads the +code and it's prone to breakage from seemingly innocent changes. + +3) macros with arguments that are used as l-values: FOO(x) = y; will +bite you if somebody e.g. turns FOO into an inline function. + +4) forgetting about precedence: macros defining constants using expressions +must enclose the expression in parentheses. Beware of similar issues with +macros using parameters. + +#define CONSTANT 0x4000 +#define CONSTEXP (CONSTANT | 3) + +The cpp manual deals with macros exhaustively. The gcc internals manual also +covers RTL which is used frequently with assembly language in the kernel. + + + Chapter 12: Printing kernel messages + +Kernel developers like to be seen as literate. Do mind the spelling +of kernel messages to make a good impression. Do not use crippled +words like "dont" and use "do not" or "don't" instead. + +Kernel messages do not have to be terminated with a period. + +Printing numbers in parentheses (%d) adds no value and should be avoided. + + + Chapter 13: References + +The C Programming Language, Second Edition +by Brian W. Kernighan and Dennis M. Ritchie. +Prentice Hall, Inc., 1988. +ISBN 0-13-110362-8 (paperback), 0-13-110370-9 (hardback). +URL: http://cm.bell-labs.com/cm/cs/cbook/ + +The Practice of Programming +by Brian W. Kernighan and Rob Pike. +Addison-Wesley, Inc., 1999. 
+ISBN 0-201-61586-X. +URL: http://cm.bell-labs.com/cm/cs/tpop/ + +GNU manuals - where in compliance with K&R and this text - for cpp, gcc, +gcc internals and indent, all available from http://www.gnu.org + +WG14 is the international standardization working group for the programming +language C, URL: http://std.dkuug.dk/JTC1/SC22/WG14/ + +-- +Last updated on 16 February 2004 by a community effort on LKML. diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt new file mode 100644 index 000000000000..6ee3cd6134df --- /dev/null +++ b/Documentation/DMA-API.txt @@ -0,0 +1,526 @@ + Dynamic DMA mapping using the generic device + ============================================ + + James E.J. Bottomley <James.Bottomley@HansenPartnership.com> + +This document describes the DMA API. For a more gentle introduction +phrased in terms of the pci_ equivalents (and actual examples) see +DMA-mapping.txt + +This API is split into two pieces. Part I describes the API and the +corresponding pci_ API. Part II describes the extensions to the API +for supporting non-consistent memory machines. Unless you know that +your driver absolutely has to support non-consistent platforms (this +is usually only legacy platforms) you should only use the API +described in part I. + +Part I - pci_ and dma_ Equivalent API +------------------------------------- + +To get the pci_ API, you must #include <linux/pci.h> +To get the dma_ API, you must #include <linux/dma-mapping.h> + + +Part Ia - Using large dma-coherent buffers +------------------------------------------ + +void * +dma_alloc_coherent(struct device *dev, size_t size, + dma_addr_t *dma_handle, int flag) +void * +pci_alloc_consistent(struct pci_dev *dev, size_t size, + dma_addr_t *dma_handle) + +Consistent memory is memory for which a write by either the device or +the processor can immediately be read by the processor or device +without having to worry about caching effects. + +This routine allocates a region of <size> bytes of consistent memory. +it also returns a <dma_handle> which may be cast to an unsigned +integer the same width as the bus and used as the physical address +base of the region. + +Returns: a pointer to the allocated region (in the processor's virtual +address space) or NULL if the allocation failed. + +Note: consistent memory can be expensive on some platforms, and the +minimum allocation length may be as big as a page, so you should +consolidate your requests for consistent memory as much as possible. +The simplest way to do that is to use the dma_pool calls (see below). + +The flag parameter (dma_alloc_coherent only) allows the caller to +specify the GFP_ flags (see kmalloc) for the allocation (the +implementation may chose to ignore flags that affect the location of +the returned memory, like GFP_DMA). For pci_alloc_consistent, you +must assume GFP_ATOMIC behaviour. + +void +dma_free_coherent(struct device *dev, size_t size, void *cpu_addr + dma_addr_t dma_handle) +void +pci_free_consistent(struct pci_dev *dev, size_t size, void *cpu_addr + dma_addr_t dma_handle) + +Free the region of consistent memory you previously allocated. dev, +size and dma_handle must all be the same as those passed into the +consistent allocate. 
cpu_addr must be the virtual address returned by the consistent allocate.


Part Ib - Using small dma-coherent buffers
------------------------------------------

To get this part of the dma_ API, you must #include <linux/dmapool.h>

Many drivers need lots of small dma-coherent memory regions for DMA
descriptors or I/O buffers.  Rather than allocating in units of a page
or more using dma_alloc_coherent(), you can use DMA pools.  These work
much like a kmem_cache_t, except that they use the dma-coherent
allocator, not __get_free_pages().  Also, they understand common
hardware constraints for alignment, like queue heads needing to be
aligned on N byte boundaries.


        struct dma_pool *
        dma_pool_create(const char *name, struct device *dev,
                        size_t size, size_t align, size_t alloc);

        struct pci_pool *
        pci_pool_create(const char *name, struct pci_dev *dev,
                        size_t size, size_t align, size_t alloc);

The pool create() routines initialize a pool of dma-coherent buffers
for use with a given device.  They must be called in a context which
can sleep.

The "name" is for diagnostics (like a kmem_cache_t name); dev and size
are like what you'd pass to dma_alloc_coherent().  The device's hardware
alignment requirement for this type of data is "align" (which is expressed
in bytes, and must be a power of two).  If your device has no boundary
crossing restrictions, pass 0 for alloc; passing 4096 says memory allocated
from this pool must not cross 4KByte boundaries.


        void *dma_pool_alloc(struct dma_pool *pool, int gfp_flags,
                        dma_addr_t *dma_handle);

        void *pci_pool_alloc(struct pci_pool *pool, int gfp_flags,
                        dma_addr_t *dma_handle);

This allocates memory from the pool; the returned memory will meet the
size and alignment requirements specified at creation time.  Pass
GFP_ATOMIC to prevent blocking, or, if it's permitted (not
in_interrupt, not holding SMP locks), pass GFP_KERNEL to allow
blocking.  Like dma_alloc_coherent(), this returns two values: an
address usable by the cpu, and the dma address usable by the pool's
device.


        void dma_pool_free(struct dma_pool *pool, void *vaddr,
                        dma_addr_t addr);

        void pci_pool_free(struct pci_pool *pool, void *vaddr,
                        dma_addr_t addr);

This puts memory back into the pool.  The pool is what was passed to
the pool allocation routine; the cpu and dma addresses are what were
returned when that routine allocated the memory being freed.


        void dma_pool_destroy(struct dma_pool *pool);

        void pci_pool_destroy(struct pci_pool *pool);

The pool destroy() routines free the resources of the pool.  They must
be called in a context which can sleep.  Make sure you've freed all
allocated memory back to the pool before you destroy it.


Part Ic - DMA addressing limitations
------------------------------------

int
dma_supported(struct device *dev, u64 mask)
int
pci_dma_supported(struct pci_dev *dev, u64 mask)

Checks to see if the device can support DMA to the memory described by
mask.

Returns: 1 if it can and 0 if it can't.

Notes: This routine merely tests to see if the mask is possible.  It
won't change the current mask settings.  It is intended more as an
internal API for use by the platform than as an external API for use
by driver writers.

int
dma_set_mask(struct device *dev, u64 mask)
int
pci_set_dma_mask(struct pci_dev *dev, u64 mask)

Checks to see if the mask is possible and updates the device
parameters if it is.

Returns: 0 if successful and a negative error if not.
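As a rough illustration (not part of the original text; the device name, the mask choice and the error handling are invented), a driver's probe path might use this as follows - the pci_ flavour of the same pattern is shown at length in DMA-mapping.txt:

        static int mydev_probe(struct device *dev)
        {
                /* hypothetical device that can only do 32-bit DMA */
                if (dma_set_mask(dev, DMA_32BIT_MASK)) {
                        printk(KERN_WARNING
                               "mydev: No suitable DMA available.\n");
                        return -EIO;
                }

                /* ... rest of probe ... */
                return 0;
        }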
+ +u64 +dma_get_required_mask(struct device *dev) + +After setting the mask with dma_set_mask(), this API returns the +actual mask (within that already set) that the platform actually +requires to operate efficiently. Usually this means the returned mask +is the minimum required to cover all of memory. Examining the +required mask gives drivers with variable descriptor sizes the +opportunity to use smaller descriptors as necessary. + +Requesting the required mask does not alter the current mask. If you +wish to take advantage of it, you should issue another dma_set_mask() +call to lower the mask again. + + +Part Id - Streaming DMA mappings +-------------------------------- + +dma_addr_t +dma_map_single(struct device *dev, void *cpu_addr, size_t size, + enum dma_data_direction direction) +dma_addr_t +pci_map_single(struct device *dev, void *cpu_addr, size_t size, + int direction) + +Maps a piece of processor virtual memory so it can be accessed by the +device and returns the physical handle of the memory. + +The direction for both api's may be converted freely by casting. +However the dma_ API uses a strongly typed enumerator for its +direction: + +DMA_NONE = PCI_DMA_NONE no direction (used for + debugging) +DMA_TO_DEVICE = PCI_DMA_TODEVICE data is going from the + memory to the device +DMA_FROM_DEVICE = PCI_DMA_FROMDEVICE data is coming from + the device to the + memory +DMA_BIDIRECTIONAL = PCI_DMA_BIDIRECTIONAL direction isn't known + +Notes: Not all memory regions in a machine can be mapped by this +API. Further, regions that appear to be physically contiguous in +kernel virtual space may not be contiguous as physical memory. Since +this API does not provide any scatter/gather capability, it will fail +if the user tries to map a non physically contiguous piece of memory. +For this reason, it is recommended that memory mapped by this API be +obtained only from sources which guarantee to be physically contiguous +(like kmalloc). + +Further, the physical address of the memory must be within the +dma_mask of the device (the dma_mask represents a bit mask of the +addressable region for the device. i.e. if the physical address of +the memory anded with the dma_mask is still equal to the physical +address, then the device can perform DMA to the memory). In order to +ensure that the memory allocated by kmalloc is within the dma_mask, +the driver may specify various platform dependent flags to restrict +the physical memory range of the allocation (e.g. on x86, GFP_DMA +guarantees to be within the first 16Mb of available physical memory, +as required by ISA devices). + +Note also that the above constraints on physical contiguity and +dma_mask may not apply if the platform has an IOMMU (a device which +supplies a physical to virtual mapping between the I/O memory bus and +the device). However, to be portable, device driver writers may *not* +assume that such an IOMMU exists. + +Warnings: Memory coherency operates at a granularity called the cache +line width. In order for memory mapped by this API to operate +correctly, the mapped region must begin exactly on a cache line +boundary and end exactly on one (to prevent two separately mapped +regions from sharing a single cache line). Since the cache line size +may not be known at compile time, the API will not enforce this +requirement. 
Therefore, it is recommended that driver writers who don't take
special care to determine the cache line size at run time only map
virtual regions that begin and end on page boundaries (which are
guaranteed also to be cache line boundaries).

DMA_TO_DEVICE synchronisation must be done after the last modification
of the memory region by the software and before it is handed off to
the device.  Once this primitive is used, memory covered by this
primitive should be treated as read only by the device.  If the device
may write to it at any point, it should be DMA_BIDIRECTIONAL (see
below).

DMA_FROM_DEVICE synchronisation must be done before the driver
accesses data that may be changed by the device.  This memory should
be treated as read only by the driver.  If the driver needs to write
to it at any point, it should be DMA_BIDIRECTIONAL (see below).

DMA_BIDIRECTIONAL requires special handling: it means that the driver
isn't sure if the memory was modified before being handed off to the
device and also isn't sure if the device will modify it.  Thus, you
must always sync bidirectional memory twice: once before the memory is
handed off to the device (to make sure all memory changes are flushed
from the processor) and once before the data may be accessed after
being used by the device (to make sure any processor cache lines are
updated with data that the device may have changed).

void
dma_unmap_single(struct device *dev, dma_addr_t dma_addr, size_t size,
                 enum dma_data_direction direction)
void
pci_unmap_single(struct pci_dev *hwdev, dma_addr_t dma_addr,
                 size_t size, int direction)

Unmaps the region previously mapped.  All the parameters must be
identical to those passed in (and returned) by the mapping API.

dma_addr_t
dma_map_page(struct device *dev, struct page *page,
             unsigned long offset, size_t size,
             enum dma_data_direction direction)
dma_addr_t
pci_map_page(struct pci_dev *hwdev, struct page *page,
             unsigned long offset, size_t size, int direction)
void
dma_unmap_page(struct device *dev, dma_addr_t dma_address, size_t size,
               enum dma_data_direction direction)
void
pci_unmap_page(struct pci_dev *hwdev, dma_addr_t dma_address,
               size_t size, int direction)

API for mapping and unmapping pages.  All the notes and warnings for
the other mapping APIs apply here.  Also, although the <offset> and
<size> parameters are provided to do partial page mapping, it is
recommended that you never use these unless you really know what the
cache width is.

int
dma_mapping_error(dma_addr_t dma_addr)

int
pci_dma_mapping_error(dma_addr_t dma_addr)

In some circumstances dma_map_single and dma_map_page will fail to
create a mapping.  A driver can check for these errors by testing the
returned dma address with dma_mapping_error().  A non-zero return
value means the mapping could not be created and the driver should
take appropriate action (e.g. reduce current DMA mapping usage or
delay and try again later).

int
dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
           enum dma_data_direction direction)
int
pci_map_sg(struct pci_dev *hwdev, struct scatterlist *sg,
           int nents, int direction)

Maps a scatter/gather list from the block layer.

Returns: the number of physical segments mapped (this may be shorter
than <nents> passed in if the block layer determines that some
elements of the scatter/gather list are physically adjacent and thus
may be mapped with a single entry).
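For illustration, a driver might consume the mapped list roughly like this (fill_hw_segment(), hw and the choice of error code are invented for the example):

        int i, count;
        struct scatterlist *sg;

        count = dma_map_sg(dev, sglist, nents, DMA_TO_DEVICE);
        if (count == 0)
                return -ENOMEM;         /* could not map, see below */

        for (i = 0, sg = sglist; i < count; i++, sg++)
                fill_hw_segment(hw, i, sg_dma_address(sg), sg_dma_len(sg));

Note that the loop runs over the returned count, not over nents, and that the hardware is given the dma addresses and lengths via sg_dma_address()/sg_dma_len(), not via the original scatterlist fields.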
+ +Please note that the sg cannot be mapped again if it has been mapped once. +The mapping process is allowed to destroy information in the sg. + +As with the other mapping interfaces, dma_map_sg can fail. When it +does, 0 is returned and a driver must take appropriate action. It is +critical that the driver do something, in the case of a block driver +aborting the request or even oopsing is better than doing nothing and +corrupting the filesystem. + +void +dma_unmap_sg(struct device *dev, struct scatterlist *sg, int nhwentries, + enum dma_data_direction direction) +void +pci_unmap_sg(struct pci_dev *hwdev, struct scatterlist *sg, + int nents, int direction) + +unmap the previously mapped scatter/gather list. All the parameters +must be the same as those and passed in to the scatter/gather mapping +API. + +Note: <nents> must be the number you passed in, *not* the number of +physical entries returned. + +void +dma_sync_single(struct device *dev, dma_addr_t dma_handle, size_t size, + enum dma_data_direction direction) +void +pci_dma_sync_single(struct pci_dev *hwdev, dma_addr_t dma_handle, + size_t size, int direction) +void +dma_sync_sg(struct device *dev, struct scatterlist *sg, int nelems, + enum dma_data_direction direction) +void +pci_dma_sync_sg(struct pci_dev *hwdev, struct scatterlist *sg, + int nelems, int direction) + +synchronise a single contiguous or scatter/gather mapping. All the +parameters must be the same as those passed into the single mapping +API. + +Notes: You must do this: + +- Before reading values that have been written by DMA from the device + (use the DMA_FROM_DEVICE direction) +- After writing values that will be written to the device using DMA + (use the DMA_TO_DEVICE) direction +- before *and* after handing memory to the device if the memory is + DMA_BIDIRECTIONAL + +See also dma_map_single(). + + +Part II - Advanced dma_ usage +----------------------------- + +Warning: These pieces of the DMA API have no PCI equivalent. They +should also not be used in the majority of cases, since they cater for +unlikely corner cases that don't belong in usual drivers. + +If you don't understand how cache line coherency works between a +processor and an I/O device, you should not be using this part of the +API at all. + +void * +dma_alloc_noncoherent(struct device *dev, size_t size, + dma_addr_t *dma_handle, int flag) + +Identical to dma_alloc_coherent() except that the platform will +choose to return either consistent or non-consistent memory as it sees +fit. By using this API, you are guaranteeing to the platform that you +have all the correct and necessary sync points for this memory in the +driver should it choose to return non-consistent memory. + +Note: where the platform can return consistent memory, it will +guarantee that the sync points become nops. + +Warning: Handling non-consistent memory is a real pain. You should +only ever use this API if you positively know your driver will be +required to work on one of the rare (usually non-PCI) architectures +that simply cannot make consistent memory. + +void +dma_free_noncoherent(struct device *dev, size_t size, void *cpu_addr, + dma_addr_t dma_handle) + +free memory allocated by the nonconsistent API. All parameters must +be identical to those passed in (and returned by +dma_alloc_noncoherent()). + +int +dma_is_consistent(dma_addr_t dma_handle) + +returns true if the memory pointed to by the dma_handle is actually +consistent. + +int +dma_get_cache_alignment(void) + +returns the processor cache alignment. 
This is the absolute minimum +alignment *and* width that you must observe when either mapping +memory or doing partial flushes. + +Notes: This API may return a number *larger* than the actual cache +line, but it will guarantee that one or more cache lines fit exactly +into the width returned by this call. It will also always be a power +of two for easy alignment + +void +dma_sync_single_range(struct device *dev, dma_addr_t dma_handle, + unsigned long offset, size_t size, + enum dma_data_direction direction) + +does a partial sync. starting at offset and continuing for size. You +must be careful to observe the cache alignment and width when doing +anything like this. You must also be extra careful about accessing +memory you intend to sync partially. + +void +dma_cache_sync(void *vaddr, size_t size, + enum dma_data_direction direction) + +Do a partial sync of memory that was allocated by +dma_alloc_noncoherent(), starting at virtual address vaddr and +continuing on for size. Again, you *must* observe the cache line +boundaries when doing this. + +int +dma_declare_coherent_memory(struct device *dev, dma_addr_t bus_addr, + dma_addr_t device_addr, size_t size, int + flags) + + +Declare region of memory to be handed out by dma_alloc_coherent when +it's asked for coherent memory for this device. + +bus_addr is the physical address to which the memory is currently +assigned in the bus responding region (this will be used by the +platform to perform the mapping) + +device_addr is the physical address the device needs to be programmed +with actually to address this memory (this will be handed out as the +dma_addr_t in dma_alloc_coherent()) + +size is the size of the area (must be multiples of PAGE_SIZE). + +flags can be or'd together and are + +DMA_MEMORY_MAP - request that the memory returned from +dma_alloc_coherent() be directly writeable. + +DMA_MEMORY_IO - request that the memory returned from +dma_alloc_coherent() be addressable using read/write/memcpy_toio etc. + +One or both of these flags must be present + +DMA_MEMORY_INCLUDES_CHILDREN - make the declared memory be allocated by +dma_alloc_coherent of any child devices of this one (for memory residing +on a bridge). + +DMA_MEMORY_EXCLUSIVE - only allocate memory from the declared regions. +Do not allow dma_alloc_coherent() to fall back to system memory when +it's out of memory in the declared region. + +The return value will be either DMA_MEMORY_MAP or DMA_MEMORY_IO and +must correspond to a passed in flag (i.e. no returning DMA_MEMORY_IO +if only DMA_MEMORY_MAP were passed in) for success or zero for +failure. + +Note, for DMA_MEMORY_IO returns, all subsequent memory returned by +dma_alloc_coherent() may no longer be accessed directly, but instead +must be accessed using the correct bus functions. If your driver +isn't prepared to handle this contingency, it should not specify +DMA_MEMORY_IO in the input flags. + +As a simplification for the platforms, only *one* such region of +memory may be declared per device. + +For reasons of efficiency, most platforms choose to track the declared +region only at the granularity of a page. For smaller allocations, +you should use the dma_pool() API. + +void +dma_release_declared_memory(struct device *dev) + +Remove the memory region previously declared from the system. This +API performs *no* in-use checking for this region and will return +unconditionally having removed all the required structures. It is the +drivers job to ensure that no parts of this memory region are +currently in use. 
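A very rough sketch of how the declare/release pair might fit together, based only on the descriptions above; the 1MB size, the addresses and the device name are all made up:

        /* probe: expose the device's on-board window to dma_alloc_coherent() */
        if (!dma_declare_coherent_memory(dev, bus_addr, device_addr,
                                         0x100000, DMA_MEMORY_MAP))
                printk(KERN_WARNING
                       "mydev: coherent memory declaration failed\n");

        /* ... dma_alloc_coherent()/dma_free_coherent() against dev as usual ... */

        /* remove: only after making sure nothing is still allocated */
        dma_release_declared_memory(dev);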
+ +void * +dma_mark_declared_memory_occupied(struct device *dev, + dma_addr_t device_addr, size_t size) + +This is used to occupy specific regions of the declared space +(dma_alloc_coherent() will hand out the first free region it finds). + +device_addr is the *device* address of the region requested + +size is the size (and should be a page sized multiple). + +The return value will be either a pointer to the processor virtual +address of the memory, or an error (via PTR_ERR()) if any part of the +region is occupied. + + diff --git a/Documentation/DMA-mapping.txt b/Documentation/DMA-mapping.txt new file mode 100644 index 000000000000..f4ac37f157ea --- /dev/null +++ b/Documentation/DMA-mapping.txt @@ -0,0 +1,881 @@ + Dynamic DMA mapping + =================== + + David S. Miller <davem@redhat.com> + Richard Henderson <rth@cygnus.com> + Jakub Jelinek <jakub@redhat.com> + +This document describes the DMA mapping system in terms of the pci_ +API. For a similar API that works for generic devices, see +DMA-API.txt. + +Most of the 64bit platforms have special hardware that translates bus +addresses (DMA addresses) into physical addresses. This is similar to +how page tables and/or a TLB translates virtual addresses to physical +addresses on a CPU. This is needed so that e.g. PCI devices can +access with a Single Address Cycle (32bit DMA address) any page in the +64bit physical address space. Previously in Linux those 64bit +platforms had to set artificial limits on the maximum RAM size in the +system, so that the virt_to_bus() static scheme works (the DMA address +translation tables were simply filled on bootup to map each bus +address to the physical page __pa(bus_to_virt())). + +So that Linux can use the dynamic DMA mapping, it needs some help from the +drivers, namely it has to take into account that DMA addresses should be +mapped only for the time they are actually used and unmapped after the DMA +transfer. + +The following API will work of course even on platforms where no such +hardware exists, see e.g. include/asm-i386/pci.h for how it is implemented on +top of the virt_to_bus interface. + +First of all, you should make sure + +#include <linux/pci.h> + +is in your driver. This file will obtain for you the definition of the +dma_addr_t (which can hold any valid DMA address for the platform) +type which should be used everywhere you hold a DMA (bus) address +returned from the DMA mapping functions. + + What memory is DMA'able? + +The first piece of information you must know is what kernel memory can +be used with the DMA mapping facilities. There has been an unwritten +set of rules regarding this, and this text is an attempt to finally +write them down. + +If you acquired your memory via the page allocator +(i.e. __get_free_page*()) or the generic memory allocators +(i.e. kmalloc() or kmem_cache_alloc()) then you may DMA to/from +that memory using the addresses returned from those routines. + +This means specifically that you may _not_ use the memory/addresses +returned from vmalloc() for DMA. It is possible to DMA to the +_underlying_ memory mapped into a vmalloc() area, but this requires +walking page tables to get the physical addresses, and then +translating each of those pages back to a kernel address using +something like __va(). [ EDIT: Update this when we integrate +Gerd Knorr's generic code which does this. ] + +This rule also means that you may not use kernel image addresses +(ie. 
items in the kernel's data/text/bss segment, or your driver's) +nor may you use kernel stack addresses for DMA. Both of these items +might be mapped somewhere entirely different than the rest of physical +memory. + +Also, this means that you cannot take the return of a kmap() +call and DMA to/from that. This is similar to vmalloc(). + +What about block I/O and networking buffers? The block I/O and +networking subsystems make sure that the buffers they use are valid +for you to DMA from/to. + + DMA addressing limitations + +Does your device have any DMA addressing limitations? For example, is +your device only capable of driving the low order 24-bits of address +on the PCI bus for SAC DMA transfers? If so, you need to inform the +PCI layer of this fact. + +By default, the kernel assumes that your device can address the full +32-bits in a SAC cycle. For a 64-bit DAC capable device, this needs +to be increased. And for a device with limitations, as discussed in +the previous paragraph, it needs to be decreased. + +pci_alloc_consistent() by default will return 32-bit DMA addresses. +PCI-X specification requires PCI-X devices to support 64-bit +addressing (DAC) for all transactions. And at least one platform (SGI +SN2) requires 64-bit consistent allocations to operate correctly when +the IO bus is in PCI-X mode. Therefore, like with pci_set_dma_mask(), +it's good practice to call pci_set_consistent_dma_mask() to set the +appropriate mask even if your device only supports 32-bit DMA +(default) and especially if it's a PCI-X device. + +For correct operation, you must interrogate the PCI layer in your +device probe routine to see if the PCI controller on the machine can +properly support the DMA addressing limitation your device has. It is +good style to do this even if your device holds the default setting, +because this shows that you did think about these issues wrt. your +device. + +The query is performed via a call to pci_set_dma_mask(): + + int pci_set_dma_mask(struct pci_dev *pdev, u64 device_mask); + +The query for consistent allocations is performed via a a call to +pci_set_consistent_dma_mask(): + + int pci_set_consistent_dma_mask(struct pci_dev *pdev, u64 device_mask); + +Here, pdev is a pointer to the PCI device struct of your device, and +device_mask is a bit mask describing which bits of a PCI address your +device supports. It returns zero if your card can perform DMA +properly on the machine given the address mask you provided. + +If it returns non-zero, your device can not perform DMA properly on +this platform, and attempting to do so will result in undefined +behavior. You must either use a different mask, or not use DMA. + +This means that in the failure case, you have three options: + +1) Use another DMA mask, if possible (see below). +2) Use some non-DMA mode for data transfer, if possible. +3) Ignore this device and do not initialize it. + +It is recommended that your driver print a kernel KERN_WARNING message +when you end up performing either #2 or #3. In this manner, if a user +of your driver reports that performance is bad or that the device is not +even detected, you can ask them for the kernel messages to find out +exactly why. + +The standard 32-bit addressing PCI device would do something like +this: + + if (pci_set_dma_mask(pdev, DMA_32BIT_MASK)) { + printk(KERN_WARNING + "mydev: No suitable DMA available.\n"); + goto ignore_this_device; + } + +Another common scenario is a 64-bit capable device. 
The approach +here is to try for 64-bit DAC addressing, but back down to a +32-bit mask should that fail. The PCI platform code may fail the +64-bit mask not because the platform is not capable of 64-bit +addressing. Rather, it may fail in this case simply because +32-bit SAC addressing is done more efficiently than DAC addressing. +Sparc64 is one platform which behaves in this way. + +Here is how you would handle a 64-bit capable device which can drive +all 64-bits when accessing streaming DMA: + + int using_dac; + + if (!pci_set_dma_mask(pdev, DMA_64BIT_MASK)) { + using_dac = 1; + } else if (!pci_set_dma_mask(pdev, DMA_32BIT_MASK)) { + using_dac = 0; + } else { + printk(KERN_WARNING + "mydev: No suitable DMA available.\n"); + goto ignore_this_device; + } + +If a card is capable of using 64-bit consistent allocations as well, +the case would look like this: + + int using_dac, consistent_using_dac; + + if (!pci_set_dma_mask(pdev, DMA_64BIT_MASK)) { + using_dac = 1; + consistent_using_dac = 1; + pci_set_consistent_dma_mask(pdev, DMA_64BIT_MASK); + } else if (!pci_set_dma_mask(pdev, DMA_32BIT_MASK)) { + using_dac = 0; + consistent_using_dac = 0; + pci_set_consistent_dma_mask(pdev, DMA_32BIT_MASK); + } else { + printk(KERN_WARNING + "mydev: No suitable DMA available.\n"); + goto ignore_this_device; + } + +pci_set_consistent_dma_mask() will always be able to set the same or a +smaller mask as pci_set_dma_mask(). However for the rare case that a +device driver only uses consistent allocations, one would have to +check the return value from pci_set_consistent_dma_mask(). + +If your 64-bit device is going to be an enormous consumer of DMA +mappings, this can be problematic since the DMA mappings are a +finite resource on many platforms. Please see the "DAC Addressing +for Address Space Hungry Devices" section near the end of this +document for how to handle this case. + +Finally, if your device can only drive the low 24-bits of +address during PCI bus mastering you might do something like: + + if (pci_set_dma_mask(pdev, 0x00ffffff)) { + printk(KERN_WARNING + "mydev: 24-bit DMA addressing not available.\n"); + goto ignore_this_device; + } + +When pci_set_dma_mask() is successful, and returns zero, the PCI layer +saves away this mask you have provided. The PCI layer will use this +information later when you make DMA mappings. + +There is a case which we are aware of at this time, which is worth +mentioning in this documentation. If your device supports multiple +functions (for example a sound card provides playback and record +functions) and the various different functions have _different_ +DMA addressing limitations, you may wish to probe each mask and +only provide the functionality which the machine can handle. It +is important that the last call to pci_set_dma_mask() be for the +most specific mask. + +Here is pseudo-code showing how this might be done: + + #define PLAYBACK_ADDRESS_BITS DMA_32BIT_MASK + #define RECORD_ADDRESS_BITS 0x00ffffff + + struct my_sound_card *card; + struct pci_dev *pdev; + + ... 
+ if (!pci_set_dma_mask(pdev, PLAYBACK_ADDRESS_BITS)) { + card->playback_enabled = 1; + } else { + card->playback_enabled = 0; + printk(KERN_WARN "%s: Playback disabled due to DMA limitations.\n", + card->name); + } + if (!pci_set_dma_mask(pdev, RECORD_ADDRESS_BITS)) { + card->record_enabled = 1; + } else { + card->record_enabled = 0; + printk(KERN_WARN "%s: Record disabled due to DMA limitations.\n", + card->name); + } + +A sound card was used as an example here because this genre of PCI +devices seems to be littered with ISA chips given a PCI front end, +and thus retaining the 16MB DMA addressing limitations of ISA. + + Types of DMA mappings + +There are two types of DMA mappings: + +- Consistent DMA mappings which are usually mapped at driver + initialization, unmapped at the end and for which the hardware should + guarantee that the device and the CPU can access the data + in parallel and will see updates made by each other without any + explicit software flushing. + + Think of "consistent" as "synchronous" or "coherent". + + The current default is to return consistent memory in the low 32 + bits of the PCI bus space. However, for future compatibility you + should set the consistent mask even if this default is fine for your + driver. + + Good examples of what to use consistent mappings for are: + + - Network card DMA ring descriptors. + - SCSI adapter mailbox command data structures. + - Device firmware microcode executed out of + main memory. + + The invariant these examples all require is that any CPU store + to memory is immediately visible to the device, and vice + versa. Consistent mappings guarantee this. + + IMPORTANT: Consistent DMA memory does not preclude the usage of + proper memory barriers. The CPU may reorder stores to + consistent memory just as it may normal memory. Example: + if it is important for the device to see the first word + of a descriptor updated before the second, you must do + something like: + + desc->word0 = address; + wmb(); + desc->word1 = DESC_VALID; + + in order to get correct behavior on all platforms. + +- Streaming DMA mappings which are usually mapped for one DMA transfer, + unmapped right after it (unless you use pci_dma_sync_* below) and for which + hardware can optimize for sequential accesses. + + This of "streaming" as "asynchronous" or "outside the coherency + domain". + + Good examples of what to use streaming mappings for are: + + - Networking buffers transmitted/received by a device. + - Filesystem buffers written/read by a SCSI device. + + The interfaces for using this type of mapping were designed in + such a way that an implementation can make whatever performance + optimizations the hardware allows. To this end, when using + such mappings you must be explicit about what you want to happen. + +Neither type of DMA mapping has alignment restrictions that come +from PCI, although some devices may have such restrictions. + + Using Consistent DMA mappings. + +To allocate and map large (PAGE_SIZE or so) consistent DMA regions, +you should do: + + dma_addr_t dma_handle; + + cpu_addr = pci_alloc_consistent(dev, size, &dma_handle); + +where dev is a struct pci_dev *. You should pass NULL for PCI like buses +where devices don't have struct pci_dev (like ISA, EISA). This may be +called in interrupt context. + +This argument is needed because the DMA translations may be bus +specific (and often is private to the bus which the device is attached +to). + +Size is the length of the region you want to allocate, in bytes. 
+ +This routine will allocate RAM for that region, so it acts similarly to +__get_free_pages (but takes size instead of a page order). If your +driver needs regions sized smaller than a page, you may prefer using +the pci_pool interface, described below. + +The consistent DMA mapping interfaces, for non-NULL dev, will by +default return a DMA address which is SAC (Single Address Cycle) +addressable. Even if the device indicates (via PCI dma mask) that it +may address the upper 32-bits and thus perform DAC cycles, consistent +allocation will only return > 32-bit PCI addresses for DMA if the +consistent dma mask has been explicitly changed via +pci_set_consistent_dma_mask(). This is true of the pci_pool interface +as well. + +pci_alloc_consistent returns two values: the virtual address which you +can use to access it from the CPU and dma_handle which you pass to the +card. + +The cpu return address and the DMA bus master address are both +guaranteed to be aligned to the smallest PAGE_SIZE order which +is greater than or equal to the requested size. This invariant +exists (for example) to guarantee that if you allocate a chunk +which is smaller than or equal to 64 kilobytes, the extent of the +buffer you receive will not cross a 64K boundary. + +To unmap and free such a DMA region, you call: + + pci_free_consistent(dev, size, cpu_addr, dma_handle); + +where dev, size are the same as in the above call and cpu_addr and +dma_handle are the values pci_alloc_consistent returned to you. +This function may not be called in interrupt context. + +If your driver needs lots of smaller memory regions, you can write +custom code to subdivide pages returned by pci_alloc_consistent, +or you can use the pci_pool API to do that. A pci_pool is like +a kmem_cache, but it uses pci_alloc_consistent not __get_free_pages. +Also, it understands common hardware constraints for alignment, +like queue heads needing to be aligned on N byte boundaries. + +Create a pci_pool like this: + + struct pci_pool *pool; + + pool = pci_pool_create(name, dev, size, align, alloc); + +The "name" is for diagnostics (like a kmem_cache name); dev and size +are as above. The device's hardware alignment requirement for this +type of data is "align" (which is expressed in bytes, and must be a +power of two). If your device has no boundary crossing restrictions, +pass 0 for alloc; passing 4096 says memory allocated from this pool +must not cross 4KByte boundaries (but at that time it may be better to +go for pci_alloc_consistent directly instead). + +Allocate memory from a pci pool like this: + + cpu_addr = pci_pool_alloc(pool, flags, &dma_handle); + +flags are SLAB_KERNEL if blocking is permitted (not in_interrupt nor +holding SMP locks), SLAB_ATOMIC otherwise. Like pci_alloc_consistent, +this returns two values, cpu_addr and dma_handle. + +Free memory that was allocated from a pci_pool like this: + + pci_pool_free(pool, cpu_addr, dma_handle); + +where pool is what you passed to pci_pool_alloc, and cpu_addr and +dma_handle are the values pci_pool_alloc returned. This function +may be called in interrupt context. + +Destroy a pci_pool by calling: + + pci_pool_destroy(pool); + +Make sure you've called pci_pool_free for all memory allocated +from a pool before you destroy the pool. This function may not +be called in interrupt context. 
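Putting the pci_pool calls above together, setup and teardown might look roughly like this; the pool name, the 64-byte/16-byte numbers and the error handling are purely illustrative:

        struct pci_pool *pool;
        void *desc;
        dma_addr_t desc_dma;

        pool = pci_pool_create("mydev_desc", pdev, 64, 16, 0);
        if (pool == NULL)
                return -ENOMEM;

        desc = pci_pool_alloc(pool, SLAB_KERNEL, &desc_dma);
        if (desc == NULL) {
                pci_pool_destroy(pool);
                return -ENOMEM;
        }

        /* hand desc_dma to the card, access desc from the CPU ... */

        pci_pool_free(pool, desc, desc_dma);
        pci_pool_destroy(pool);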
+ + DMA Direction + +The interfaces described in subsequent portions of this document +take a DMA direction argument, which is an integer and takes on +one of the following values: + + PCI_DMA_BIDIRECTIONAL + PCI_DMA_TODEVICE + PCI_DMA_FROMDEVICE + PCI_DMA_NONE + +One should provide the exact DMA direction if you know it. + +PCI_DMA_TODEVICE means "from main memory to the PCI device" +PCI_DMA_FROMDEVICE means "from the PCI device to main memory" +It is the direction in which the data moves during the DMA +transfer. + +You are _strongly_ encouraged to specify this as precisely +as you possibly can. + +If you absolutely cannot know the direction of the DMA transfer, +specify PCI_DMA_BIDIRECTIONAL. It means that the DMA can go in +either direction. The platform guarantees that you may legally +specify this, and that it will work, but this may be at the +cost of performance for example. + +The value PCI_DMA_NONE is to be used for debugging. One can +hold this in a data structure before you come to know the +precise direction, and this will help catch cases where your +direction tracking logic has failed to set things up properly. + +Another advantage of specifying this value precisely (outside of +potential platform-specific optimizations of such) is for debugging. +Some platforms actually have a write permission boolean which DMA +mappings can be marked with, much like page protections in the user +program address space. Such platforms can and do report errors in the +kernel logs when the PCI controller hardware detects violation of the +permission setting. + +Only streaming mappings specify a direction, consistent mappings +implicitly have a direction attribute setting of +PCI_DMA_BIDIRECTIONAL. + +The SCSI subsystem provides mechanisms for you to easily obtain +the direction to use, in the SCSI command: + + scsi_to_pci_dma_dir(SCSI_DIRECTION) + +Where SCSI_DIRECTION is obtained from the 'sc_data_direction' +member of the SCSI command your driver is working on. The +mentioned interface above returns a value suitable for passing +into the streaming DMA mapping interfaces below. + +For Networking drivers, it's a rather simple affair. For transmit +packets, map/unmap them with the PCI_DMA_TODEVICE direction +specifier. For receive packets, just the opposite, map/unmap them +with the PCI_DMA_FROMDEVICE direction specifier. + + Using Streaming DMA mappings + +The streaming DMA mapping routines can be called from interrupt +context. There are two versions of each map/unmap, one which will +map/unmap a single memory region, and one which will map/unmap a +scatterlist. + +To map a single region, you do: + + struct pci_dev *pdev = mydev->pdev; + dma_addr_t dma_handle; + void *addr = buffer->ptr; + size_t size = buffer->len; + + dma_handle = pci_map_single(dev, addr, size, direction); + +and to unmap it: + + pci_unmap_single(dev, dma_handle, size, direction); + +You should call pci_unmap_single when the DMA activity is finished, e.g. +from the interrupt which told you that the DMA transfer is done. + +Using cpu pointers like this for single mappings has a disadvantage, +you cannot reference HIGHMEM memory in this way. Thus, there is a +map/unmap interface pair akin to pci_{map,unmap}_single. These +interfaces deal with page/offset pairs instead of cpu pointers. 
+Specifically: + + struct pci_dev *pdev = mydev->pdev; + dma_addr_t dma_handle; + struct page *page = buffer->page; + unsigned long offset = buffer->offset; + size_t size = buffer->len; + + dma_handle = pci_map_page(dev, page, offset, size, direction); + + ... + + pci_unmap_page(dev, dma_handle, size, direction); + +Here, "offset" means byte offset within the given page. + +With scatterlists, you map a region gathered from several regions by: + + int i, count = pci_map_sg(dev, sglist, nents, direction); + struct scatterlist *sg; + + for (i = 0, sg = sglist; i < count; i++, sg++) { + hw_address[i] = sg_dma_address(sg); + hw_len[i] = sg_dma_len(sg); + } + +where nents is the number of entries in the sglist. + +The implementation is free to merge several consecutive sglist entries +into one (e.g. if DMA mapping is done with PAGE_SIZE granularity, any +consecutive sglist entries can be merged into one provided the first one +ends and the second one starts on a page boundary - in fact this is a huge +advantage for cards which either cannot do scatter-gather or have very +limited number of scatter-gather entries) and returns the actual number +of sg entries it mapped them to. On failure 0 is returned. + +Then you should loop count times (note: this can be less than nents times) +and use sg_dma_address() and sg_dma_len() macros where you previously +accessed sg->address and sg->length as shown above. + +To unmap a scatterlist, just call: + + pci_unmap_sg(dev, sglist, nents, direction); + +Again, make sure DMA activity has already finished. + +PLEASE NOTE: The 'nents' argument to the pci_unmap_sg call must be + the _same_ one you passed into the pci_map_sg call, + it should _NOT_ be the 'count' value _returned_ from the + pci_map_sg call. + +Every pci_map_{single,sg} call should have its pci_unmap_{single,sg} +counterpart, because the bus address space is a shared resource (although +in some ports the mapping is per each BUS so less devices contend for the +same bus address space) and you could render the machine unusable by eating +all bus addresses. + +If you need to use the same streaming DMA region multiple times and touch +the data in between the DMA transfers, the buffer needs to be synced +properly in order for the cpu and device to see the most uptodate and +correct copy of the DMA buffer. + +So, firstly, just map it with pci_map_{single,sg}, and after each DMA +transfer call either: + + pci_dma_sync_single_for_cpu(dev, dma_handle, size, direction); + +or: + + pci_dma_sync_sg_for_cpu(dev, sglist, nents, direction); + +as appropriate. + +Then, if you wish to let the device get at the DMA area again, +finish accessing the data with the cpu, and then before actually +giving the buffer to the hardware call either: + + pci_dma_sync_single_for_device(dev, dma_handle, size, direction); + +or: + + pci_dma_sync_sg_for_device(dev, sglist, nents, direction); + +as appropriate. + +After the last DMA transfer call one of the DMA unmap routines +pci_unmap_{single,sg}. If you don't touch the data from the first pci_map_* +call till pci_unmap_*, then you don't have to call the pci_dma_sync_* +routines at all. + +Here is pseudo code which shows a situation in which you would need +to use the pci_dma_sync_*() interfaces. + + my_card_setup_receive_buffer(struct my_card *cp, char *buffer, int len) + { + dma_addr_t mapping; + + mapping = pci_map_single(cp->pdev, buffer, len, PCI_DMA_FROMDEVICE); + + cp->rx_buf = buffer; + cp->rx_len = len; + cp->rx_dma = mapping; + + give_rx_buf_to_card(cp); + } + + ... 
+ + my_card_interrupt_handler(int irq, void *devid, struct pt_regs *regs) + { + struct my_card *cp = devid; + + ... + if (read_card_status(cp) == RX_BUF_TRANSFERRED) { + struct my_card_header *hp; + + /* Examine the header to see if we wish + * to accept the data. But synchronize + * the DMA transfer with the CPU first + * so that we see updated contents. + */ + pci_dma_sync_single_for_cpu(cp->pdev, cp->rx_dma, + cp->rx_len, + PCI_DMA_FROMDEVICE); + + /* Now it is safe to examine the buffer. */ + hp = (struct my_card_header *) cp->rx_buf; + if (header_is_ok(hp)) { + pci_unmap_single(cp->pdev, cp->rx_dma, cp->rx_len, + PCI_DMA_FROMDEVICE); + pass_to_upper_layers(cp->rx_buf); + make_and_setup_new_rx_buf(cp); + } else { + /* Just sync the buffer and give it back + * to the card. + */ + pci_dma_sync_single_for_device(cp->pdev, + cp->rx_dma, + cp->rx_len, + PCI_DMA_FROMDEVICE); + give_rx_buf_to_card(cp); + } + } + } + +Drivers converted fully to this interface should not use virt_to_bus any +longer, nor should they use bus_to_virt. Some drivers have to be changed a +little bit, because there is no longer an equivalent to bus_to_virt in the +dynamic DMA mapping scheme - you have to always store the DMA addresses +returned by the pci_alloc_consistent, pci_pool_alloc, and pci_map_single +calls (pci_map_sg stores them in the scatterlist itself if the platform +supports dynamic DMA mapping in hardware) in your driver structures and/or +in the card registers. + +All PCI drivers should be using these interfaces with no exceptions. +It is planned to completely remove virt_to_bus() and bus_to_virt() as +they are entirely deprecated. Some ports already do not provide these +as it is impossible to correctly support them. + + 64-bit DMA and DAC cycle support + +Do you understand all of the text above? Great, then you already +know how to use 64-bit DMA addressing under Linux. Simply make +the appropriate pci_set_dma_mask() calls based upon your cards +capabilities, then use the mapping APIs above. + +It is that simple. + +Well, not for some odd devices. See the next section for information +about that. + + DAC Addressing for Address Space Hungry Devices + +There exists a class of devices which do not mesh well with the PCI +DMA mapping API. By definition these "mappings" are a finite +resource. The number of total available mappings per bus is platform +specific, but there will always be a reasonable amount. + +What is "reasonable"? Reasonable means that networking and block I/O +devices need not worry about using too many mappings. + +As an example of a problematic device, consider compute cluster cards. +They can potentially need to access gigabytes of memory at once via +DMA. Dynamic mappings are unsuitable for this kind of access pattern. + +To this end we've provided a small API by which a device driver +may use DAC cycles to directly address all of physical memory. +Not all platforms support this, but most do. It is easy to determine +whether the platform will work properly at probe time. + +First, understand that there may be a SEVERE performance penalty for +using these interfaces on some platforms. Therefore, you MUST only +use these interfaces if it is absolutely required. %99 of devices can +use the normal APIs without any problems. + +Note that for streaming type mappings you must either use these +interfaces, or the dynamic mapping interfaces above. You may not mix +usage of both for the same device. Such an act is illegal and is +guaranteed to put a banana in your tailpipe. 
+ +However, consistent mappings may in fact be used in conjunction with +these interfaces. Remember that, as defined, consistent mappings are +always going to be SAC addressable. + +The first thing your driver needs to do is query the PCI platform +layer with your devices DAC addressing capabilities: + + int pci_dac_set_dma_mask(struct pci_dev *pdev, u64 mask); + +This routine behaves identically to pci_set_dma_mask. You may not +use the following interfaces if this routine fails. + +Next, DMA addresses using this API are kept track of using the +dma64_addr_t type. It is guaranteed to be big enough to hold any +DAC address the platform layer will give to you from the following +routines. If you have consistent mappings as well, you still +use plain dma_addr_t to keep track of those. + +All mappings obtained here will be direct. The mappings are not +translated, and this is the purpose of this dialect of the DMA API. + +All routines work with page/offset pairs. This is the _ONLY_ way to +portably refer to any piece of memory. If you have a cpu pointer +(which may be validly DMA'd too) you may easily obtain the page +and offset using something like this: + + struct page *page = virt_to_page(ptr); + unsigned long offset = offset_in_page(ptr); + +Here are the interfaces: + + dma64_addr_t pci_dac_page_to_dma(struct pci_dev *pdev, + struct page *page, + unsigned long offset, + int direction); + +The DAC address for the tuple PAGE/OFFSET are returned. The direction +argument is the same as for pci_{map,unmap}_single(). The same rules +for cpu/device access apply here as for the streaming mapping +interfaces. To reiterate: + + The cpu may touch the buffer before pci_dac_page_to_dma. + The device may touch the buffer after pci_dac_page_to_dma + is made, but the cpu may NOT. + +When the DMA transfer is complete, invoke: + + void pci_dac_dma_sync_single_for_cpu(struct pci_dev *pdev, + dma64_addr_t dma_addr, + size_t len, int direction); + +This must be done before the CPU looks at the buffer again. +This interface behaves identically to pci_dma_sync_{single,sg}_for_cpu(). + +And likewise, if you wish to let the device get back at the buffer after +the cpu has read/written it, invoke: + + void pci_dac_dma_sync_single_for_device(struct pci_dev *pdev, + dma64_addr_t dma_addr, + size_t len, int direction); + +before letting the device access the DMA area again. + +If you need to get back to the PAGE/OFFSET tuple from a dma64_addr_t +the following interfaces are provided: + + struct page *pci_dac_dma_to_page(struct pci_dev *pdev, + dma64_addr_t dma_addr); + unsigned long pci_dac_dma_to_offset(struct pci_dev *pdev, + dma64_addr_t dma_addr); + +This is possible with the DAC interfaces purely because they are +not translated in any way. + + Optimizing Unmap State Space Consumption + +On many platforms, pci_unmap_{single,page}() is simply a nop. +Therefore, keeping track of the mapping address and length is a waste +of space. Instead of filling your drivers up with ifdefs and the like +to "work around" this (which would defeat the whole purpose of a +portable API) the following facilities are provided. + +Actually, instead of describing the macros one by one, we'll +transform some example code. + +1) Use DECLARE_PCI_UNMAP_{ADDR,LEN} in state saving structures. 
+ Example, before: + + struct ring_state { + struct sk_buff *skb; + dma_addr_t mapping; + __u32 len; + }; + + after: + + struct ring_state { + struct sk_buff *skb; + DECLARE_PCI_UNMAP_ADDR(mapping) + DECLARE_PCI_UNMAP_LEN(len) + }; + + NOTE: DO NOT put a semicolon at the end of the DECLARE_*() + macro. + +2) Use pci_unmap_{addr,len}_set to set these values. + Example, before: + + ringp->mapping = FOO; + ringp->len = BAR; + + after: + + pci_unmap_addr_set(ringp, mapping, FOO); + pci_unmap_len_set(ringp, len, BAR); + +3) Use pci_unmap_{addr,len} to access these values. + Example, before: + + pci_unmap_single(pdev, ringp->mapping, ringp->len, + PCI_DMA_FROMDEVICE); + + after: + + pci_unmap_single(pdev, + pci_unmap_addr(ringp, mapping), + pci_unmap_len(ringp, len), + PCI_DMA_FROMDEVICE); + +It really should be self-explanatory. We treat the ADDR and LEN +separately, because it is possible for an implementation to only +need the address in order to perform the unmap operation. + + Platform Issues + +If you are just writing drivers for Linux and do not maintain +an architecture port for the kernel, you can safely skip down +to "Closing". + +1) Struct scatterlist requirements. + + Struct scatterlist must contain, at a minimum, the following + members: + + struct page *page; + unsigned int offset; + unsigned int length; + + The base address is specified by a "page+offset" pair. + + Previous versions of struct scatterlist contained a "void *address" + field that was sometimes used instead of page+offset. As of Linux + 2.5., page+offset is always used, and the "address" field has been + deleted. + +2) More to come... + + Handling Errors + +DMA address space is limited on some architectures and an allocation +failure can be determined by: + +- checking if pci_alloc_consistent returns NULL or pci_map_sg returns 0 + +- checking the returned dma_addr_t of pci_map_single and pci_map_page + by using pci_dma_mapping_error(): + + dma_addr_t dma_handle; + + dma_handle = pci_map_single(dev, addr, size, direction); + if (pci_dma_mapping_error(dma_handle)) { + /* + * reduce current DMA mapping usage, + * delay and try again later or + * reset driver. + */ + } + + Closing + +This document, and the API itself, would not be in it's current +form without the feedback and suggestions from numerous individuals. +We would like to specifically mention, in no particular order, the +following people: + + Russell King <rmk@arm.linux.org.uk> + Leo Dagum <dagum@barrel.engr.sgi.com> + Ralf Baechle <ralf@oss.sgi.com> + Grant Grundler <grundler@cup.hp.com> + Jay Estabrook <Jay.Estabrook@compaq.com> + Thomas Sailer <sailer@ife.ee.ethz.ch> + Andrea Arcangeli <andrea@suse.de> + Jens Axboe <axboe@suse.de> + David Mosberger-Tang <davidm@hpl.hp.com> diff --git a/Documentation/DocBook/Makefile b/Documentation/DocBook/Makefile new file mode 100644 index 000000000000..a221039ee4c9 --- /dev/null +++ b/Documentation/DocBook/Makefile @@ -0,0 +1,195 @@ +### +# This makefile is used to generate the kernel documentation, +# primarily based on in-line comments in various source files. +# See Documentation/kernel-doc-nano-HOWTO.txt for instruction in how +# to ducument the SRC - and how to read it. +# To add a new book the only step required is to add the book to the +# list of DOCBOOKS. 
+ +DOCBOOKS := wanbook.xml z8530book.xml mcabook.xml videobook.xml \ + kernel-hacking.xml kernel-locking.xml via-audio.xml \ + deviceiobook.xml procfs-guide.xml tulip-user.xml \ + writing_usb_driver.xml scsidrivers.xml sis900.xml \ + kernel-api.xml journal-api.xml lsm.xml usb.xml \ + gadget.xml libata.xml mtdnand.xml librs.xml + +### +# The build process is as follows (targets): +# (xmldocs) +# file.tmpl --> file.xml +--> file.ps (psdocs) +# +--> file.pdf (pdfdocs) +# +--> DIR=file (htmldocs) +# +--> man/ (mandocs) + +### +# The targets that may be used. +.PHONY: xmldocs sgmldocs psdocs pdfdocs htmldocs mandocs installmandocs + +BOOKS := $(addprefix $(obj)/,$(DOCBOOKS)) +xmldocs: $(BOOKS) +sgmldocs: xmldocs + +PS := $(patsubst %.xml, %.ps, $(BOOKS)) +psdocs: $(PS) + +PDF := $(patsubst %.xml, %.pdf, $(BOOKS)) +pdfdocs: $(PDF) + +HTML := $(patsubst %.xml, %.html, $(BOOKS)) +htmldocs: $(HTML) + +MAN := $(patsubst %.xml, %.9, $(BOOKS)) +mandocs: $(MAN) + +installmandocs: mandocs + $(MAKEMAN) install Documentation/DocBook/man + +### +#External programs used +KERNELDOC = scripts/kernel-doc +DOCPROC = scripts/basic/docproc +SPLITMAN = $(PERL) $(srctree)/scripts/split-man +MAKEMAN = $(PERL) $(srctree)/scripts/makeman + +### +# DOCPROC is used for two purposes: +# 1) To generate a dependency list for a .tmpl file +# 2) To preprocess a .tmpl file and call kernel-doc with +# appropriate parameters. +# The following rules are used to generate the .xml documentation +# required to generate the final targets. (ps, pdf, html). +quiet_cmd_docproc = DOCPROC $@ + cmd_docproc = SRCTREE=$(srctree)/ $(DOCPROC) doc $< >$@ +define rule_docproc + set -e; \ + $(if $($(quiet)cmd_$(1)),echo ' $($(quiet)cmd_$(1))';) \ + $(cmd_$(1)); \ + ( \ + echo 'cmd_$@ := $(cmd_$(1))'; \ + echo $@: `SRCTREE=$(srctree) $(DOCPROC) depend $<`; \ + ) > $(dir $@).$(notdir $@).cmd +endef + +%.xml: %.tmpl FORCE + $(call if_changed_rule,docproc) + +### +#Read in all saved dependency files +cmd_files := $(wildcard $(foreach f,$(BOOKS),$(dir $(f)).$(notdir $(f)).cmd)) + +ifneq ($(cmd_files),) + include $(cmd_files) +endif + +### +# Changes in kernel-doc force a rebuild of all documentation +$(BOOKS): $(KERNELDOC) + +### +# procfs guide uses a .c file as example code. +# This requires an explicit dependency +C-procfs-example = procfs_example.xml +C-procfs-example2 = $(addprefix $(obj)/,$(C-procfs-example)) +$(obj)/procfs-guide.xml: $(C-procfs-example2) + +### +# Rules to generate postscript, PDF and HTML +# db2html creates a directory. Generate a html file used for timestamp + +quiet_cmd_db2ps = DB2PS $@ + cmd_db2ps = db2ps -o $(dir $@) $< +%.ps : %.xml + @(which db2ps > /dev/null 2>&1) || \ + (echo "*** You need to install DocBook stylesheets ***"; \ + exit 1) + $(call cmd,db2ps) + +quiet_cmd_db2pdf = DB2PDF $@ + cmd_db2pdf = db2pdf -o $(dir $@) $< +%.pdf : %.xml + @(which db2pdf > /dev/null 2>&1) || \ + (echo "*** You need to install DocBook stylesheets ***"; \ + exit 1) + $(call cmd,db2pdf) + +quiet_cmd_db2html = DB2HTML $@ + cmd_db2html = db2html -o $(patsubst %.html,%,$@) $< && \ + echo '<a HREF="$(patsubst %.html,%,$(notdir $@))/book1.html"> \ + Goto $(patsubst %.html,%,$(notdir $@))</a><p>' > $@ + +%.html: %.xml + @(which db2html > /dev/null 2>&1) || \ + (echo "*** You need to install DocBook stylesheets ***"; \ + exit 1) + @rm -rf $@ $(patsubst %.html,%,$@) + $(call cmd,db2html) + @if [ ! 
-z "$(PNG-$(basename $(notdir $@)))" ]; then \ + cp $(PNG-$(basename $(notdir $@))) $(patsubst %.html,%,$@); fi + +### +# Rule to generate man files - output is placed in the man subdirectory + +%.9: %.xml +ifneq ($(KBUILD_SRC),) + $(Q)mkdir -p $(objtree)/Documentation/DocBook/man +endif + $(SPLITMAN) $< $(objtree)/Documentation/DocBook/man "$(VERSION).$(PATCHLEVEL).$(SUBLEVEL)" + $(MAKEMAN) convert $(objtree)/Documentation/DocBook/man $< + +### +# Rules to generate postscripts and PNG imgages from .fig format files +quiet_cmd_fig2eps = FIG2EPS $@ + cmd_fig2eps = fig2dev -Leps $< $@ + +%.eps: %.fig + @(which fig2dev > /dev/null 2>&1) || \ + (echo "*** You need to install transfig ***"; \ + exit 1) + $(call cmd,fig2eps) + +quiet_cmd_fig2png = FIG2PNG $@ + cmd_fig2png = fig2dev -Lpng $< $@ + +%.png: %.fig + @(which fig2dev > /dev/null 2>&1) || \ + (echo "*** You need to install transfig ***"; \ + exit 1) + $(call cmd,fig2png) + +### +# Rule to convert a .c file to inline XML documentation +%.xml: %.c + @echo ' GEN $@' + @( \ + echo "<programlisting>"; \ + expand --tabs=8 < $< | \ + sed -e "s/&/\\&/g" \ + -e "s/</\\</g" \ + -e "s/>/\\>/g"; \ + echo "</programlisting>") > $@ + +### +# Help targets as used by the top-level makefile +dochelp: + @echo ' Linux kernel internal documentation in different formats:' + @echo ' xmldocs (XML DocBook), psdocs (Postscript), pdfdocs (PDF)' + @echo ' htmldocs (HTML), mandocs (man pages, use installmandocs to install)' + +### +# Temporary files left by various tools +clean-files := $(DOCBOOKS) \ + $(patsubst %.xml, %.dvi, $(DOCBOOKS)) \ + $(patsubst %.xml, %.aux, $(DOCBOOKS)) \ + $(patsubst %.xml, %.tex, $(DOCBOOKS)) \ + $(patsubst %.xml, %.log, $(DOCBOOKS)) \ + $(patsubst %.xml, %.out, $(DOCBOOKS)) \ + $(patsubst %.xml, %.ps, $(DOCBOOKS)) \ + $(patsubst %.xml, %.pdf, $(DOCBOOKS)) \ + $(patsubst %.xml, %.html, $(DOCBOOKS)) \ + $(patsubst %.xml, %.9, $(DOCBOOKS)) \ + $(C-procfs-example) + +clean-dirs := $(patsubst %.xml,%,$(DOCBOOKS)) + +#man put files in man subdir - traverse down +subdir- := man/ diff --git a/Documentation/DocBook/deviceiobook.tmpl b/Documentation/DocBook/deviceiobook.tmpl new file mode 100644 index 000000000000..6f41f2f5c6f6 --- /dev/null +++ b/Documentation/DocBook/deviceiobook.tmpl @@ -0,0 +1,341 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" + "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []> + +<book id="DoingIO"> + <bookinfo> + <title>Bus-Independent Device Accesses</title> + + <authorgroup> + <author> + <firstname>Matthew</firstname> + <surname>Wilcox</surname> + <affiliation> + <address> + <email>matthew@wil.cx</email> + </address> + </affiliation> + </author> + </authorgroup> + + <authorgroup> + <author> + <firstname>Alan</firstname> + <surname>Cox</surname> + <affiliation> + <address> + <email>alan@redhat.com</email> + </address> + </affiliation> + </author> + </authorgroup> + + <copyright> + <year>2001</year> + <holder>Matthew Wilcox</holder> + </copyright> + + <legalnotice> + <para> + This documentation is free software; you can redistribute + it and/or modify it under the terms of the GNU General Public + License as published by the Free Software Foundation; either + version 2 of the License, or (at your option) any later + version. + </para> + + <para> + This program is distributed in the hope that it will be + useful, but WITHOUT ANY WARRANTY; without even the implied + warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
+ See the GNU General Public License for more details. + </para> + + <para> + You should have received a copy of the GNU General Public + License along with this program; if not, write to the Free + Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, + MA 02111-1307 USA + </para> + + <para> + For more details see the file COPYING in the source + distribution of Linux. + </para> + </legalnotice> + </bookinfo> + +<toc></toc> + + <chapter id="intro"> + <title>Introduction</title> + <para> + Linux provides an API which abstracts performing IO across all busses + and devices, allowing device drivers to be written independently of + bus type. + </para> + </chapter> + + <chapter id="bugs"> + <title>Known Bugs And Assumptions</title> + <para> + None. + </para> + </chapter> + + <chapter id="mmio"> + <title>Memory Mapped IO</title> + <sect1> + <title>Getting Access to the Device</title> + <para> + The most widely supported form of IO is memory mapped IO. + That is, a part of the CPU's address space is interpreted + not as accesses to memory, but as accesses to a device. Some + architectures define devices to be at a fixed address, but most + have some method of discovering devices. The PCI bus walk is a + good example of such a scheme. This document does not cover how + to receive such an address, but assumes you are starting with one. + Physical addresses are of type unsigned long. + </para> + + <para> + This address should not be used directly. Instead, to get an + address suitable for passing to the accessor functions described + below, you should call <function>ioremap</function>. + An address suitable for accessing the device will be returned to you. + </para> + + <para> + After you've finished using the device (say, in your module's + exit routine), call <function>iounmap</function> in order to return + the address space to the kernel. Most architectures allocate new + address space each time you call <function>ioremap</function>, and + they can run out unless you call <function>iounmap</function>. + </para> + </sect1> + + <sect1> + <title>Accessing the device</title> + <para> + The part of the interface most used by drivers is reading and + writing memory-mapped registers on the device. Linux provides + interfaces to read and write 8-bit, 16-bit, 32-bit and 64-bit + quantities. Due to a historical accident, these are named byte, + word, long and quad accesses. Both read and write accesses are + supported; there is no prefetch support at this time. + </para> + + <para> + The functions are named <function>readb</function>, + <function>readw</function>, <function>readl</function>, + <function>readq</function>, <function>readb_relaxed</function>, + <function>readw_relaxed</function>, <function>readl_relaxed</function>, + <function>readq_relaxed</function>, <function>writeb</function>, + <function>writew</function>, <function>writel</function> and + <function>writeq</function>. + </para> + + <para> + Some devices (such as framebuffers) would like to use larger + transfers than 8 bytes at a time. For these devices, the + <function>memcpy_toio</function>, <function>memcpy_fromio</function> + and <function>memset_io</function> functions are provided. + Do not use memset or memcpy on IO addresses; they + are not guaranteed to copy data in order. + </para> + + <para> + The read and write functions are defined to be ordered. That is the + compiler is not permitted to reorder the I/O sequence. 
When the + ordering can be compiler optimised, you can use <function> + __readb</function> and friends to indicate the relaxed ordering. Use + this with care. + </para> + + <para> + While the basic functions are defined to be synchronous with respect + to each other and ordered with respect to each other the busses the + devices sit on may themselves have asynchronicity. In particular many + authors are burned by the fact that PCI bus writes are posted + asynchronously. A driver author must issue a read from the same + device to ensure that writes have occurred in the specific cases the + author cares. This kind of property cannot be hidden from driver + writers in the API. In some cases, the read used to flush the device + may be expected to fail (if the card is resetting, for example). In + that case, the read should be done from config space, which is + guaranteed to soft-fail if the card doesn't respond. + </para> + + <para> + The following is an example of flushing a write to a device when + the driver would like to ensure the write's effects are visible prior + to continuing execution. + </para> + +<programlisting> +static inline void +qla1280_disable_intrs(struct scsi_qla_host *ha) +{ + struct device_reg *reg; + + reg = ha->iobase; + /* disable risc and host interrupts */ + WRT_REG_WORD(&reg->ictrl, 0); + /* + * The following read will ensure that the above write + * has been received by the device before we return from this + * function. + */ + RD_REG_WORD(&reg->ictrl); + ha->flags.ints_enabled = 0; +} +</programlisting> + + <para> + In addition to write posting, on some large multiprocessing systems + (e.g. SGI Challenge, Origin and Altix machines) posted writes won't + be strongly ordered coming from different CPUs. Thus it's important + to properly protect parts of your driver that do memory-mapped writes + with locks and use the <function>mmiowb</function> to make sure they + arrive in the order intended. Issuing a regular <function>readX + </function> will also ensure write ordering, but should only be used + when the driver has to be sure that the write has actually arrived + at the device (not that it's simply ordered with respect to other + writes), since a full <function>readX</function> is a relatively + expensive operation. + </para> + + <para> + Generally, one should use <function>mmiowb</function> prior to + releasing a spinlock that protects regions using <function>writeb + </function> or similar functions that aren't surrounded by <function> + readb</function> calls, which will ensure ordering and flushing. The + following pseudocode illustrates what might occur if write ordering + isn't guaranteed via <function>mmiowb</function> or one of the + <function>readX</function> functions. + </para> + +<programlisting> +CPU A: spin_lock_irqsave(&dev_lock, flags) +CPU A: ... +CPU A: writel(newval, ring_ptr); +CPU A: spin_unlock_irqrestore(&dev_lock, flags) + ... +CPU B: spin_lock_irqsave(&dev_lock, flags) +CPU B: writel(newval2, ring_ptr); +CPU B: ... +CPU B: spin_unlock_irqrestore(&dev_lock, flags) +</programlisting> + + <para> + In the case above, newval2 could be written to ring_ptr before + newval. Fixing it is easy though: + </para> + +<programlisting> +CPU A: spin_lock_irqsave(&dev_lock, flags) +CPU A: ... +CPU A: writel(newval, ring_ptr); +CPU A: mmiowb(); /* ensure no other writes beat us to the device */ +CPU A: spin_unlock_irqrestore(&dev_lock, flags) + ... +CPU B: spin_lock_irqsave(&dev_lock, flags) +CPU B: writel(newval2, ring_ptr); +CPU B: ... 
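+CPU B: /* the same barrier is needed here, before the unlock */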
+CPU B: mmiowb(); +CPU B: spin_unlock_irqrestore(&dev_lock, flags) +</programlisting> + + <para> + See tg3.c for a real world example of how to use <function>mmiowb + </function> + </para> + + <para> + PCI ordering rules also guarantee that PIO read responses arrive + after any outstanding DMA writes from that bus, since for some devices + the result of a <function>readb</function> call may signal to the + driver that a DMA transaction is complete. In many cases, however, + the driver may want to indicate that the next + <function>readb</function> call has no relation to any previous DMA + writes performed by the device. The driver can use + <function>readb_relaxed</function> for these cases, although only + some platforms will honor the relaxed semantics. Using the relaxed + read functions will provide significant performance benefits on + platforms that support it. The qla2xxx driver provides examples + of how to use <function>readX_relaxed</function>. In many cases, + a majority of the driver's <function>readX</function> calls can + safely be converted to <function>readX_relaxed</function> calls, since + only a few will indicate or depend on DMA completion. + </para> + </sect1> + + <sect1> + <title>ISA legacy functions</title> + <para> + On older kernels (2.2 and earlier) the ISA bus could be read or + written with these functions and without ioremap being used. This is + no longer true in Linux 2.4. A set of equivalent functions exist for + easy legacy driver porting. The functions available are prefixed + with 'isa_' and are <function>isa_readb</function>, + <function>isa_writeb</function>, <function>isa_readw</function>, + <function>isa_writew</function>, <function>isa_readl</function>, + <function>isa_writel</function>, <function>isa_memcpy_fromio</function> + and <function>isa_memcpy_toio</function> + </para> + <para> + These functions should not be used in new drivers, and will + eventually be going away. + </para> + </sect1> + + </chapter> + + <chapter> + <title>Port Space Accesses</title> + <sect1> + <title>Port Space Explained</title> + + <para> + Another form of IO commonly supported is Port Space. This is a + range of addresses separate to the normal memory address space. + Access to these addresses is generally not as fast as accesses + to the memory mapped addresses, and it also has a potentially + smaller address space. + </para> + + <para> + Unlike memory mapped IO, no preparation is required + to access port space. + </para> + + </sect1> + <sect1> + <title>Accessing Port Space</title> + <para> + Accesses to this space are provided through a set of functions + which allow 8-bit, 16-bit and 32-bit accesses; also + known as byte, word and long. These functions are + <function>inb</function>, <function>inw</function>, + <function>inl</function>, <function>outb</function>, + <function>outw</function> and <function>outl</function>. + </para> + + <para> + Some variants are provided for these functions. Some devices + require that accesses to their ports are slowed down. This + functionality is provided by appending a <function>_p</function> + to the end of the function. There are also equivalents to memcpy. + The <function>ins</function> and <function>outs</function> + functions copy bytes, words or longs to the given port. 
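+	</para>
+
+	<para>
+	As a hedged sketch (the port base and register offsets below are
+	hypothetical), byte-wide port accesses look like this:
+	</para>
+
+<programlisting>
+#define MYCARD_IOBASE	0x300	/* hypothetical, discovered at probe time */
+#define MYCARD_STATUS	0x00
+#define MYCARD_COMMAND	0x01
+
+static int mycard_busy(void)
+{
+	/* 8-bit read from the status port */
+	return inb(MYCARD_IOBASE + MYCARD_STATUS) != 0;
+}
+
+static void mycard_kick(void)
+{
+	/* 8-bit write; outb_p() would add the slow-down described above */
+	outb(0x01, MYCARD_IOBASE + MYCARD_COMMAND);
+}
+</programlisting>
+
+	<para>
+	This is only an illustration; a real driver reserves its port range
+	(for example with <function>request_region</function>) rather than
+	hard-coding it.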
+ </para> + </sect1> + + </chapter> + + <chapter id="pubfunctions"> + <title>Public Functions Provided</title> +!Einclude/asm-i386/io.h + </chapter> + +</book> diff --git a/Documentation/DocBook/gadget.tmpl b/Documentation/DocBook/gadget.tmpl new file mode 100644 index 000000000000..a34442436128 --- /dev/null +++ b/Documentation/DocBook/gadget.tmpl @@ -0,0 +1,752 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" + "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []> + +<book id="USB-Gadget-API"> + <bookinfo> + <title>USB Gadget API for Linux</title> + <date>20 August 2004</date> + <edition>20 August 2004</edition> + + <legalnotice> + <para> + This documentation is free software; you can redistribute + it and/or modify it under the terms of the GNU General Public + License as published by the Free Software Foundation; either + version 2 of the License, or (at your option) any later + version. + </para> + + <para> + This program is distributed in the hope that it will be + useful, but WITHOUT ANY WARRANTY; without even the implied + warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + See the GNU General Public License for more details. + </para> + + <para> + You should have received a copy of the GNU General Public + License along with this program; if not, write to the Free + Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, + MA 02111-1307 USA + </para> + + <para> + For more details see the file COPYING in the source + distribution of Linux. + </para> + </legalnotice> + <copyright> + <year>2003-2004</year> + <holder>David Brownell</holder> + </copyright> + + <author> + <firstname>David</firstname> + <surname>Brownell</surname> + <affiliation> + <address><email>dbrownell@users.sourceforge.net</email></address> + </affiliation> + </author> + </bookinfo> + +<toc></toc> + +<chapter><title>Introduction</title> + +<para>This document presents a Linux-USB "Gadget" +kernel mode +API, for use within peripherals and other USB devices +that embed Linux. +It provides an overview of the API structure, +and shows how that fits into a system development project. +This is the first such API released on Linux to address +a number of important problems, including: </para> + +<itemizedlist> + <listitem><para>Supports USB 2.0, for high speed devices which + can stream data at several dozen megabytes per second. + </para></listitem> + <listitem><para>Handles devices with dozens of endpoints just as + well as ones with just two fixed-function ones. Gadget drivers + can be written so they're easy to port to new hardware. + </para></listitem> + <listitem><para>Flexible enough to expose more complex USB device + capabilities such as multiple configurations, multiple interfaces, + composite devices, + and alternate interface settings. + </para></listitem> + <listitem><para>USB "On-The-Go" (OTG) support, in conjunction + with updates to the Linux-USB host side. + </para></listitem> + <listitem><para>Sharing data structures and API models with the + Linux-USB host side API. This helps the OTG support, and + looks forward to more-symmetric frameworks (where the same + I/O model is used by both host and device side drivers). + </para></listitem> + <listitem><para>Minimalist, so it's easier to support new device + controller hardware. I/O processing doesn't imply large + demands for memory or CPU resources. 
+ </para></listitem> +</itemizedlist> + + +<para>Most Linux developers will not be able to use this API, since they +have USB "host" hardware in a PC, workstation, or server. +Linux users with embedded systems are more likely to +have USB peripheral hardware. +To distinguish drivers running inside such hardware from the +more familiar Linux "USB device drivers", +which are host side proxies for the real USB devices, +a different term is used: +the drivers inside the peripherals are "USB gadget drivers". +In USB protocol interactions, the device driver is the master +(or "client driver") +and the gadget driver is the slave (or "function driver"). +</para> + +<para>The gadget API resembles the host side Linux-USB API in that both +use queues of request objects to package I/O buffers, and those requests +may be submitted or canceled. +They share common definitions for the standard USB +<emphasis>Chapter 9</emphasis> messages, structures, and constants. +Also, both APIs bind and unbind drivers to devices. +The APIs differ in detail, since the host side's current +URB framework exposes a number of implementation details +and assumptions that are inappropriate for a gadget API. +While the model for control transfers and configuration +management is necessarily different (one side is a hardware-neutral master, +the other is a hardware-aware slave), the endpoint I/0 API used here +should also be usable for an overhead-reduced host side API. +</para> + +</chapter> + +<chapter id="structure"><title>Structure of Gadget Drivers</title> + +<para>A system running inside a USB peripheral +normally has at least three layers inside the kernel to handle +USB protocol processing, and may have additional layers in +user space code. +The "gadget" API is used by the middle layer to interact +with the lowest level (which directly handles hardware). +</para> + +<para>In Linux, from the bottom up, these layers are: +</para> + +<variablelist> + + <varlistentry> + <term><emphasis>USB Controller Driver</emphasis></term> + + <listitem> + <para>This is the lowest software level. + It is the only layer that talks to hardware, + through registers, fifos, dma, irqs, and the like. + The <filename><linux/usb_gadget.h></filename> API abstracts + the peripheral controller endpoint hardware. + That hardware is exposed through endpoint objects, which accept + streams of IN/OUT buffers, and through callbacks that interact + with gadget drivers. + Since normal USB devices only have one upstream + port, they only have one of these drivers. + The controller driver can support any number of different + gadget drivers, but only one of them can be used at a time. + </para> + + <para>Examples of such controller hardware include + the PCI-based NetChip 2280 USB 2.0 high speed controller, + the SA-11x0 or PXA-25x UDC (found within many PDAs), + and a variety of other products. + </para> + + </listitem></varlistentry> + + <varlistentry> + <term><emphasis>Gadget Driver</emphasis></term> + + <listitem> + <para>The lower boundary of this driver implements hardware-neutral + USB functions, using calls to the controller driver. + Because such hardware varies widely in capabilities and restrictions, + and is used in embedded environments where space is at a premium, + the gadget driver is often configured at compile time + to work with endpoints supported by one particular controller. + Gadget drivers may be portable to several different controllers, + using conditional compilation. 
+ (Recent kernels substantially simplify the work involved in + supporting new hardware, by <emphasis>autoconfiguring</emphasis> + endpoints automatically for many bulk-oriented drivers.) + Gadget driver responsibilities include: + </para> + <itemizedlist> + <listitem><para>handling setup requests (ep0 protocol responses) + possibly including class-specific functionality + </para></listitem> + <listitem><para>returning configuration and string descriptors + </para></listitem> + <listitem><para>(re)setting configurations and interface + altsettings, including enabling and configuring endpoints + </para></listitem> + <listitem><para>handling life cycle events, such as managing + bindings to hardware, + USB suspend/resume, remote wakeup, + and disconnection from the USB host. + </para></listitem> + <listitem><para>managing IN and OUT transfers on all currently + enabled endpoints + </para></listitem> + </itemizedlist> + + <para> + Such drivers may be modules of proprietary code, although + that approach is discouraged in the Linux community. + </para> + </listitem></varlistentry> + + <varlistentry> + <term><emphasis>Upper Level</emphasis></term> + + <listitem> + <para>Most gadget drivers have an upper boundary that connects + to some Linux driver or framework in Linux. + Through that boundary flows the data which the gadget driver + produces and/or consumes through protocol transfers over USB. + Examples include: + </para> + <itemizedlist> + <listitem><para>user mode code, using generic (gadgetfs) + or application specific files in + <filename>/dev</filename> + </para></listitem> + <listitem><para>networking subsystem (for network gadgets, + like the CDC Ethernet Model gadget driver) + </para></listitem> + <listitem><para>data capture drivers, perhaps video4Linux or + a scanner driver; or test and measurement hardware. + </para></listitem> + <listitem><para>input subsystem (for HID gadgets) + </para></listitem> + <listitem><para>sound subsystem (for audio gadgets) + </para></listitem> + <listitem><para>file system (for PTP gadgets) + </para></listitem> + <listitem><para>block i/o subsystem (for usb-storage gadgets) + </para></listitem> + <listitem><para>... and more </para></listitem> + </itemizedlist> + </listitem></varlistentry> + + <varlistentry> + <term><emphasis>Additional Layers</emphasis></term> + + <listitem> + <para>Other layers may exist. + These could include kernel layers, such as network protocol stacks, + as well as user mode applications building on standard POSIX + system call APIs such as + <emphasis>open()</emphasis>, <emphasis>close()</emphasis>, + <emphasis>read()</emphasis> and <emphasis>write()</emphasis>. + On newer systems, POSIX Async I/O calls may be an option. + Such user mode code will not necessarily be subject to + the GNU General Public License (GPL). + </para> + </listitem></varlistentry> + + +</variablelist> + +<para>OTG-capable systems will also need to include a standard Linux-USB +host side stack, +with <emphasis>usbcore</emphasis>, +one or more <emphasis>Host Controller Drivers</emphasis> (HCDs), +<emphasis>USB Device Drivers</emphasis> to support +the OTG "Targeted Peripheral List", +and so forth. +There will also be an <emphasis>OTG Controller Driver</emphasis>, +which is visible to gadget and device driver developers only indirectly. +That helps the host and device side USB controllers implement the +two new OTG protocols (HNP and SRP). 
+Roles switch (host to peripheral, or vice versa) using HNP +during USB suspend processing, and SRP can be viewed as a +more battery-friendly kind of device wakeup protocol. +</para> + +<para>Over time, reusable utilities are evolving to help make some +gadget driver tasks simpler. +For example, building configuration descriptors from vectors of +descriptors for the configurations interfaces and endpoints is +now automated, and many drivers now use autoconfiguration to +choose hardware endpoints and initialize their descriptors. + +A potential example of particular interest +is code implementing standard USB-IF protocols for +HID, networking, storage, or audio classes. +Some developers are interested in KDB or KGDB hooks, to let +target hardware be remotely debugged. +Most such USB protocol code doesn't need to be hardware-specific, +any more than network protocols like X11, HTTP, or NFS are. +Such gadget-side interface drivers should eventually be combined, +to implement composite devices. +</para> + +</chapter> + + +<chapter id="api"><title>Kernel Mode Gadget API</title> + +<para>Gadget drivers declare themselves through a +<emphasis>struct usb_gadget_driver</emphasis>, which is responsible for +most parts of enumeration for a <emphasis>struct usb_gadget</emphasis>. +The response to a set_configuration usually involves +enabling one or more of the <emphasis>struct usb_ep</emphasis> objects +exposed by the gadget, and submitting one or more +<emphasis>struct usb_request</emphasis> buffers to transfer data. +Understand those four data types, and their operations, and +you will understand how this API works. +</para> + +<note><title>Incomplete Data Type Descriptions</title> + +<para>This documentation was prepared using the standard Linux +kernel <filename>docproc</filename> tool, which turns text +and in-code comments into SGML DocBook and then into usable +formats such as HTML or PDF. +Other than the "Chapter 9" data types, most of the significant +data types and functions are described here. +</para> + +<para>However, docproc does not understand all the C constructs +that are used, so some relevant information is likely omitted from +what you are reading. +One example of such information is endpoint autoconfiguration. +You'll have to read the header file, and use example source +code (such as that for "Gadget Zero"), to fully understand the API. +</para> + +<para>The part of the API implementing some basic +driver capabilities is specific to the version of the +Linux kernel that's in use. +The 2.6 kernel includes a <emphasis>driver model</emphasis> +framework that has no analogue on earlier kernels; +so those parts of the gadget API are not fully portable. +(They are implemented on 2.4 kernels, but in a different way.) +The driver model state is another part of this API that is +ignored by the kerneldoc tools. +</para> +</note> + +<para>The core API does not expose +every possible hardware feature, only the most widely available ones. +There are significant hardware features, such as device-to-device DMA +(without temporary storage in a memory buffer) +that would be added using hardware-specific APIs. +</para> + +<para>This API allows drivers to use conditional compilation to handle +endpoint capabilities of different hardware, but doesn't require that. +Hardware tends to have arbitrary restrictions, relating to +transfer types, addressing, packet sizes, buffering, and availability. 
+As a rule, such differences only matter for "endpoint zero" logic +that handles device configuration and management. +The API supports limited run-time +detection of capabilities, through naming conventions for endpoints. +Many drivers will be able to at least partially autoconfigure +themselves. +In particular, driver init sections will often have endpoint +autoconfiguration logic that scans the hardware's list of endpoints +to find ones matching the driver requirements +(relying on those conventions), to eliminate some of the most +common reasons for conditional compilation. +</para> + +<para>Like the Linux-USB host side API, this API exposes +the "chunky" nature of USB messages: I/O requests are in terms +of one or more "packets", and packet boundaries are visible to drivers. +Compared to RS-232 serial protocols, USB resembles +synchronous protocols like HDLC +(N bytes per frame, multipoint addressing, host as the primary +station and devices as secondary stations) +more than asynchronous ones +(tty style: 8 data bits per frame, no parity, one stop bit). +So for example the controller drivers won't buffer +two single byte writes into a single two-byte USB IN packet, +although gadget drivers may do so when they implement +protocols where packet boundaries (and "short packets") +are not significant. +</para> + +<sect1 id="lifecycle"><title>Driver Life Cycle</title> + +<para>Gadget drivers make endpoint I/O requests to hardware without +needing to know many details of the hardware, but driver +setup/configuration code needs to handle some differences. +Use the API like this: +</para> + +<orderedlist numeration='arabic'> + +<listitem><para>Register a driver for the particular device side +usb controller hardware, +such as the net2280 on PCI (USB 2.0), +sa11x0 or pxa25x as found in Linux PDAs, +and so on. +At this point the device is logically in the USB ch9 initial state +("attached"), drawing no power and not usable +(since it does not yet support enumeration). +Any host should not see the device, since it's not +activated the data line pullup used by the host to +detect a device, even if VBUS power is available. +</para></listitem> + +<listitem><para>Register a gadget driver that implements some higher level +device function. That will then bind() to a usb_gadget, which +activates the data line pullup sometime after detecting VBUS. +</para></listitem> + +<listitem><para>The hardware driver can now start enumerating. +The steps it handles are to accept USB power and set_address requests. +Other steps are handled by the gadget driver. +If the gadget driver module is unloaded before the host starts to +enumerate, steps before step 7 are skipped. +</para></listitem> + +<listitem><para>The gadget driver's setup() call returns usb descriptors, +based both on what the bus interface hardware provides and on the +functionality being implemented. +That can involve alternate settings or configurations, +unless the hardware prevents such operation. +For OTG devices, each configuration descriptor includes +an OTG descriptor. +</para></listitem> + +<listitem><para>The gadget driver handles the last step of enumeration, +when the USB host issues a set_configuration call. +It enables all endpoints used in that configuration, +with all interfaces in their default settings. +That involves using a list of the hardware's endpoints, enabling each +endpoint according to its descriptor. 
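+</para>
+
+<para>As a hedged sketch (the field names and the error label are
+hypothetical), enabling one endpoint at this step might look like:
+</para>
+
+<programlisting>
+/* dev->in_ep and dev->in_desc are hypothetical fields holding the
+ * struct usb_ep chosen earlier and the endpoint descriptor being
+ * activated for this configuration.
+ */
+status = usb_ep_enable(dev->in_ep, dev->in_desc);
+if (status)
+	goto stop_activity;
+</programlisting>
+
+<para>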
+It may also involve using <function>usb_gadget_vbus_draw</function> +to let more power be drawn from VBUS, as allowed by that configuration. +For OTG devices, setting a configuration may also involve reporting +HNP capabilities through a user interface. +</para></listitem> + +<listitem><para>Do real work and perform data transfers, possibly involving +changes to interface settings or switching to new configurations, until the +device is disconnect()ed from the host. +Queue any number of transfer requests to each endpoint. +It may be suspended and resumed several times before being disconnected. +On disconnect, the drivers go back to step 3 (above). +</para></listitem> + +<listitem><para>When the gadget driver module is being unloaded, +the driver unbind() callback is issued. That lets the controller +driver be unloaded. +</para></listitem> + +</orderedlist> + +<para>Drivers will normally be arranged so that just loading the +gadget driver module (or statically linking it into a Linux kernel) +allows the peripheral device to be enumerated, but some drivers +will defer enumeration until some higher level component (like +a user mode daemon) enables it. +Note that at this lowest level there are no policies about how +ep0 configuration logic is implemented, +except that it should obey USB specifications. +Such issues are in the domain of gadget drivers, +including knowing about implementation constraints +imposed by some USB controllers +or understanding that composite devices might happen to +be built by integrating reusable components. +</para> + +<para>Note that the lifecycle above can be slightly different +for OTG devices. +Other than providing an additional OTG descriptor in each +configuration, only the HNP-related differences are particularly +visible to driver code. +They involve reporting requirements during the SET_CONFIGURATION +request, and the option to invoke HNP during some suspend callbacks. +Also, SRP changes the semantics of +<function>usb_gadget_wakeup</function> +slightly. +</para> + +</sect1> + +<sect1 id="ch9"><title>USB 2.0 Chapter 9 Types and Constants</title> + +<para>Gadget drivers +rely on common USB structures and constants +defined in the +<filename><linux/usb_ch9.h></filename> +header file, which is standard in Linux 2.6 kernels. +These are the same types and constants used by host +side drivers (and usbcore). +</para> + +!Iinclude/linux/usb_ch9.h +</sect1> + +<sect1 id="core"><title>Core Objects and Methods</title> + +<para>These are declared in +<filename><linux/usb_gadget.h></filename>, +and are used by gadget drivers to interact with +USB peripheral controller drivers. +</para> + + <!-- yeech, this is ugly in nsgmls PDF output. + + the PDF bookmark and refentry output nesting is wrong, + and the member/argument documentation indents ugly. + + plus something (docproc?) adds whitespace before the + descriptive paragraph text, so it can't line up right + unless the explanations are trivial. + --> + +!Iinclude/linux/usb_gadget.h +</sect1> + +<sect1 id="utils"><title>Optional Utilities</title> + +<para>The core API is sufficient for writing a USB Gadget Driver, +but some optional utilities are provided to simplify common tasks. +These utilities include endpoint autoconfiguration. 
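+</para>
+
+<para>As a rough sketch of what endpoint autoconfiguration looks like in
+a gadget driver's <function>bind</function> routine (the descriptor and
+device names are hypothetical; see "Gadget Zero" for a complete example):
+</para>
+
+<programlisting>
+struct usb_ep *ep;
+
+/* fs_bulk_in_desc is a hypothetical bulk-IN endpoint descriptor the
+ * driver declared; pick a hardware endpoint that can implement it and
+ * fill in its endpoint address.
+ */
+ep = usb_ep_autoconfig(gadget, &fs_bulk_in_desc);
+if (!ep)
+	goto autoconf_fail;
+ep->driver_data = dev;	/* claim the endpoint */
+</programlisting>
+
+<para>The helpers documented below cover related tasks such as string
+descriptors and configuration buffers.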
+</para> + +!Edrivers/usb/gadget/usbstring.c +!Edrivers/usb/gadget/config.c +<!-- !Edrivers/usb/gadget/epautoconf.c --> +</sect1> + +</chapter> + +<chapter id="controllers"><title>Peripheral Controller Drivers</title> + +<para>The first hardware supporting this API was the NetChip 2280 +controller, which supports USB 2.0 high speed and is based on PCI. +This is the <filename>net2280</filename> driver module. +The driver supports Linux kernel versions 2.4 and 2.6; +contact NetChip Technologies for development boards and product +information. +</para> + +<para>Other hardware working in the "gadget" framework includes: +Intel's PXA 25x and IXP42x series processors +(<filename>pxa2xx_udc</filename>), +Toshiba TC86c001 "Goku-S" (<filename>goku_udc</filename>), +Renesas SH7705/7727 (<filename>sh_udc</filename>), +MediaQ 11xx (<filename>mq11xx_udc</filename>), +Hynix HMS30C7202 (<filename>h7202_udc</filename>), +National 9303/4 (<filename>n9604_udc</filename>), +Texas Instruments OMAP (<filename>omap_udc</filename>), +Sharp LH7A40x (<filename>lh7a40x_udc</filename>), +and more. +Most of those are full speed controllers. +</para> + +<para>At this writing, there are people at work on drivers in +this framework for several other USB device controllers, +with plans to make many of them be widely available. +</para> + +<!-- !Edrivers/usb/gadget/net2280.c --> + +<para>A partial USB simulator, +the <filename>dummy_hcd</filename> driver, is available. +It can act like a net2280, a pxa25x, or an sa11x0 in terms +of available endpoints and device speeds; and it simulates +control, bulk, and to some extent interrupt transfers. +That lets you develop some parts of a gadget driver on a normal PC, +without any special hardware, and perhaps with the assistance +of tools such as GDB running with User Mode Linux. +At least one person has expressed interest in adapting that +approach, hooking it up to a simulator for a microcontroller. +Such simulators can help debug subsystems where the runtime hardware +is unfriendly to software development, or is not yet available. +</para> + +<para>Support for other controllers is expected to be developed +and contributed +over time, as this driver framework evolves. +</para> + +</chapter> + +<chapter id="gadget"><title>Gadget Drivers</title> + +<para>In addition to <emphasis>Gadget Zero</emphasis> +(used primarily for testing and development with drivers +for usb controller hardware), other gadget drivers exist. +</para> + +<para>There's an <emphasis>ethernet</emphasis> gadget +driver, which implements one of the most useful +<emphasis>Communications Device Class</emphasis> (CDC) models. +One of the standards for cable modem interoperability even +specifies the use of this ethernet model as one of two +mandatory options. +Gadgets using this code look to a USB host as if they're +an Ethernet adapter. +It provides access to a network where the gadget's CPU is one host, +which could easily be bridging, routing, or firewalling +access to other networks. +Since some hardware can't fully implement the CDC Ethernet +requirements, this driver also implements a "good parts only" +subset of CDC Ethernet. +(That subset doesn't advertise itself as CDC Ethernet, +to avoid creating problems.) +</para> + +<para>Support for Microsoft's <emphasis>RNDIS</emphasis> +protocol has been contributed by Pengutronix and Auerswald GmbH. +This is like CDC Ethernet, but it runs on more slightly USB hardware +(but less than the CDC subset). 
+However, its main claim to fame is being able to connect directly to +recent versions of Windows, using drivers that Microsoft bundles +and supports, making it much simpler to network with Windows. +</para> + +<para>There is also support for user mode gadget drivers, +using <emphasis>gadgetfs</emphasis>. +This provides a <emphasis>User Mode API</emphasis> that presents +each endpoint as a single file descriptor. I/O is done using +normal <emphasis>read()</emphasis> and <emphasis>read()</emphasis> calls. +Familiar tools like GDB and pthreads can be used to +develop and debug user mode drivers, so that once a robust +controller driver is available many applications for it +won't require new kernel mode software. +Linux 2.6 <emphasis>Async I/O (AIO)</emphasis> +support is available, so that user mode software +can stream data with only slightly more overhead +than a kernel driver. +</para> + +<para>There's a USB Mass Storage class driver, which provides +a different solution for interoperability with systems such +as MS-Windows and MacOS. +That <emphasis>File-backed Storage</emphasis> driver uses a +file or block device as backing store for a drive, +like the <filename>loop</filename> driver. +The USB host uses the BBB, CB, or CBI versions of the mass +storage class specification, using transparent SCSI commands +to access the data from the backing store. +</para> + +<para>There's a "serial line" driver, useful for TTY style +operation over USB. +The latest version of that driver supports CDC ACM style +operation, like a USB modem, and so on most hardware it can +interoperate easily with MS-Windows. +One interesting use of that driver is in boot firmware (like a BIOS), +which can sometimes use that model with very small systems without +real serial lines. +</para> + +<para>Support for other kinds of gadget is expected to +be developed and contributed +over time, as this driver framework evolves. +</para> + +</chapter> + +<chapter id="otg"><title>USB On-The-GO (OTG)</title> + +<para>USB OTG support on Linux 2.6 was initially developed +by Texas Instruments for +<ulink url="http://www.omap.com">OMAP</ulink> 16xx and 17xx +series processors. +Other OTG systems should work in similar ways, but the +hardware level details could be very different. +</para> + +<para>Systems need specialized hardware support to implement OTG, +notably including a special <emphasis>Mini-AB</emphasis> jack +and associated transciever to support <emphasis>Dual-Role</emphasis> +operation: +they can act either as a host, using the standard +Linux-USB host side driver stack, +or as a peripheral, using this "gadget" framework. +To do that, the system software relies on small additions +to those programming interfaces, +and on a new internal component (here called an "OTG Controller") +affecting which driver stack connects to the OTG port. +In each role, the system can re-use the existing pool of +hardware-neutral drivers, layered on top of the controller +driver interfaces (<emphasis>usb_bus</emphasis> or +<emphasis>usb_gadget</emphasis>). +Such drivers need at most minor changes, and most of the calls +added to support OTG can also benefit non-OTG products. +</para> + +<itemizedlist> + <listitem><para>Gadget drivers test the <emphasis>is_otg</emphasis> + flag, and use it to determine whether or not to include + an OTG descriptor in each of their configurations. 
+ </para></listitem> + <listitem><para>Gadget drivers may need changes to support the + two new OTG protocols, exposed in new gadget attributes + such as <emphasis>b_hnp_enable</emphasis> flag. + HNP support should be reported through a user interface + (two LEDs could suffice), and is triggered in some cases + when the host suspends the peripheral. + SRP support can be user-initiated just like remote wakeup, + probably by pressing the same button. + </para></listitem> + <listitem><para>On the host side, USB device drivers need + to be taught to trigger HNP at appropriate moments, using + <function>usb_suspend_device()</function>. + That also conserves battery power, which is useful even + for non-OTG configurations. + </para></listitem> + <listitem><para>Also on the host side, a driver must support the + OTG "Targeted Peripheral List". That's just a whitelist, + used to reject peripherals not supported with a given + Linux OTG host. + <emphasis>This whitelist is product-specific; + each product must modify <filename>otg_whitelist.h</filename> + to match its interoperability specification. + </emphasis> + </para> + <para>Non-OTG Linux hosts, like PCs and workstations, + normally have some solution for adding drivers, so that + peripherals that aren't recognized can eventually be supported. + That approach is unreasonable for consumer products that may + never have their firmware upgraded, and where it's usually + unrealistic to expect traditional PC/workstation/server kinds + of support model to work. + For example, it's often impractical to change device firmware + once the product has been distributed, so driver bugs can't + normally be fixed if they're found after shipment. + </para></listitem> +</itemizedlist> + +<para> +Additional changes are needed below those hardware-neutral +<emphasis>usb_bus</emphasis> and <emphasis>usb_gadget</emphasis> +driver interfaces; those aren't discussed here in any detail. +Those affect the hardware-specific code for each USB Host or Peripheral +controller, and how the HCD initializes (since OTG can be active only +on a single port). +They also involve what may be called an <emphasis>OTG Controller +Driver</emphasis>, managing the OTG transceiver and the OTG state +machine logic as well as much of the root hub behavior for the +OTG port. +The OTG controller driver needs to activate and deactivate USB +controllers depending on the relevant device role. +Some related changes were needed inside usbcore, so that it +can identify OTG-capable devices and respond appropriately +to HNP or SRP protocols. 
+</para> + +</chapter> + +</book> +<!-- + vim:syntax=sgml:sw=4 +--> diff --git a/Documentation/DocBook/journal-api.tmpl b/Documentation/DocBook/journal-api.tmpl new file mode 100644 index 000000000000..1ef6f43c6d8f --- /dev/null +++ b/Documentation/DocBook/journal-api.tmpl @@ -0,0 +1,333 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" + "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []> + +<book id="LinuxJBDAPI"> + <bookinfo> + <title>The Linux Journalling API</title> + <authorgroup> + <author> + <firstname>Roger</firstname> + <surname>Gammans</surname> + <affiliation> + <address> + <email>rgammans@computer-surgery.co.uk</email> + </address> + </affiliation> + </author> + </authorgroup> + + <authorgroup> + <author> + <firstname>Stephen</firstname> + <surname>Tweedie</surname> + <affiliation> + <address> + <email>sct@redhat.com</email> + </address> + </affiliation> + </author> + </authorgroup> + + <copyright> + <year>2002</year> + <holder>Roger Gammans</holder> + </copyright> + +<legalnotice> + <para> + This documentation is free software; you can redistribute + it and/or modify it under the terms of the GNU General Public + License as published by the Free Software Foundation; either + version 2 of the License, or (at your option) any later + version. + </para> + + <para> + This program is distributed in the hope that it will be + useful, but WITHOUT ANY WARRANTY; without even the implied + warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + See the GNU General Public License for more details. + </para> + + <para> + You should have received a copy of the GNU General Public + License along with this program; if not, write to the Free + Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, + MA 02111-1307 USA + </para> + + <para> + For more details see the file COPYING in the source + distribution of Linux. + </para> + </legalnotice> + </bookinfo> + +<toc></toc> + + <chapter id="Overview"> + <title>Overview</title> + <sect1> + <title>Details</title> +<para> +The journalling layer is easy to use. You need to +first of all create a journal_t data structure. There are +two calls to do this dependent on how you decide to allocate the physical +media on which the journal resides. The journal_init_inode() call +is for journals stored in filesystem inodes, or the journal_init_dev() +call can be use for journal stored on a raw device (in a continuous range +of blocks). A journal_t is a typedef for a struct pointer, so when +you are finally finished make sure you call journal_destroy() on it +to free up any used kernel memory. +</para> + +<para> +Once you have got your journal_t object you need to 'mount' or load the journal +file, unless of course you haven't initialised it yet - in which case you +need to call journal_create(). +</para> + +<para> +Most of the time however your journal file will already have been created, but +before you load it you must call journal_wipe() to empty the journal file. +Hang on, you say , what if the filesystem wasn't cleanly umount()'d . Well, it is the +job of the client file system to detect this and skip the call to journal_wipe(). +</para> + +<para> +In either case the next call should be to journal_load() which prepares the +journal file for use. Note that journal_wipe(..,0) calls journal_skip_recovery() +for you if it detects any outstanding transactions in the journal and similarly +journal_load() will call journal_recover() if necessary. 
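+</para>
+
+<para>
+A hedged sketch of that load sequence (error handling trimmed;
+journal_inode and the "clean" flag are whatever your filesystem uses to
+locate its journal and to record a clean unmount):
+</para>
+
+<programlisting>
+	journal_t *journal;
+
+	journal = journal_init_inode(journal_inode);
+	if (!journal)
+		return -EINVAL;
+
+	if (clean)			/* cleanly unmounted last time? */
+		journal_wipe(journal, 0);	/* empty the stale journal */
+
+	if (journal_load(journal)) {	/* recovers transactions if needed */
+		journal_destroy(journal);
+		return -EIO;
+	}
+</programlisting>
+
+<para>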
+I would advise reading fs/ext3/super.c for examples on this stage. +[RGG: Why is the journal_wipe() call necessary - doesn't this needlessly +complicate the API. Or isn't a good idea for the journal layer to hide +dirty mounts from the client fs] +</para> + +<para> +Now you can go ahead and start modifying the underlying +filesystem. Almost. +</para> + + +<para> + +You still need to actually journal your filesystem changes, this +is done by wrapping them into transactions. Additionally you +also need to wrap the modification of each of the the buffers +with calls to the journal layer, so it knows what the modifications +you are actually making are. To do this use journal_start() which +returns a transaction handle. +</para> + +<para> +journal_start() +and its counterpart journal_stop(), which indicates the end of a transaction +are nestable calls, so you can reenter a transaction if necessary, +but remember you must call journal_stop() the same number of times as +journal_start() before the transaction is completed (or more accurately +leaves the the update phase). Ext3/VFS makes use of this feature to simplify +quota support. +</para> + +<para> +Inside each transaction you need to wrap the modifications to the +individual buffers (blocks). Before you start to modify a buffer you +need to call journal_get_{create,write,undo}_access() as appropriate, +this allows the journalling layer to copy the unmodified data if it +needs to. After all the buffer may be part of a previously uncommitted +transaction. +At this point you are at last ready to modify a buffer, and once +you are have done so you need to call journal_dirty_{meta,}data(). +Or if you've asked for access to a buffer you now know is now longer +required to be pushed back on the device you can call journal_forget() +in much the same way as you might have used bforget() in the past. +</para> + +<para> +A journal_flush() may be called at any time to commit and checkpoint +all your transactions. +</para> + +<para> +Then at umount time , in your put_super() (2.4) or write_super() (2.5) +you can then call journal_destroy() to clean up your in-core journal object. +</para> + + +<para> +Unfortunately there a couple of ways the journal layer can cause a deadlock. +The first thing to note is that each task can only have +a single outstanding transaction at any one time, remember nothing +commits until the outermost journal_stop(). This means +you must complete the transaction at the end of each file/inode/address +etc. operation you perform, so that the journalling system isn't re-entered +on another journal. Since transactions can't be nested/batched +across differing journals, and another filesystem other than +yours (say ext3) may be modified in a later syscall. +</para> + +<para> +The second case to bear in mind is that journal_start() can +block if there isn't enough space in the journal for your transaction +(based on the passed nblocks param) - when it blocks it merely(!) needs to +wait for transactions to complete and be committed from other tasks, +so essentially we are waiting for journal_stop(). So to avoid +deadlocks you must treat journal_start/stop() as if they +were semaphores and include them in your semaphore ordering rules to prevent +deadlocks. Note that journal_extend() has similar blocking behaviour to +journal_start() so you can deadlock here just as easily as on journal_start(). +</para> + +<para> +Try to reserve the right number of blocks the first time. ;-). 
This will be the maximum number of blocks you are going to touch in this
+transaction. I advise having a look at at least ext3_jbd.h to see the
+basis on which ext3 makes these decisions.
+</para>
+
+<para>
+Another wriggle to watch out for is your on-disk block allocation strategy.
+Why? Because, if you undo a delete, you need to ensure you haven't reused any
+of the freed blocks in a later transaction. One simple way of doing this
+is to make sure any blocks you allocate only have checkpointed transactions
+listed against them. Ext3 does this in ext3_test_allocatable().
+</para>
+
+<para>
+Locking is also provided through journal_{un,}lock_updates();
+ext3 uses this when it wants a window with a clean and stable fs for a
+moment, eg:
+</para>
+
+<programlisting>
+
+	journal_lock_updates()   // stop new stuff happening..
+	journal_flush()          // checkpoint everything.
+	..do stuff on stable fs
+	journal_unlock_updates() // carry on with filesystem use.
+</programlisting>
+
+<para>
+The opportunities for abuse and DOS attacks with this should be obvious,
+if you allow unprivileged userspace to trigger codepaths containing these
+calls.
+</para>
+
+<para>
+A new feature of jbd since 2.5.25 is commit callbacks. With the new
+journal_callback_set() function you can now ask the journalling layer
+to call you back when the transaction is finally committed to disk, so that
+you can do some of your own management. The key to this is the journal_callback
+struct; this maintains the internal callback information, but you can
+extend it like this:
+</para>
+<programlisting>
+	struct myfs_callback_s {
+		/* Data structure element required by jbd.. */
+		struct journal_callback for_jbd;
+		/* Stuff for myfs allocated together. */
+		myfs_inode *i_commited;
+	};
+</programlisting>
+
+<para>
+This would be useful if you needed to know when data was committed to a
+particular inode.
+</para>
+
+</sect1>
+
+<sect1>
+<title>Summary</title>
+<para>
+Using the journal is a matter of wrapping the different context changes -
+each mount, each modification (transaction) and each changed buffer -
+to tell the journalling layer about them.
+</para>
+
+<para>
+Here is some pseudo code to give you an idea of how it works, as
+an example.
+</para>
+
+<programlisting>
+  journal_t* my_jrnl = journal_create();
+  journal_init_{dev,inode}(my_jrnl,...)
+  if (clean) journal_wipe();
+  journal_load();
+
+   foreach(transaction) { /*transactions must be
+                            completed before
+                            a syscall returns to
+                            userspace*/
+
+          handle_t * xct=journal_start(my_jrnl);
+          foreach(bh) {
+                journal_get_{create,write,undo}_access(xct,bh);
+                if ( myfs_modify(bh) ) { /* returns true
+                                        if makes changes */
+                           journal_dirty_{meta,}data(xct,bh);
+                } else {
+                           journal_forget(xct,bh);
+                }
+          }
+          journal_stop(xct);
+   }
+   journal_destroy(my_jrnl);
+</programlisting>
+</sect1>
+
+</chapter>
+
+  <chapter id="adt">
+     <title>Data Types</title>
+     <para>
+	The journalling layer uses typedefs to 'hide' the concrete definitions
+	of the structures used. As a client of the JBD layer you can
+	just rely on using the pointer as a magic cookie of some sort.
+
+	Obviously the hiding is not enforced as this is 'C'.
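+	For instance, a client typically does no more than store and pass the
+	pointers around; in a sketch like this, myfs_sb_info, sbi and nblocks
+	are invented names, not part of the JBD API:
+	<programlisting>
+struct myfs_sb_info {
+	journal_t *journal;	/* as returned by journal_init_{dev,inode}() */
+};
+
+handle_t *handle = journal_start(sbi->journal, nblocks);
+/* ... journal_get_write_access(), journal_dirty_metadata(), ... */
+journal_stop(handle);
+	</programlisting>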
+ </para> + <sect1><title>Structures</title> +!Iinclude/linux/jbd.h + </sect1> +</chapter> + + <chapter id="calls"> + <title>Functions</title> + <para> + The functions here are split into two groups those that + affect a journal as a whole, and those which are used to + manage transactions +</para> + <sect1><title>Journal Level</title> +!Efs/jbd/journal.c +!Efs/jbd/recovery.c + </sect1> + <sect1><title>Transasction Level</title> +!Efs/jbd/transaction.c + </sect1> +</chapter> +<chapter> + <title>See also</title> + <para> + <citation> + <ulink url="ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/journal-design.ps.gz"> + Journaling the Linux ext2fs Filesystem,LinuxExpo 98, Stephen Tweedie + </ulink> + </citation> + </para> + <para> + <citation> + <ulink url="http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html"> + Ext3 Journalling FileSystem , OLS 2000, Dr. Stephen Tweedie + </ulink> + </citation> + </para> +</chapter> + +</book> diff --git a/Documentation/DocBook/kernel-api.tmpl b/Documentation/DocBook/kernel-api.tmpl new file mode 100644 index 000000000000..1bd20c860285 --- /dev/null +++ b/Documentation/DocBook/kernel-api.tmpl @@ -0,0 +1,342 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" + "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []> + +<book id="LinuxKernelAPI"> + <bookinfo> + <title>The Linux Kernel API</title> + + <legalnotice> + <para> + This documentation is free software; you can redistribute + it and/or modify it under the terms of the GNU General Public + License as published by the Free Software Foundation; either + version 2 of the License, or (at your option) any later + version. + </para> + + <para> + This program is distributed in the hope that it will be + useful, but WITHOUT ANY WARRANTY; without even the implied + warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + See the GNU General Public License for more details. + </para> + + <para> + You should have received a copy of the GNU General Public + License along with this program; if not, write to the Free + Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, + MA 02111-1307 USA + </para> + + <para> + For more details see the file COPYING in the source + distribution of Linux. + </para> + </legalnotice> + </bookinfo> + +<toc></toc> + + <chapter id="Basics"> + <title>Driver Basics</title> + <sect1><title>Driver Entry and Exit points</title> +!Iinclude/linux/init.h + </sect1> + + <sect1><title>Atomic and pointer manipulation</title> +!Iinclude/asm-i386/atomic.h +!Iinclude/asm-i386/unaligned.h + </sect1> + +<!-- FIXME: + kernel/sched.c has no docs, which stuffs up the sgml. Comment + out until somebody adds docs. KAO + <sect1><title>Delaying, scheduling, and timer routines</title> +X!Ekernel/sched.c + </sect1> +KAO --> + </chapter> + + <chapter id="adt"> + <title>Data Types</title> + <sect1><title>Doubly Linked Lists</title> +!Iinclude/linux/list.h + </sect1> + </chapter> + + <chapter id="libc"> + <title>Basic C Library Functions</title> + + <para> + When writing drivers, you cannot in general use routines which are + from the C Library. Some of the functions have been found generally + useful and they are listed below. The behaviour of these functions + may vary slightly from those defined by ANSI, and these deviations + are noted in the text. 
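+     As a purely illustrative sketch (buf and val are just placeholder
+     names), formatting a value into a fixed-size buffer and parsing it
+     back might look like:
+     <programlisting>
+char buf[16];
+unsigned long val;
+
+/* never overruns the buffer */
+snprintf(buf, sizeof(buf), "%lu", 1234UL);
+
+/* base 0 auto-detects 0x (hex) and leading-0 (octal) prefixes */
+val = simple_strtoul(buf, NULL, 0);
+     </programlisting>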
+ </para> + + <sect1><title>String Conversions</title> +!Ilib/vsprintf.c +!Elib/vsprintf.c + </sect1> + <sect1><title>String Manipulation</title> +!Ilib/string.c +!Elib/string.c + </sect1> + <sect1><title>Bit Operations</title> +!Iinclude/asm-i386/bitops.h + </sect1> + </chapter> + + <chapter id="mm"> + <title>Memory Management in Linux</title> + <sect1><title>The Slab Cache</title> +!Emm/slab.c + </sect1> + <sect1><title>User Space Memory Access</title> +!Iinclude/asm-i386/uaccess.h +!Iarch/i386/lib/usercopy.c + </sect1> + </chapter> + + <chapter id="kfifo"> + <title>FIFO Buffer</title> + <sect1><title>kfifo interface</title> +!Iinclude/linux/kfifo.h +!Ekernel/kfifo.c + </sect1> + </chapter> + + <chapter id="proc"> + <title>The proc filesystem</title> + + <sect1><title>sysctl interface</title> +!Ekernel/sysctl.c + </sect1> + </chapter> + + <chapter id="debugfs"> + <title>The debugfs filesystem</title> + + <sect1><title>debugfs interface</title> +!Efs/debugfs/inode.c +!Efs/debugfs/file.c + </sect1> + </chapter> + + <chapter id="vfs"> + <title>The Linux VFS</title> + <sect1><title>The Directory Cache</title> +!Efs/dcache.c +!Iinclude/linux/dcache.h + </sect1> + <sect1><title>Inode Handling</title> +!Efs/inode.c +!Efs/bad_inode.c + </sect1> + <sect1><title>Registration and Superblocks</title> +!Efs/super.c + </sect1> + <sect1><title>File Locks</title> +!Efs/locks.c +!Ifs/locks.c + </sect1> + </chapter> + + <chapter id="netcore"> + <title>Linux Networking</title> + <sect1><title>Socket Buffer Functions</title> +!Iinclude/linux/skbuff.h +!Enet/core/skbuff.c + </sect1> + <sect1><title>Socket Filter</title> +!Enet/core/filter.c + </sect1> + <sect1><title>Generic Network Statistics</title> +!Iinclude/linux/gen_stats.h +!Enet/core/gen_stats.c +!Enet/core/gen_estimator.c + </sect1> + </chapter> + + <chapter id="netdev"> + <title>Network device support</title> + <sect1><title>Driver Support</title> +!Enet/core/dev.c + </sect1> + <sect1><title>8390 Based Network Cards</title> +!Edrivers/net/8390.c + </sect1> + <sect1><title>Synchronous PPP</title> +!Edrivers/net/wan/syncppp.c + </sect1> + </chapter> + + <chapter id="modload"> + <title>Module Support</title> + <sect1><title>Module Loading</title> +!Ekernel/kmod.c + </sect1> + <sect1><title>Inter Module support</title> + <para> + Refer to the file kernel/module.c for more information. + </para> +<!-- FIXME: Removed for now since no structured comments in source +X!Ekernel/module.c +--> + </sect1> + </chapter> + + <chapter id="hardware"> + <title>Hardware Interfaces</title> + <sect1><title>Interrupt Handling</title> +!Iarch/i386/kernel/irq.c + </sect1> + + <sect1><title>MTRR Handling</title> +!Earch/i386/kernel/cpu/mtrr/main.c + </sect1> + <sect1><title>PCI Support Library</title> +!Edrivers/pci/pci.c + </sect1> + <sect1><title>PCI Hotplug Support Library</title> +!Edrivers/pci/hotplug/pci_hotplug_core.c + </sect1> + <sect1><title>MCA Architecture</title> + <sect2><title>MCA Device Functions</title> + <para> + Refer to the file arch/i386/kernel/mca.c for more information. 
+ </para> +<!-- FIXME: Removed for now since no structured comments in source +X!Earch/i386/kernel/mca.c +--> + </sect2> + <sect2><title>MCA Bus DMA</title> +!Iinclude/asm-i386/mca_dma.h + </sect2> + </sect1> + </chapter> + + <chapter id="devfs"> + <title>The Device File System</title> +!Efs/devfs/base.c + </chapter> + + <chapter id="security"> + <title>Security Framework</title> +!Esecurity/security.c + </chapter> + + <chapter id="pmfuncs"> + <title>Power Management</title> +!Ekernel/power/pm.c + </chapter> + + <chapter id="blkdev"> + <title>Block Devices</title> +!Edrivers/block/ll_rw_blk.c + </chapter> + + <chapter id="miscdev"> + <title>Miscellaneous Devices</title> +!Edrivers/char/misc.c + </chapter> + + <chapter id="viddev"> + <title>Video4Linux</title> +!Edrivers/media/video/videodev.c + </chapter> + + <chapter id="snddev"> + <title>Sound Devices</title> +!Esound/sound_core.c +<!-- FIXME: Removed for now since no structured comments in source +X!Isound/sound_firmware.c +--> + </chapter> + + <chapter id="uart16x50"> + <title>16x50 UART Driver</title> +!Edrivers/serial/serial_core.c +!Edrivers/serial/8250.c + </chapter> + + <chapter id="z85230"> + <title>Z85230 Support Library</title> +!Edrivers/net/wan/z85230.c + </chapter> + + <chapter id="fbdev"> + <title>Frame Buffer Library</title> + + <para> + The frame buffer drivers depend heavily on four data structures. + These structures are declared in include/linux/fb.h. They are + fb_info, fb_var_screeninfo, fb_fix_screeninfo and fb_monospecs. + The last three can be made available to and from userland. + </para> + + <para> + fb_info defines the current state of a particular video card. + Inside fb_info, there exists a fb_ops structure which is a + collection of needed functions to make fbdev and fbcon work. + fb_info is only visible to the kernel. + </para> + + <para> + fb_var_screeninfo is used to describe the features of a video card + that are user defined. With fb_var_screeninfo, things such as + depth and the resolution may be defined. + </para> + + <para> + The next structure is fb_fix_screeninfo. This defines the + properties of a card that are created when a mode is set and can't + be changed otherwise. A good example of this is the start of the + frame buffer memory. This "locks" the address of the frame buffer + memory, so that it cannot be changed or moved. + </para> + + <para> + The last structure is fb_monospecs. In the old API, there was + little importance for fb_monospecs. This allowed for forbidden things + such as setting a mode of 800x600 on a fix frequency monitor. With + the new API, fb_monospecs prevents such things, and if used + correctly, can prevent a monitor from being cooked. fb_monospecs + will not be useful until kernels 2.5.x. + </para> + + <sect1><title>Frame Buffer Memory</title> +!Edrivers/video/fbmem.c + </sect1> + <sect1><title>Frame Buffer Console</title> +!Edrivers/video/console/fbcon.c + </sect1> + <sect1><title>Frame Buffer Colormap</title> +!Edrivers/video/fbcmap.c + </sect1> +<!-- FIXME: + drivers/video/fbgen.c has no docs, which stuffs up the sgml. Comment + out until somebody adds docs. 
KAO + <sect1><title>Frame Buffer Generic Functions</title> +X!Idrivers/video/fbgen.c + </sect1> +KAO --> + <sect1><title>Frame Buffer Video Mode Database</title> +!Idrivers/video/modedb.c +!Edrivers/video/modedb.c + </sect1> + <sect1><title>Frame Buffer Macintosh Video Mode Database</title> +!Idrivers/video/macmodes.c + </sect1> + <sect1><title>Frame Buffer Fonts</title> + <para> + Refer to the file drivers/video/console/fonts.c for more information. + </para> +<!-- FIXME: Removed for now since no structured comments in source +X!Idrivers/video/console/fonts.c +--> + </sect1> + </chapter> +</book> diff --git a/Documentation/DocBook/kernel-hacking.tmpl b/Documentation/DocBook/kernel-hacking.tmpl new file mode 100644 index 000000000000..49a9ef82d575 --- /dev/null +++ b/Documentation/DocBook/kernel-hacking.tmpl @@ -0,0 +1,1349 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" + "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []> + +<book id="lk-hacking-guide"> + <bookinfo> + <title>Unreliable Guide To Hacking The Linux Kernel</title> + + <authorgroup> + <author> + <firstname>Paul</firstname> + <othername>Rusty</othername> + <surname>Russell</surname> + <affiliation> + <address> + <email>rusty@rustcorp.com.au</email> + </address> + </affiliation> + </author> + </authorgroup> + + <copyright> + <year>2001</year> + <holder>Rusty Russell</holder> + </copyright> + + <legalnotice> + <para> + This documentation is free software; you can redistribute + it and/or modify it under the terms of the GNU General Public + License as published by the Free Software Foundation; either + version 2 of the License, or (at your option) any later + version. + </para> + + <para> + This program is distributed in the hope that it will be + useful, but WITHOUT ANY WARRANTY; without even the implied + warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + See the GNU General Public License for more details. + </para> + + <para> + You should have received a copy of the GNU General Public + License along with this program; if not, write to the Free + Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, + MA 02111-1307 USA + </para> + + <para> + For more details see the file COPYING in the source + distribution of Linux. + </para> + </legalnotice> + + <releaseinfo> + This is the first release of this document as part of the kernel tarball. + </releaseinfo> + + </bookinfo> + + <toc></toc> + + <chapter id="introduction"> + <title>Introduction</title> + <para> + Welcome, gentle reader, to Rusty's Unreliable Guide to Linux + Kernel Hacking. This document describes the common routines and + general requirements for kernel code: its goal is to serve as a + primer for Linux kernel development for experienced C + programmers. I avoid implementation details: that's what the + code is for, and I ignore whole tracts of useful routines. + </para> + <para> + Before you read this, please understand that I never wanted to + write this document, being grossly under-qualified, but I always + wanted to read it, and this was the only way. I hope it will + grow into a compendium of best practice, common starting points + and random information. 
+ </para> + </chapter> + + <chapter id="basic-players"> + <title>The Players</title> + + <para> + At any time each of the CPUs in a system can be: + </para> + + <itemizedlist> + <listitem> + <para> + not associated with any process, serving a hardware interrupt; + </para> + </listitem> + + <listitem> + <para> + not associated with any process, serving a softirq, tasklet or bh; + </para> + </listitem> + + <listitem> + <para> + running in kernel space, associated with a process; + </para> + </listitem> + + <listitem> + <para> + running a process in user space. + </para> + </listitem> + </itemizedlist> + + <para> + There is a strict ordering between these: other than the last + category (userspace) each can only be pre-empted by those above. + For example, while a softirq is running on a CPU, no other + softirq will pre-empt it, but a hardware interrupt can. However, + any other CPUs in the system execute independently. + </para> + + <para> + We'll see a number of ways that the user context can block + interrupts, to become truly non-preemptable. + </para> + + <sect1 id="basics-usercontext"> + <title>User Context</title> + + <para> + User context is when you are coming in from a system call or + other trap: you can sleep, and you own the CPU (except for + interrupts) until you call <function>schedule()</function>. + In other words, user context (unlike userspace) is not pre-emptable. + </para> + + <note> + <para> + You are always in user context on module load and unload, + and on operations on the block device layer. + </para> + </note> + + <para> + In user context, the <varname>current</varname> pointer (indicating + the task we are currently executing) is valid, and + <function>in_interrupt()</function> + (<filename>include/linux/interrupt.h</filename>) is <returnvalue>false + </returnvalue>. + </para> + + <caution> + <para> + Beware that if you have interrupts or bottom halves disabled + (see below), <function>in_interrupt()</function> will return a + false positive. + </para> + </caution> + </sect1> + + <sect1 id="basics-hardirqs"> + <title>Hardware Interrupts (Hard IRQs)</title> + + <para> + Timer ticks, <hardware>network cards</hardware> and + <hardware>keyboard</hardware> are examples of real + hardware which produce interrupts at any time. The kernel runs + interrupt handlers, which services the hardware. The kernel + guarantees that this handler is never re-entered: if another + interrupt arrives, it is queued (or dropped). Because it + disables interrupts, this handler has to be fast: frequently it + simply acknowledges the interrupt, marks a `software interrupt' + for execution and exits. + </para> + + <para> + You can tell you are in a hardware interrupt, because + <function>in_irq()</function> returns <returnvalue>true</returnvalue>. + </para> + <caution> + <para> + Beware that this will return a false positive if interrupts are disabled + (see below). + </para> + </caution> + </sect1> + + <sect1 id="basics-softirqs"> + <title>Software Interrupt Context: Bottom Halves, Tasklets, softirqs</title> + + <para> + Whenever a system call is about to return to userspace, or a + hardware interrupt handler exits, any `software interrupts' + which are marked pending (usually by hardware interrupts) are + run (<filename>kernel/softirq.c</filename>). + </para> + + <para> + Much of the real interrupt handling work is done here. Early in + the transition to <acronym>SMP</acronym>, there were only `bottom + halves' (BHs), which didn't take advantage of multiple CPUs. 
Shortly + after we switched from wind-up computers made of match-sticks and snot, + we abandoned this limitation. + </para> + + <para> + <filename class="headerfile">include/linux/interrupt.h</filename> lists the + different BH's. No matter how many CPUs you have, no two BHs will run at + the same time. This made the transition to SMP simpler, but sucks hard for + scalable performance. A very important bottom half is the timer + BH (<filename class="headerfile">include/linux/timer.h</filename>): you + can register to have it call functions for you in a given length of time. + </para> + + <para> + 2.3.43 introduced softirqs, and re-implemented the (now + deprecated) BHs underneath them. Softirqs are fully-SMP + versions of BHs: they can run on as many CPUs at once as + required. This means they need to deal with any races in shared + data using their own locks. A bitmask is used to keep track of + which are enabled, so the 32 available softirqs should not be + used up lightly. (<emphasis>Yes</emphasis>, people will + notice). + </para> + + <para> + tasklets (<filename class="headerfile">include/linux/interrupt.h</filename>) + are like softirqs, except they are dynamically-registrable (meaning you + can have as many as you want), and they also guarantee that any tasklet + will only run on one CPU at any time, although different tasklets can + run simultaneously (unlike different BHs). + </para> + <caution> + <para> + The name `tasklet' is misleading: they have nothing to do with `tasks', + and probably more to do with some bad vodka Alexey Kuznetsov had at the + time. + </para> + </caution> + + <para> + You can tell you are in a softirq (or bottom half, or tasklet) + using the <function>in_softirq()</function> macro + (<filename class="headerfile">include/linux/interrupt.h</filename>). + </para> + <caution> + <para> + Beware that this will return a false positive if a bh lock (see below) + is held. + </para> + </caution> + </sect1> + </chapter> + + <chapter id="basic-rules"> + <title>Some Basic Rules</title> + + <variablelist> + <varlistentry> + <term>No memory protection</term> + <listitem> + <para> + If you corrupt memory, whether in user context or + interrupt context, the whole machine will crash. Are you + sure you can't do what you want in userspace? + </para> + </listitem> + </varlistentry> + + <varlistentry> + <term>No floating point or <acronym>MMX</acronym></term> + <listitem> + <para> + The <acronym>FPU</acronym> context is not saved; even in user + context the <acronym>FPU</acronym> state probably won't + correspond with the current process: you would mess with some + user process' <acronym>FPU</acronym> state. If you really want + to do this, you would have to explicitly save/restore the full + <acronym>FPU</acronym> state (and avoid context switches). It + is generally a bad idea; use fixed point arithmetic first. + </para> + </listitem> + </varlistentry> + + <varlistentry> + <term>A rigid stack limit</term> + <listitem> + <para> + The kernel stack is about 6K in 2.2 (for most + architectures: it's about 14K on the Alpha), and shared + with interrupts so you can't use it all. Avoid deep + recursion and huge local arrays on the stack (allocate + them dynamically instead). + </para> + </listitem> + </varlistentry> + + <varlistentry> + <term>The Linux kernel is portable</term> + <listitem> + <para> + Let's keep it that way. Your code should be 64-bit clean, + and endian-independent. You should also minimize CPU + specific stuff, e.g. 
inline assembly should be cleanly + encapsulated and minimized to ease porting. Generally it + should be restricted to the architecture-dependent part of + the kernel tree. + </para> + </listitem> + </varlistentry> + </variablelist> + </chapter> + + <chapter id="ioctls"> + <title>ioctls: Not writing a new system call</title> + + <para> + A system call generally looks like this + </para> + + <programlisting> +asmlinkage long sys_mycall(int arg) +{ + return 0; +} + </programlisting> + + <para> + First, in most cases you don't want to create a new system call. + You create a character device and implement an appropriate ioctl + for it. This is much more flexible than system calls, doesn't have + to be entered in every architecture's + <filename class="headerfile">include/asm/unistd.h</filename> and + <filename>arch/kernel/entry.S</filename> file, and is much more + likely to be accepted by Linus. + </para> + + <para> + If all your routine does is read or write some parameter, consider + implementing a <function>sysctl</function> interface instead. + </para> + + <para> + Inside the ioctl you're in user context to a process. When a + error occurs you return a negated errno (see + <filename class="headerfile">include/linux/errno.h</filename>), + otherwise you return <returnvalue>0</returnvalue>. + </para> + + <para> + After you slept you should check if a signal occurred: the + Unix/Linux way of handling signals is to temporarily exit the + system call with the <constant>-ERESTARTSYS</constant> error. The + system call entry code will switch back to user context, process + the signal handler and then your system call will be restarted + (unless the user disabled that). So you should be prepared to + process the restart, e.g. if you're in the middle of manipulating + some data structure. + </para> + + <programlisting> +if (signal_pending()) + return -ERESTARTSYS; + </programlisting> + + <para> + If you're doing longer computations: first think userspace. If you + <emphasis>really</emphasis> want to do it in kernel you should + regularly check if you need to give up the CPU (remember there is + cooperative multitasking per CPU). Idiom: + </para> + + <programlisting> +cond_resched(); /* Will sleep */ + </programlisting> + + <para> + A short note on interface design: the UNIX system call motto is + "Provide mechanism not policy". + </para> + </chapter> + + <chapter id="deadlock-recipes"> + <title>Recipes for Deadlock</title> + + <para> + You cannot call any routines which may sleep, unless: + </para> + <itemizedlist> + <listitem> + <para> + You are in user context. + </para> + </listitem> + + <listitem> + <para> + You do not own any spinlocks. + </para> + </listitem> + + <listitem> + <para> + You have interrupts enabled (actually, Andi Kleen says + that the scheduling code will enable them for you, but + that's probably not what you wanted). + </para> + </listitem> + </itemizedlist> + + <para> + Note that some functions may sleep implicitly: common ones are + the user space access functions (*_user) and memory allocation + functions without <symbol>GFP_ATOMIC</symbol>. + </para> + + <para> + You will eventually lock up your box if you break these rules. + </para> + + <para> + Really. 
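+   For instance (my_lock, buf and size are only placeholder names), this is
+   the classic way to get it wrong, and the fix:
+   <programlisting>
+spin_lock(&my_lock);
+buf = kmalloc(size, GFP_KERNEL);  /* BAD: kmalloc may sleep while we hold a spinlock */
+spin_unlock(&my_lock);
+
+spin_lock(&my_lock);
+buf = kmalloc(size, GFP_ATOMIC);  /* OK: GFP_ATOMIC never sleeps */
+spin_unlock(&my_lock);
+   </programlisting>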
+ </para> + </chapter> + + <chapter id="common-routines"> + <title>Common Routines</title> + + <sect1 id="routines-printk"> + <title> + <function>printk()</function> + <filename class="headerfile">include/linux/kernel.h</filename> + </title> + + <para> + <function>printk()</function> feeds kernel messages to the + console, dmesg, and the syslog daemon. It is useful for debugging + and reporting errors, and can be used inside interrupt context, + but use with caution: a machine which has its console flooded with + printk messages is unusable. It uses a format string mostly + compatible with ANSI C printf, and C string concatenation to give + it a first "priority" argument: + </para> + + <programlisting> +printk(KERN_INFO "i = %u\n", i); + </programlisting> + + <para> + See <filename class="headerfile">include/linux/kernel.h</filename>; + for other KERN_ values; these are interpreted by syslog as the + level. Special case: for printing an IP address use + </para> + + <programlisting> +__u32 ipaddress; +printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); + </programlisting> + + <para> + <function>printk()</function> internally uses a 1K buffer and does + not catch overruns. Make sure that will be enough. + </para> + + <note> + <para> + You will know when you are a real kernel hacker + when you start typoing printf as printk in your user programs :) + </para> + </note> + + <!--- From the Lions book reader department --> + + <note> + <para> + Another sidenote: the original Unix Version 6 sources had a + comment on top of its printf function: "Printf should not be + used for chit-chat". You should follow that advice. + </para> + </note> + </sect1> + + <sect1 id="routines-copy"> + <title> + <function>copy_[to/from]_user()</function> + / + <function>get_user()</function> + / + <function>put_user()</function> + <filename class="headerfile">include/asm/uaccess.h</filename> + </title> + + <para> + <emphasis>[SLEEPS]</emphasis> + </para> + + <para> + <function>put_user()</function> and <function>get_user()</function> + are used to get and put single values (such as an int, char, or + long) from and to userspace. A pointer into userspace should + never be simply dereferenced: data should be copied using these + routines. Both return <constant>-EFAULT</constant> or 0. + </para> + <para> + <function>copy_to_user()</function> and + <function>copy_from_user()</function> are more general: they copy + an arbitrary amount of data to and from userspace. + <caution> + <para> + Unlike <function>put_user()</function> and + <function>get_user()</function>, they return the amount of + uncopied data (ie. <returnvalue>0</returnvalue> still means + success). + </para> + </caution> + [Yes, this moronic interface makes me cringe. Please submit a + patch and become my hero --RR.] + </para> + <para> + The functions may sleep implicitly. This should never be called + outside user context (it makes no sense), with interrupts + disabled, or a spinlock held. + </para> + </sect1> + + <sect1 id="routines-kmalloc"> + <title><function>kmalloc()</function>/<function>kfree()</function> + <filename class="headerfile">include/linux/slab.h</filename></title> + + <para> + <emphasis>[MAY SLEEP: SEE BELOW]</emphasis> + </para> + + <para> + These routines are used to dynamically request pointer-aligned + chunks of memory, like malloc and free do in userspace, but + <function>kmalloc()</function> takes an extra flag word. 
+ Important values: + </para> + + <variablelist> + <varlistentry> + <term> + <constant> + GFP_KERNEL + </constant> + </term> + <listitem> + <para> + May sleep and swap to free memory. Only allowed in user + context, but is the most reliable way to allocate memory. + </para> + </listitem> + </varlistentry> + + <varlistentry> + <term> + <constant> + GFP_ATOMIC + </constant> + </term> + <listitem> + <para> + Don't sleep. Less reliable than <constant>GFP_KERNEL</constant>, + but may be called from interrupt context. You should + <emphasis>really</emphasis> have a good out-of-memory + error-handling strategy. + </para> + </listitem> + </varlistentry> + + <varlistentry> + <term> + <constant> + GFP_DMA + </constant> + </term> + <listitem> + <para> + Allocate ISA DMA lower than 16MB. If you don't know what that + is you don't need it. Very unreliable. + </para> + </listitem> + </varlistentry> + </variablelist> + + <para> + If you see a <errorname>kmem_grow: Called nonatomically from int + </errorname> warning message you called a memory allocation function + from interrupt context without <constant>GFP_ATOMIC</constant>. + You should really fix that. Run, don't walk. + </para> + + <para> + If you are allocating at least <constant>PAGE_SIZE</constant> + (<filename class="headerfile">include/asm/page.h</filename>) bytes, + consider using <function>__get_free_pages()</function> + + (<filename class="headerfile">include/linux/mm.h</filename>). It + takes an order argument (0 for page sized, 1 for double page, 2 + for four pages etc.) and the same memory priority flag word as + above. + </para> + + <para> + If you are allocating more than a page worth of bytes you can use + <function>vmalloc()</function>. It'll allocate virtual memory in + the kernel map. This block is not contiguous in physical memory, + but the <acronym>MMU</acronym> makes it look like it is for you + (so it'll only look contiguous to the CPUs, not to external device + drivers). If you really need large physically contiguous memory + for some weird device, you have a problem: it is poorly supported + in Linux because after some time memory fragmentation in a running + kernel makes it hard. The best way is to allocate the block early + in the boot process via the <function>alloc_bootmem()</function> + routine. + </para> + + <para> + Before inventing your own cache of often-used objects consider + using a slab cache in + <filename class="headerfile">include/linux/slab.h</filename> + </para> + </sect1> + + <sect1 id="routines-current"> + <title><function>current</function> + <filename class="headerfile">include/asm/current.h</filename></title> + + <para> + This global variable (really a macro) contains a pointer to + the current task structure, so is only valid in user context. + For example, when a process makes a system call, this will + point to the task structure of the calling process. It is + <emphasis>not NULL</emphasis> in interrupt context. + </para> + </sect1> + + <sect1 id="routines-udelay"> + <title><function>udelay()</function>/<function>mdelay()</function> + <filename class="headerfile">include/asm/delay.h</filename> + <filename class="headerfile">include/linux/delay.h</filename> + </title> + + <para> + The <function>udelay()</function> function can be used for small pauses. + Do not use large values with <function>udelay()</function> as you risk + overflow - the helper function <function>mdelay()</function> is useful + here, or even consider <function>schedule_timeout()</function>. 
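+     For example (the values here are arbitrary):
+     <programlisting>
+udelay(50);                     /* busy-waits for about 50 microseconds */
+mdelay(5);                      /* busy-waits for about 5 milliseconds */
+
+/* for longer pauses in user context, give up the CPU instead */
+set_current_state(TASK_INTERRUPTIBLE);
+schedule_timeout(HZ / 10);      /* roughly 100 milliseconds */
+     </programlisting>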
+ </para> + </sect1> + + <sect1 id="routines-endian"> + <title><function>cpu_to_be32()</function>/<function>be32_to_cpu()</function>/<function>cpu_to_le32()</function>/<function>le32_to_cpu()</function> + <filename class="headerfile">include/asm/byteorder.h</filename> + </title> + + <para> + The <function>cpu_to_be32()</function> family (where the "32" can + be replaced by 64 or 16, and the "be" can be replaced by "le") are + the general way to do endian conversions in the kernel: they + return the converted value. All variations supply the reverse as + well: <function>be32_to_cpu()</function>, etc. + </para> + + <para> + There are two major variations of these functions: the pointer + variation, such as <function>cpu_to_be32p()</function>, which take + a pointer to the given type, and return the converted value. The + other variation is the "in-situ" family, such as + <function>cpu_to_be32s()</function>, which convert value referred + to by the pointer, and return void. + </para> + </sect1> + + <sect1 id="routines-local-irqs"> + <title><function>local_irq_save()</function>/<function>local_irq_restore()</function> + <filename class="headerfile">include/asm/system.h</filename> + </title> + + <para> + These routines disable hard interrupts on the local CPU, and + restore them. They are reentrant; saving the previous state in + their one <varname>unsigned long flags</varname> argument. If you + know that interrupts are enabled, you can simply use + <function>local_irq_disable()</function> and + <function>local_irq_enable()</function>. + </para> + </sect1> + + <sect1 id="routines-softirqs"> + <title><function>local_bh_disable()</function>/<function>local_bh_enable()</function> + <filename class="headerfile">include/linux/interrupt.h</filename></title> + + <para> + These routines disable soft interrupts on the local CPU, and + restore them. They are reentrant; if soft interrupts were + disabled before, they will still be disabled after this pair + of functions has been called. They prevent softirqs, tasklets + and bottom halves from running on the current CPU. + </para> + </sect1> + + <sect1 id="routines-processorids"> + <title><function>smp_processor_id</function>() + <filename class="headerfile">include/asm/smp.h</filename></title> + + <para> + <function>smp_processor_id()</function> returns the current + processor number, between 0 and <symbol>NR_CPUS</symbol> (the + maximum number of CPUs supported by Linux, currently 32). These + values are not necessarily continuous. + </para> + </sect1> + + <sect1 id="routines-init"> + <title><type>__init</type>/<type>__exit</type>/<type>__initdata</type> + <filename class="headerfile">include/linux/init.h</filename></title> + + <para> + After boot, the kernel frees up a special section; functions + marked with <type>__init</type> and data structures marked with + <type>__initdata</type> are dropped after boot is complete (within + modules this directive is currently ignored). <type>__exit</type> + is used to declare a function which is only required on exit: the + function will be dropped if this file is not compiled as a module. + See the header file for use. Note that it makes no sense for a function + marked with <type>__init</type> to be exported to modules with + <function>EXPORT_SYMBOL()</function> - this will break. + </para> + <para> + Static data structures marked as <type>__initdata</type> must be initialised + (as opposed to ordinary static data which is zeroed BSS) and cannot be + <type>const</type>. 
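+     A small sketch (my_debug and my_setup are made-up names):
+     <programlisting>
+static int my_debug __initdata = 1;    /* discarded after boot */
+
+static int __init my_setup(void)       /* likewise discarded */
+{
+        if (my_debug)
+                printk(KERN_INFO "my_setup: initialising\n");
+        return 0;
+}
+     </programlisting>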
+ </para> + + </sect1> + + <sect1 id="routines-init-again"> + <title><function>__initcall()</function>/<function>module_init()</function> + <filename class="headerfile">include/linux/init.h</filename></title> + <para> + Many parts of the kernel are well served as a module + (dynamically-loadable parts of the kernel). Using the + <function>module_init()</function> and + <function>module_exit()</function> macros it is easy to write code + without #ifdefs which can operate both as a module or built into + the kernel. + </para> + + <para> + The <function>module_init()</function> macro defines which + function is to be called at module insertion time (if the file is + compiled as a module), or at boot time: if the file is not + compiled as a module the <function>module_init()</function> macro + becomes equivalent to <function>__initcall()</function>, which + through linker magic ensures that the function is called on boot. + </para> + + <para> + The function can return a negative error number to cause + module loading to fail (unfortunately, this has no effect if + the module is compiled into the kernel). For modules, this is + called in user context, with interrupts enabled, and the + kernel lock held, so it can sleep. + </para> + </sect1> + + <sect1 id="routines-moduleexit"> + <title> <function>module_exit()</function> + <filename class="headerfile">include/linux/init.h</filename> </title> + + <para> + This macro defines the function to be called at module removal + time (or never, in the case of the file compiled into the + kernel). It will only be called if the module usage count has + reached zero. This function can also sleep, but cannot fail: + everything must be cleaned up by the time it returns. + </para> + </sect1> + + <!-- add info on new-style module refcounting here --> + </chapter> + + <chapter id="queues"> + <title>Wait Queues + <filename class="headerfile">include/linux/wait.h</filename> + </title> + <para> + <emphasis>[SLEEPS]</emphasis> + </para> + + <para> + A wait queue is used to wait for someone to wake you up when a + certain condition is true. They must be used carefully to ensure + there is no race condition. You declare a + <type>wait_queue_head_t</type>, and then processes which want to + wait for that condition declare a <type>wait_queue_t</type> + referring to themselves, and place that in the queue. + </para> + + <sect1 id="queue-declaring"> + <title>Declaring</title> + + <para> + You declare a <type>wait_queue_head_t</type> using the + <function>DECLARE_WAIT_QUEUE_HEAD()</function> macro, or using the + <function>init_waitqueue_head()</function> routine in your + initialization code. + </para> + </sect1> + + <sect1 id="queue-waitqueue"> + <title>Queuing</title> + + <para> + Placing yourself in the waitqueue is fairly complex, because you + must put yourself in the queue before checking the condition. + There is a macro to do this: + <function>wait_event_interruptible()</function> + + <filename class="headerfile">include/linux/sched.h</filename> The + first argument is the wait queue head, and the second is an + expression which is evaluated; the macro returns + <returnvalue>0</returnvalue> when this expression is true, or + <returnvalue>-ERESTARTSYS</returnvalue> if a signal is received. + The <function>wait_event()</function> version ignores signals. 
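+     Putting this together, a sketch (my_queue and my_flag are made-up
+     names) looks like:
+     <programlisting>
+static DECLARE_WAIT_QUEUE_HEAD(my_queue);
+static int my_flag;
+
+/* the sleeper: returns -ERESTARTSYS if a signal arrived */
+if (wait_event_interruptible(my_queue, my_flag != 0))
+        return -ERESTARTSYS;
+
+/* the waker, elsewhere */
+my_flag = 1;
+wake_up(&my_queue);
+     </programlisting>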
+ </para> + <para> + Do not use the <function>sleep_on()</function> function family - + it is very easy to accidentally introduce races; almost certainly + one of the <function>wait_event()</function> family will do, or a + loop around <function>schedule_timeout()</function>. If you choose + to loop around <function>schedule_timeout()</function> remember + you must set the task state (with + <function>set_current_state()</function>) on each iteration to avoid + busy-looping. + </para> + + </sect1> + + <sect1 id="queue-waking"> + <title>Waking Up Queued Tasks</title> + + <para> + Call <function>wake_up()</function> + + <filename class="headerfile">include/linux/sched.h</filename>;, + which will wake up every process in the queue. The exception is + if one has <constant>TASK_EXCLUSIVE</constant> set, in which case + the remainder of the queue will not be woken. + </para> + </sect1> + </chapter> + + <chapter id="atomic-ops"> + <title>Atomic Operations</title> + + <para> + Certain operations are guaranteed atomic on all platforms. The + first class of operations work on <type>atomic_t</type> + + <filename class="headerfile">include/asm/atomic.h</filename>; this + contains a signed integer (at least 24 bits long), and you must use + these functions to manipulate or read atomic_t variables. + <function>atomic_read()</function> and + <function>atomic_set()</function> get and set the counter, + <function>atomic_add()</function>, + <function>atomic_sub()</function>, + <function>atomic_inc()</function>, + <function>atomic_dec()</function>, and + <function>atomic_dec_and_test()</function> (returns + <returnvalue>true</returnvalue> if it was decremented to zero). + </para> + + <para> + Yes. It returns <returnvalue>true</returnvalue> (i.e. != 0) if the + atomic variable is zero. + </para> + + <para> + Note that these functions are slower than normal arithmetic, and + so should not be used unnecessarily. On some platforms they + are much slower, like 32-bit Sparc where they use a spinlock. + </para> + + <para> + The second class of atomic operations is atomic bit operations on a + <type>long</type>, defined in + + <filename class="headerfile">include/linux/bitops.h</filename>. These + operations generally take a pointer to the bit pattern, and a bit + number: 0 is the least significant bit. + <function>set_bit()</function>, <function>clear_bit()</function> + and <function>change_bit()</function> set, clear, and flip the + given bit. <function>test_and_set_bit()</function>, + <function>test_and_clear_bit()</function> and + <function>test_and_change_bit()</function> do the same thing, + except return true if the bit was previously set; these are + particularly useful for very simple locking. + </para> + + <para> + It is possible to call these operations with bit indices greater + than BITS_PER_LONG. The resulting behavior is strange on big-endian + platforms though so it is a good idea not to do this. + </para> + + <para> + Note that the order of bits depends on the architecture, and in + particular, the bitfield passed to these operations must be at + least as large as a <type>long</type>. + </para> + </chapter> + + <chapter id="symbols"> + <title>Symbols</title> + + <para> + Within the kernel proper, the normal linking rules apply + (ie. unless a symbol is declared to be file scope with the + <type>static</type> keyword, it can be used anywhere in the + kernel). However, for modules, a special exported symbol table is + kept which limits the entry points to the kernel proper. 
Modules + can also export symbols. + </para> + + <sect1 id="sym-exportsymbols"> + <title><function>EXPORT_SYMBOL()</function> + <filename class="headerfile">include/linux/module.h</filename></title> + + <para> + This is the classic method of exporting a symbol, and it works + for both modules and non-modules. In the kernel all these + declarations are often bundled into a single file to help + genksyms (which searches source files for these declarations). + See the comment on genksyms and Makefiles below. + </para> + </sect1> + + <sect1 id="sym-exportsymbols-gpl"> + <title><function>EXPORT_SYMBOL_GPL()</function> + <filename class="headerfile">include/linux/module.h</filename></title> + + <para> + Similar to <function>EXPORT_SYMBOL()</function> except that the + symbols exported by <function>EXPORT_SYMBOL_GPL()</function> can + only be seen by modules with a + <function>MODULE_LICENSE()</function> that specifies a GPL + compatible license. + </para> + </sect1> + </chapter> + + <chapter id="conventions"> + <title>Routines and Conventions</title> + + <sect1 id="conventions-doublelinkedlist"> + <title>Double-linked lists + <filename class="headerfile">include/linux/list.h</filename></title> + + <para> + There are three sets of linked-list routines in the kernel + headers, but this one seems to be winning out (and Linus has + used it). If you don't have some particular pressing need for + a single list, it's a good choice. In fact, I don't care + whether it's a good choice or not, just use it so we can get + rid of the others. + </para> + </sect1> + + <sect1 id="convention-returns"> + <title>Return Conventions</title> + + <para> + For code called in user context, it's very common to defy C + convention, and return <returnvalue>0</returnvalue> for success, + and a negative error number + (eg. <returnvalue>-EFAULT</returnvalue>) for failure. This can be + unintuitive at first, but it's fairly widespread in the networking + code, for example. + </para> + + <para> + The filesystem code uses <function>ERR_PTR()</function> + + <filename class="headerfile">include/linux/fs.h</filename>; to + encode a negative error number into a pointer, and + <function>IS_ERR()</function> and <function>PTR_ERR()</function> + to get it back out again: avoids a separate pointer parameter for + the error number. Icky, but in a good way. + </para> + </sect1> + + <sect1 id="conventions-borkedcompile"> + <title>Breaking Compilation</title> + + <para> + Linus and the other developers sometimes change function or + structure names in development kernels; this is not done just to + keep everyone on their toes: it reflects a fundamental change + (eg. can no longer be called with interrupts on, or does extra + checks, or doesn't do checks which were caught before). Usually + this is accompanied by a fairly complete note to the linux-kernel + mailing list; search the archive. Simply doing a global replace + on the file usually makes things <emphasis>worse</emphasis>. + </para> + </sect1> + + <sect1 id="conventions-initialising"> + <title>Initializing structure members</title> + + <para> + The preferred method of initializing structures is to use + designated initialisers, as defined by ISO C99, eg: + </para> + <programlisting> +static struct block_device_operations opt_fops = { + .open = opt_open, + .release = opt_release, + .ioctl = opt_ioctl, + .check_media_change = opt_media_change, +}; + </programlisting> + <para> + This makes it easy to grep for, and makes it clear which + structure fields are set. 
You should do this because it looks + cool. + </para> + </sect1> + + <sect1 id="conventions-gnu-extns"> + <title>GNU Extensions</title> + + <para> + GNU Extensions are explicitly allowed in the Linux kernel. + Note that some of the more complex ones are not very well + supported, due to lack of general use, but the following are + considered standard (see the GCC info page section "C + Extensions" for more details - Yes, really the info page, the + man page is only a short summary of the stuff in info): + </para> + <itemizedlist> + <listitem> + <para> + Inline functions + </para> + </listitem> + <listitem> + <para> + Statement expressions (ie. the ({ and }) constructs). + </para> + </listitem> + <listitem> + <para> + Declaring attributes of a function / variable / type + (__attribute__) + </para> + </listitem> + <listitem> + <para> + typeof + </para> + </listitem> + <listitem> + <para> + Zero length arrays + </para> + </listitem> + <listitem> + <para> + Macro varargs + </para> + </listitem> + <listitem> + <para> + Arithmetic on void pointers + </para> + </listitem> + <listitem> + <para> + Non-Constant initializers + </para> + </listitem> + <listitem> + <para> + Assembler Instructions (not outside arch/ and include/asm/) + </para> + </listitem> + <listitem> + <para> + Function names as strings (__FUNCTION__) + </para> + </listitem> + <listitem> + <para> + __builtin_constant_p() + </para> + </listitem> + </itemizedlist> + + <para> + Be wary when using long long in the kernel, the code gcc generates for + it is horrible and worse: division and multiplication does not work + on i386 because the GCC runtime functions for it are missing from + the kernel environment. + </para> + + <!-- FIXME: add a note about ANSI aliasing cleanness --> + </sect1> + + <sect1 id="conventions-cplusplus"> + <title>C++</title> + + <para> + Using C++ in the kernel is usually a bad idea, because the + kernel does not provide the necessary runtime environment + and the include files are not tested for it. It is still + possible, but not recommended. If you really want to do + this, forget about exceptions at least. + </para> + </sect1> + + <sect1 id="conventions-ifdef"> + <title>#if</title> + + <para> + It is generally considered cleaner to use macros in header files + (or at the top of .c files) to abstract away functions rather than + using `#if' pre-processor statements throughout the source code. + </para> + </sect1> + </chapter> + + <chapter id="submitting"> + <title>Putting Your Stuff in the Kernel</title> + + <para> + In order to get your stuff into shape for official inclusion, or + even to make a neat patch, there's administrative work to be + done: + </para> + <itemizedlist> + <listitem> + <para> + Figure out whose pond you've been pissing in. Look at the top of + the source files, inside the <filename>MAINTAINERS</filename> + file, and last of all in the <filename>CREDITS</filename> file. + You should coordinate with this person to make sure you're not + duplicating effort, or trying something that's already been + rejected. + </para> + + <para> + Make sure you put your name and EMail address at the top of + any files you create or mangle significantly. This is the + first place people will look when they find a bug, or when + <emphasis>they</emphasis> want to make a change. + </para> + </listitem> + + <listitem> + <para> + Usually you want a configuration option for your kernel hack. 
+ Edit <filename>Config.in</filename> in the appropriate directory + (but under <filename>arch/</filename> it's called + <filename>config.in</filename>). The Config Language used is not + bash, even though it looks like bash; the safe way is to use only + the constructs that you already see in + <filename>Config.in</filename> files (see + <filename>Documentation/kbuild/kconfig-language.txt</filename>). + It's good to run "make xconfig" at least once to test (because + it's the only one with a static parser). + </para> + + <para> + Variables which can be Y or N use <type>bool</type> followed by a + tagline and the config define name (which must start with + CONFIG_). The <type>tristate</type> function is the same, but + allows the answer M (which defines + <symbol>CONFIG_foo_MODULE</symbol> in your source, instead of + <symbol>CONFIG_FOO</symbol>) if <symbol>CONFIG_MODULES</symbol> + is enabled. + </para> + + <para> + You may well want to make your CONFIG option only visible if + <symbol>CONFIG_EXPERIMENTAL</symbol> is enabled: this serves as a + warning to users. There many other fancy things you can do: see + the various <filename>Config.in</filename> files for ideas. + </para> + </listitem> + + <listitem> + <para> + Edit the <filename>Makefile</filename>: the CONFIG variables are + exported here so you can conditionalize compilation with `ifeq'. + If your file exports symbols then add the names to + <varname>export-objs</varname> so that genksyms will find them. + <caution> + <para> + There is a restriction on the kernel build system that objects + which export symbols must have globally unique names. + If your object does not have a globally unique name then the + standard fix is to move the + <function>EXPORT_SYMBOL()</function> statements to their own + object with a unique name. + This is why several systems have separate exporting objects, + usually suffixed with ksyms. + </para> + </caution> + </para> + </listitem> + + <listitem> + <para> + Document your option in Documentation/Configure.help. Mention + incompatibilities and issues here. <emphasis> Definitely + </emphasis> end your description with <quote> if in doubt, say N + </quote> (or, occasionally, `Y'); this is for people who have no + idea what you are talking about. + </para> + </listitem> + + <listitem> + <para> + Put yourself in <filename>CREDITS</filename> if you've done + something noteworthy, usually beyond a single file (your name + should be at the top of the source files anyway). + <filename>MAINTAINERS</filename> means you want to be consulted + when changes are made to a subsystem, and hear about bugs; it + implies a more-than-passing commitment to some part of the code. + </para> + </listitem> + + <listitem> + <para> + Finally, don't forget to read <filename>Documentation/SubmittingPatches</filename> + and possibly <filename>Documentation/SubmittingDrivers</filename>. + </para> + </listitem> + </itemizedlist> + </chapter> + + <chapter id="cantrips"> + <title>Kernel Cantrips</title> + + <para> + Some favorites from browsing the source. Feel free to add to this + list. 
+ </para> + + <para> + <filename>include/linux/brlock.h:</filename> + </para> + <programlisting> +extern inline void br_read_lock (enum brlock_indices idx) +{ + /* + * This causes a link-time bug message if an + * invalid index is used: + */ + if (idx >= __BR_END) + __br_lock_usage_bug(); + + read_lock(&__brlock_array[smp_processor_id()][idx]); +} + </programlisting> + + <para> + <filename>include/linux/fs.h</filename>: + </para> + <programlisting> +/* + * Kernel pointers have redundant information, so we can use a + * scheme where we can return either an error code or a dentry + * pointer with the same return value. + * + * This should be a per-architecture thing, to allow different + * error and pointer decisions. + */ + #define ERR_PTR(err) ((void *)((long)(err))) + #define PTR_ERR(ptr) ((long)(ptr)) + #define IS_ERR(ptr) ((unsigned long)(ptr) > (unsigned long)(-1000)) +</programlisting> + + <para> + <filename>include/asm-i386/uaccess.h:</filename> + </para> + + <programlisting> +#define copy_to_user(to,from,n) \ + (__builtin_constant_p(n) ? \ + __constant_copy_to_user((to),(from),(n)) : \ + __generic_copy_to_user((to),(from),(n))) + </programlisting> + + <para> + <filename>arch/sparc/kernel/head.S:</filename> + </para> + + <programlisting> +/* + * Sun people can't spell worth damn. "compatability" indeed. + * At least we *know* we can't spell, and use a spell-checker. + */ + +/* Uh, actually Linus it is I who cannot spell. Too much murky + * Sparc assembly will do this to ya. + */ +C_LABEL(cputypvar): + .asciz "compatability" + +/* Tested on SS-5, SS-10. Probably someone at Sun applied a spell-checker. */ + .align 4 +C_LABEL(cputypvar_sun4m): + .asciz "compatible" + </programlisting> + + <para> + <filename>arch/sparc/lib/checksum.S:</filename> + </para> + + <programlisting> + /* Sun, you just can't beat me, you just can't. Stop trying, + * give up. I'm serious, I am going to kick the living shit + * out of you, game over, lights out. + */ + </programlisting> + </chapter> + + <chapter id="credits"> + <title>Thanks</title> + + <para> + Thanks to Andi Kleen for the idea, answering my questions, fixing + my mistakes, filling content, etc. Philipp Rumpf for more spelling + and clarity fixes, and some excellent non-obvious points. Werner + Almesberger for giving me a great summary of + <function>disable_irq()</function>, and Jes Sorensen and Andrea + Arcangeli added caveats. Michael Elizabeth Chastain for checking + and adding to the Configure section. <!-- Rusty insisted on this + bit; I didn't do it! --> Telsa Gwynne for teaching me DocBook. 
+ </para> + </chapter> +</book> + diff --git a/Documentation/DocBook/kernel-locking.tmpl b/Documentation/DocBook/kernel-locking.tmpl new file mode 100644 index 000000000000..90dc2de8e0af --- /dev/null +++ b/Documentation/DocBook/kernel-locking.tmpl @@ -0,0 +1,2088 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" + "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []> + +<book id="LKLockingGuide"> + <bookinfo> + <title>Unreliable Guide To Locking</title> + + <authorgroup> + <author> + <firstname>Rusty</firstname> + <surname>Russell</surname> + <affiliation> + <address> + <email>rusty@rustcorp.com.au</email> + </address> + </affiliation> + </author> + </authorgroup> + + <copyright> + <year>2003</year> + <holder>Rusty Russell</holder> + </copyright> + + <legalnotice> + <para> + This documentation is free software; you can redistribute + it and/or modify it under the terms of the GNU General Public + License as published by the Free Software Foundation; either + version 2 of the License, or (at your option) any later + version. + </para> + + <para> + This program is distributed in the hope that it will be + useful, but WITHOUT ANY WARRANTY; without even the implied + warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + See the GNU General Public License for more details. + </para> + + <para> + You should have received a copy of the GNU General Public + License along with this program; if not, write to the Free + Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, + MA 02111-1307 USA + </para> + + <para> + For more details see the file COPYING in the source + distribution of Linux. + </para> + </legalnotice> + </bookinfo> + + <toc></toc> + <chapter id="intro"> + <title>Introduction</title> + <para> + Welcome, to Rusty's Remarkably Unreliable Guide to Kernel + Locking issues. This document describes the locking systems in + the Linux Kernel in 2.6. + </para> + <para> + With the wide availability of HyperThreading, and <firstterm + linkend="gloss-preemption">preemption </firstterm> in the Linux + Kernel, everyone hacking on the kernel needs to know the + fundamentals of concurrency and locking for + <firstterm linkend="gloss-smp"><acronym>SMP</acronym></firstterm>. + </para> + </chapter> + + <chapter id="races"> + <title>The Problem With Concurrency</title> + <para> + (Skip this if you know what a Race Condition is). 
+ </para> + <para> + In a normal program, you can increment a counter like so: + </para> + <programlisting> + very_important_count++; + </programlisting> + + <para> + This is what they would expect to happen: + </para> + + <table> + <title>Expected Results</title> + + <tgroup cols="2" align="left"> + + <thead> + <row> + <entry>Instance 1</entry> + <entry>Instance 2</entry> + </row> + </thead> + + <tbody> + <row> + <entry>read very_important_count (5)</entry> + <entry></entry> + </row> + <row> + <entry>add 1 (6)</entry> + <entry></entry> + </row> + <row> + <entry>write very_important_count (6)</entry> + <entry></entry> + </row> + <row> + <entry></entry> + <entry>read very_important_count (6)</entry> + </row> + <row> + <entry></entry> + <entry>add 1 (7)</entry> + </row> + <row> + <entry></entry> + <entry>write very_important_count (7)</entry> + </row> + </tbody> + + </tgroup> + </table> + + <para> + This is what might happen: + </para> + + <table> + <title>Possible Results</title> + + <tgroup cols="2" align="left"> + <thead> + <row> + <entry>Instance 1</entry> + <entry>Instance 2</entry> + </row> + </thead> + + <tbody> + <row> + <entry>read very_important_count (5)</entry> + <entry></entry> + </row> + <row> + <entry></entry> + <entry>read very_important_count (5)</entry> + </row> + <row> + <entry>add 1 (6)</entry> + <entry></entry> + </row> + <row> + <entry></entry> + <entry>add 1 (6)</entry> + </row> + <row> + <entry>write very_important_count (6)</entry> + <entry></entry> + </row> + <row> + <entry></entry> + <entry>write very_important_count (6)</entry> + </row> + </tbody> + </tgroup> + </table> + + <sect1 id="race-condition"> + <title>Race Conditions and Critical Regions</title> + <para> + This overlap, where the result depends on the + relative timing of multiple tasks, is called a <firstterm>race condition</firstterm>. + The piece of code containing the concurrency issue is called a + <firstterm>critical region</firstterm>. And especially since Linux starting running + on SMP machines, they became one of the major issues in kernel + design and implementation. + </para> + <para> + Preemption can have the same effect, even if there is only one + CPU: by preempting one task during the critical region, we have + exactly the same race condition. In this case the thread which + preempts might run the critical region itself. + </para> + <para> + The solution is to recognize when these simultaneous accesses + occur, and use locks to make sure that only one instance can + enter the critical region at any time. There are many + friendly primitives in the Linux kernel to help you do this. + And then there are the unfriendly primitives, but I'll pretend + they don't exist. + </para> + </sect1> + </chapter> + + <chapter id="locks"> + <title>Locking in the Linux Kernel</title> + + <para> + If I could give you one piece of advice: never sleep with anyone + crazier than yourself. But if I had to give you advice on + locking: <emphasis>keep it simple</emphasis>. + </para> + + <para> + Be reluctant to introduce new locks. + </para> + + <para> + Strangely enough, this last one is the exact reverse of my advice when + you <emphasis>have</emphasis> slept with someone crazier than yourself. + And you should think about getting a big dog. + </para> + + <sect1 id="lock-intro"> + <title>Two Main Types of Kernel Locks: Spinlocks and Semaphores</title> + + <para> + There are two main types of kernel locks. 
The fundamental type + is the spinlock + (<filename class="headerfile">include/asm/spinlock.h</filename>), + which is a very simple single-holder lock: if you can't get the + spinlock, you keep trying (spinning) until you can. Spinlocks are + very small and fast, and can be used anywhere. + </para> + <para> + The second type is a semaphore + (<filename class="headerfile">include/asm/semaphore.h</filename>): it + can have more than one holder at any time (the number decided at + initialization time), although it is most commonly used as a + single-holder lock (a mutex). If you can't get a semaphore, + your task will put itself on the queue, and be woken up when the + semaphore is released. This means the CPU will do something + else while you are waiting, but there are many cases when you + simply can't sleep (see <xref linkend="sleeping-things"/>), and so + have to use a spinlock instead. + </para> + <para> + Neither type of lock is recursive: see + <xref linkend="deadlock"/>. + </para> + </sect1> + + <sect1 id="uniprocessor"> + <title>Locks and Uniprocessor Kernels</title> + + <para> + For kernels compiled without <symbol>CONFIG_SMP</symbol>, and + without <symbol>CONFIG_PREEMPT</symbol> spinlocks do not exist at + all. This is an excellent design decision: when no-one else can + run at the same time, there is no reason to have a lock. + </para> + + <para> + If the kernel is compiled without <symbol>CONFIG_SMP</symbol>, + but <symbol>CONFIG_PREEMPT</symbol> is set, then spinlocks + simply disable preemption, which is sufficient to prevent any + races. For most purposes, we can think of preemption as + equivalent to SMP, and not worry about it separately. + </para> + + <para> + You should always test your locking code with <symbol>CONFIG_SMP</symbol> + and <symbol>CONFIG_PREEMPT</symbol> enabled, even if you don't have an SMP test box, because it + will still catch some kinds of locking bugs. + </para> + + <para> + Semaphores still exist, because they are required for + synchronization between <firstterm linkend="gloss-usercontext">user + contexts</firstterm>, as we will see below. + </para> + </sect1> + + <sect1 id="usercontextlocking"> + <title>Locking Only In User Context</title> + + <para> + If you have a data structure which is only ever accessed from + user context, then you can use a simple semaphore + (<filename>linux/asm/semaphore.h</filename>) to protect it. This + is the most trivial case: you initialize the semaphore to the number + of resources available (usually 1), and call + <function>down_interruptible()</function> to grab the semaphore, and + <function>up()</function> to release it. There is also a + <function>down()</function>, which should be avoided, because it + will not return if a signal is received. + </para> + + <para> + Example: <filename>linux/net/core/netfilter.c</filename> allows + registration of new <function>setsockopt()</function> and + <function>getsockopt()</function> calls, with + <function>nf_register_sockopt()</function>. Registration and + de-registration are only done on module load and unload (and boot + time, where there is no concurrency), and the list of registrations + is only consulted for an unknown <function>setsockopt()</function> + or <function>getsockopt()</function> system call. The + <varname>nf_sockopt_mutex</varname> is perfect to protect this, + especially since the setsockopt and getsockopt calls may well + sleep. 
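+ For illustration only (this is not the actual netfilter code, and
+ the names here are made up), a registration function protected this
+ way might look like:
+ <programlisting>
+struct myprot_ops
+{
+        struct list_head list;
+        /* ... setsockopt/getsockopt callbacks ... */
+};
+
+/* Hypothetical registration list, protected by a semaphore used as a mutex */
+static DECLARE_MUTEX(myprot_mutex);
+static LIST_HEAD(myprot_list);
+
+int myprot_register(struct myprot_ops *ops)
+{
+        /* May sleep, so only legal in user context */
+        if (down_interruptible(&myprot_mutex))
+                return -EINTR;
+        list_add(&ops->list, &myprot_list);
+        up(&myprot_mutex);
+        return 0;
+}
+ </programlisting>
+ Unregistration would take the same semaphore around the
+ <function>list_del()</function>.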
+ </para> + </sect1> + + <sect1 id="lock-user-bh"> + <title>Locking Between User Context and Softirqs</title> + + <para> + If a <firstterm linkend="gloss-softirq">softirq</firstterm> shares + data with user context, you have two problems. Firstly, the current + user context can be interrupted by a softirq, and secondly, the + critical region could be entered from another CPU. This is where + <function>spin_lock_bh()</function> + (<filename class="headerfile">include/linux/spinlock.h</filename>) is + used. It disables softirqs on that CPU, then grabs the lock. + <function>spin_unlock_bh()</function> does the reverse. (The + '_bh' suffix is a historical reference to "Bottom Halves", the + old name for software interrupts. It should really be + called spin_lock_softirq()' in a perfect world). + </para> + + <para> + Note that you can also use <function>spin_lock_irq()</function> + or <function>spin_lock_irqsave()</function> here, which stop + hardware interrupts as well: see <xref linkend="hardirq-context"/>. + </para> + + <para> + This works perfectly for <firstterm linkend="gloss-up"><acronym>UP + </acronym></firstterm> as well: the spin lock vanishes, and this macro + simply becomes <function>local_bh_disable()</function> + (<filename class="headerfile">include/linux/interrupt.h</filename>), which + protects you from the softirq being run. + </para> + </sect1> + + <sect1 id="lock-user-tasklet"> + <title>Locking Between User Context and Tasklets</title> + + <para> + This is exactly the same as above, because <firstterm + linkend="gloss-tasklet">tasklets</firstterm> are actually run + from a softirq. + </para> + </sect1> + + <sect1 id="lock-user-timers"> + <title>Locking Between User Context and Timers</title> + + <para> + This, too, is exactly the same as above, because <firstterm + linkend="gloss-timers">timers</firstterm> are actually run from + a softirq. From a locking point of view, tasklets and timers + are identical. + </para> + </sect1> + + <sect1 id="lock-tasklets"> + <title>Locking Between Tasklets/Timers</title> + + <para> + Sometimes a tasklet or timer might want to share data with + another tasklet or timer. + </para> + + <sect2 id="lock-tasklets-same"> + <title>The Same Tasklet/Timer</title> + <para> + Since a tasklet is never run on two CPUs at once, you don't + need to worry about your tasklet being reentrant (running + twice at once), even on SMP. + </para> + </sect2> + + <sect2 id="lock-tasklets-different"> + <title>Different Tasklets/Timers</title> + <para> + If another tasklet/timer wants + to share data with your tasklet or timer , you will both need to use + <function>spin_lock()</function> and + <function>spin_unlock()</function> calls. + <function>spin_lock_bh()</function> is + unnecessary here, as you are already in a tasklet, and + none will be run on the same CPU. + </para> + </sect2> + </sect1> + + <sect1 id="lock-softirqs"> + <title>Locking Between Softirqs</title> + + <para> + Often a softirq might + want to share data with itself or a tasklet/timer. + </para> + + <sect2 id="lock-softirqs-same"> + <title>The Same Softirq</title> + + <para> + The same softirq can run on the other CPUs: you can use a + per-CPU array (see <xref linkend="per-cpu"/>) for better + performance. If you're going so far as to use a softirq, + you probably care about scalable performance enough + to justify the extra complexity. + </para> + + <para> + You'll need to use <function>spin_lock()</function> and + <function>spin_unlock()</function> for shared data. 
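+ For example, a counter shared between all instances of the same
+ softirq might be protected like this (a sketch; the names are made
+ up):
+ <programlisting>
+/* Hypothetical statistics shared by every CPU running this softirq */
+static spinlock_t stats_lock = SPIN_LOCK_UNLOCKED;
+static unsigned long total_packets;
+
+/* Called from the softirq handler: another CPU may be in here too */
+static void count_packet(void)
+{
+        spin_lock(&stats_lock);
+        total_packets++;
+        spin_unlock(&stats_lock);
+}
+ </programlisting>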
+ </para>
+ </sect2>
+
+ <sect2 id="lock-softirqs-different">
+ <title>Different Softirqs</title>
+
+ <para>
+ You'll need to use <function>spin_lock()</function> and
+ <function>spin_unlock()</function> for shared data, whether it
+ be a timer, tasklet, different softirq or the same or another
+ softirq: any of them could be running on a different CPU.
+ </para>
+ </sect2>
+ </sect1>
+ </chapter>
+
+ <chapter id="hardirq-context">
+ <title>Hard IRQ Context</title>
+
+ <para>
+ Hardware interrupts usually communicate with a
+ tasklet or softirq. Frequently this involves putting work in a
+ queue, which the softirq will take out.
+ </para>
+
+ <sect1 id="hardirq-softirq">
+ <title>Locking Between Hard IRQ and Softirqs/Tasklets</title>
+
+ <para>
+ If a hardware irq handler shares data with a softirq, you have
+ two concerns. Firstly, the softirq processing can be
+ interrupted by a hardware interrupt, and secondly, the
+ critical region could be entered by a hardware interrupt on
+ another CPU. This is where <function>spin_lock_irq()</function> is
+ used. It is defined to disable interrupts on that CPU, then grab
+ the lock. <function>spin_unlock_irq()</function> does the reverse.
+ </para>
+
+ <para>
+ The irq handler does not need to use
+ <function>spin_lock_irq()</function>, because the softirq cannot
+ run while the irq handler is running: it can use
+ <function>spin_lock()</function>, which is slightly faster. The
+ only exception would be if a different hardware irq handler uses
+ the same lock: <function>spin_lock_irq()</function> will stop
+ that from interrupting us.
+ </para>
+
+ <para>
+ This works perfectly for UP as well: the spin lock vanishes,
+ and this macro simply becomes <function>local_irq_disable()</function>
+ (<filename class="headerfile">include/asm/smp.h</filename>), which
+ protects you from the softirq/tasklet/BH being run.
+ </para>
+
+ <para>
+ <function>spin_lock_irqsave()</function>
+ (<filename>include/linux/spinlock.h</filename>) is a variant
+ which saves whether interrupts were on or off in a flags word,
+ which is passed to <function>spin_unlock_irqrestore()</function>. This
+ means that the same code can be used inside a hard irq handler (where
+ interrupts are already off) and in softirqs (where the irq
+ disabling is required).
+ </para>
+
+ <para>
+ Note that softirqs (and hence tasklets and timers) are run on
+ return from hardware interrupts, so
+ <function>spin_lock_irq()</function> also stops these. In that
+ sense, <function>spin_lock_irqsave()</function> is the most
+ general and powerful locking function.
+ </para>
+
+ </sect1>
+ <sect1 id="hardirq-hardirq">
+ <title>Locking Between Two Hard IRQ Handlers</title>
+ <para>
+ It is rare to have to share data between two IRQ handlers, but
+ if you do, <function>spin_lock_irqsave()</function> should be
+ used: it is architecture-specific whether all interrupts are
+ disabled inside irq handlers themselves.
+ </para>
+ </sect1>
+
+ </chapter>
+
+ <chapter id="cheatsheet">
+ <title>Cheat Sheet For Locking</title>
+ <para>
+ Pete Zaitcev gives the following summary:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ If you are in a process context (any syscall) and want to
+ lock other processes out, use a semaphore. You can take a semaphore
+ and sleep (<function>copy_from_user()</function> or
+ <function>kmalloc(x,GFP_KERNEL)</function>).
+ </para> + </listitem> + <listitem> + <para> + Otherwise (== data can be touched in an interrupt), use + <function>spin_lock_irqsave()</function> and + <function>spin_unlock_irqrestore()</function>. + </para> + </listitem> + <listitem> + <para> + Avoid holding spinlock for more than 5 lines of code and + across any function call (except accessors like + <function>readb</function>). + </para> + </listitem> + </itemizedlist> + + <sect1 id="minimum-lock-reqirements"> + <title>Table of Minimum Requirements</title> + + <para> The following table lists the <emphasis>minimum</emphasis> + locking requirements between various contexts. In some cases, + the same context can only be running on one CPU at a time, so + no locking is required for that context (eg. a particular + thread can only run on one CPU at a time, but if it needs + shares data with another thread, locking is required). + </para> + <para> + Remember the advice above: you can always use + <function>spin_lock_irqsave()</function>, which is a superset + of all other spinlock primitives. + </para> + <table> +<title>Table of Locking Requirements</title> +<tgroup cols="11"> +<tbody> +<row> +<entry></entry> +<entry>IRQ Handler A</entry> +<entry>IRQ Handler B</entry> +<entry>Softirq A</entry> +<entry>Softirq B</entry> +<entry>Tasklet A</entry> +<entry>Tasklet B</entry> +<entry>Timer A</entry> +<entry>Timer B</entry> +<entry>User Context A</entry> +<entry>User Context B</entry> +</row> + +<row> +<entry>IRQ Handler A</entry> +<entry>None</entry> +</row> + +<row> +<entry>IRQ Handler B</entry> +<entry>spin_lock_irqsave</entry> +<entry>None</entry> +</row> + +<row> +<entry>Softirq A</entry> +<entry>spin_lock_irq</entry> +<entry>spin_lock_irq</entry> +<entry>spin_lock</entry> +</row> + +<row> +<entry>Softirq B</entry> +<entry>spin_lock_irq</entry> +<entry>spin_lock_irq</entry> +<entry>spin_lock</entry> +<entry>spin_lock</entry> +</row> + +<row> +<entry>Tasklet A</entry> +<entry>spin_lock_irq</entry> +<entry>spin_lock_irq</entry> +<entry>spin_lock</entry> +<entry>spin_lock</entry> +<entry>None</entry> +</row> + +<row> +<entry>Tasklet B</entry> +<entry>spin_lock_irq</entry> +<entry>spin_lock_irq</entry> +<entry>spin_lock</entry> +<entry>spin_lock</entry> +<entry>spin_lock</entry> +<entry>None</entry> +</row> + +<row> +<entry>Timer A</entry> +<entry>spin_lock_irq</entry> +<entry>spin_lock_irq</entry> +<entry>spin_lock</entry> +<entry>spin_lock</entry> +<entry>spin_lock</entry> +<entry>spin_lock</entry> +<entry>None</entry> +</row> + +<row> +<entry>Timer B</entry> +<entry>spin_lock_irq</entry> +<entry>spin_lock_irq</entry> +<entry>spin_lock</entry> +<entry>spin_lock</entry> +<entry>spin_lock</entry> +<entry>spin_lock</entry> +<entry>spin_lock</entry> +<entry>None</entry> +</row> + +<row> +<entry>User Context A</entry> +<entry>spin_lock_irq</entry> +<entry>spin_lock_irq</entry> +<entry>spin_lock_bh</entry> +<entry>spin_lock_bh</entry> +<entry>spin_lock_bh</entry> +<entry>spin_lock_bh</entry> +<entry>spin_lock_bh</entry> +<entry>spin_lock_bh</entry> +<entry>None</entry> +</row> + +<row> +<entry>User Context B</entry> +<entry>spin_lock_irq</entry> +<entry>spin_lock_irq</entry> +<entry>spin_lock_bh</entry> +<entry>spin_lock_bh</entry> +<entry>spin_lock_bh</entry> +<entry>spin_lock_bh</entry> +<entry>spin_lock_bh</entry> +<entry>spin_lock_bh</entry> +<entry>down_interruptible</entry> +<entry>None</entry> +</row> + +</tbody> +</tgroup> +</table> +</sect1> +</chapter> + + <chapter id="Examples"> + <title>Common Examples</title> + <para> +Let's step 
through a simple example: a cache of number to name +mappings. The cache keeps a count of how often each of the objects is +used, and when it gets full, throws out the least used one. + + </para> + + <sect1 id="examples-usercontext"> + <title>All In User Context</title> + <para> +For our first example, we assume that all operations are in user +context (ie. from system calls), so we can sleep. This means we can +use a semaphore to protect the cache and all the objects within +it. Here's the code: + </para> + + <programlisting> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/string.h> +#include <asm/semaphore.h> +#include <asm/errno.h> + +struct object +{ + struct list_head list; + int id; + char name[32]; + int popularity; +}; + +/* Protects the cache, cache_num, and the objects within it */ +static DECLARE_MUTEX(cache_lock); +static LIST_HEAD(cache); +static unsigned int cache_num = 0; +#define MAX_CACHE_SIZE 10 + +/* Must be holding cache_lock */ +static struct object *__cache_find(int id) +{ + struct object *i; + + list_for_each_entry(i, &cache, list) + if (i->id == id) { + i->popularity++; + return i; + } + return NULL; +} + +/* Must be holding cache_lock */ +static void __cache_delete(struct object *obj) +{ + BUG_ON(!obj); + list_del(&obj->list); + kfree(obj); + cache_num--; +} + +/* Must be holding cache_lock */ +static void __cache_add(struct object *obj) +{ + list_add(&obj->list, &cache); + if (++cache_num > MAX_CACHE_SIZE) { + struct object *i, *outcast = NULL; + list_for_each_entry(i, &cache, list) { + if (!outcast || i->popularity < outcast->popularity) + outcast = i; + } + __cache_delete(outcast); + } +} + +int cache_add(int id, const char *name) +{ + struct object *obj; + + if ((obj = kmalloc(sizeof(*obj), GFP_KERNEL)) == NULL) + return -ENOMEM; + + strlcpy(obj->name, name, sizeof(obj->name)); + obj->id = id; + obj->popularity = 0; + + down(&cache_lock); + __cache_add(obj); + up(&cache_lock); + return 0; +} + +void cache_delete(int id) +{ + down(&cache_lock); + __cache_delete(__cache_find(id)); + up(&cache_lock); +} + +int cache_find(int id, char *name) +{ + struct object *obj; + int ret = -ENOENT; + + down(&cache_lock); + obj = __cache_find(id); + if (obj) { + ret = 0; + strcpy(name, obj->name); + } + up(&cache_lock); + return ret; +} +</programlisting> + + <para> +Note that we always make sure we have the cache_lock when we add, +delete, or look up the cache: both the cache infrastructure itself and +the contents of the objects are protected by the lock. In this case +it's easy, since we copy the data for the user, and never let them +access the objects directly. + </para> + <para> +There is a slight (and common) optimization here: in +<function>cache_add</function> we set up the fields of the object +before grabbing the lock. This is safe, as no-one else can access it +until we put it in cache. + </para> + </sect1> + + <sect1 id="examples-interrupt"> + <title>Accessing From Interrupt Context</title> + <para> +Now consider the case where <function>cache_find</function> can be +called from interrupt context: either a hardware interrupt or a +softirq. An example would be a timer which deletes object from the +cache. + </para> + <para> +The change is shown below, in standard patch format: the +<symbol>-</symbol> are lines which are taken away, and the +<symbol>+</symbol> are lines which are added. 
+ </para> +<programlisting> +--- cache.c.usercontext 2003-12-09 13:58:54.000000000 +1100 ++++ cache.c.interrupt 2003-12-09 14:07:49.000000000 +1100 +@@ -12,7 +12,7 @@ + int popularity; + }; + +-static DECLARE_MUTEX(cache_lock); ++static spinlock_t cache_lock = SPIN_LOCK_UNLOCKED; + static LIST_HEAD(cache); + static unsigned int cache_num = 0; + #define MAX_CACHE_SIZE 10 +@@ -55,6 +55,7 @@ + int cache_add(int id, const char *name) + { + struct object *obj; ++ unsigned long flags; + + if ((obj = kmalloc(sizeof(*obj), GFP_KERNEL)) == NULL) + return -ENOMEM; +@@ -63,30 +64,33 @@ + obj->id = id; + obj->popularity = 0; + +- down(&cache_lock); ++ spin_lock_irqsave(&cache_lock, flags); + __cache_add(obj); +- up(&cache_lock); ++ spin_unlock_irqrestore(&cache_lock, flags); + return 0; + } + + void cache_delete(int id) + { +- down(&cache_lock); ++ unsigned long flags; ++ ++ spin_lock_irqsave(&cache_lock, flags); + __cache_delete(__cache_find(id)); +- up(&cache_lock); ++ spin_unlock_irqrestore(&cache_lock, flags); + } + + int cache_find(int id, char *name) + { + struct object *obj; + int ret = -ENOENT; ++ unsigned long flags; + +- down(&cache_lock); ++ spin_lock_irqsave(&cache_lock, flags); + obj = __cache_find(id); + if (obj) { + ret = 0; + strcpy(name, obj->name); + } +- up(&cache_lock); ++ spin_unlock_irqrestore(&cache_lock, flags); + return ret; + } +</programlisting> + + <para> +Note that the <function>spin_lock_irqsave</function> will turn off +interrupts if they are on, otherwise does nothing (if we are already +in an interrupt handler), hence these functions are safe to call from +any context. + </para> + <para> +Unfortunately, <function>cache_add</function> calls +<function>kmalloc</function> with the <symbol>GFP_KERNEL</symbol> +flag, which is only legal in user context. I have assumed that +<function>cache_add</function> is still only called in user context, +otherwise this should become a parameter to +<function>cache_add</function>. + </para> + </sect1> + <sect1 id="examples-refcnt"> + <title>Exposing Objects Outside This File</title> + <para> +If our objects contained more information, it might not be sufficient +to copy the information in and out: other parts of the code might want +to keep pointers to these objects, for example, rather than looking up +the id every time. This produces two problems. + </para> + <para> +The first problem is that we use the <symbol>cache_lock</symbol> to +protect objects: we'd need to make this non-static so the rest of the +code can use it. This makes locking trickier, as it is no longer all +in one place. + </para> + <para> +The second problem is the lifetime problem: if another structure keeps +a pointer to an object, it presumably expects that pointer to remain +valid. Unfortunately, this is only guaranteed while you hold the +lock, otherwise someone might call <function>cache_delete</function> +and even worse, add another object, re-using the same address. + </para> + <para> +As there is only one lock, you can't hold it forever: no-one else would +get any work done. + </para> + <para> +The solution to this problem is to use a reference count: everyone who +has a pointer to the object increases it when they first get the +object, and drops the reference count when they're finished with it. +Whoever drops it to zero knows it is unused, and can actually delete it. 
+ </para> + <para> +Here is the code: + </para> + +<programlisting> +--- cache.c.interrupt 2003-12-09 14:25:43.000000000 +1100 ++++ cache.c.refcnt 2003-12-09 14:33:05.000000000 +1100 +@@ -7,6 +7,7 @@ + struct object + { + struct list_head list; ++ unsigned int refcnt; + int id; + char name[32]; + int popularity; +@@ -17,6 +18,35 @@ + static unsigned int cache_num = 0; + #define MAX_CACHE_SIZE 10 + ++static void __object_put(struct object *obj) ++{ ++ if (--obj->refcnt == 0) ++ kfree(obj); ++} ++ ++static void __object_get(struct object *obj) ++{ ++ obj->refcnt++; ++} ++ ++void object_put(struct object *obj) ++{ ++ unsigned long flags; ++ ++ spin_lock_irqsave(&cache_lock, flags); ++ __object_put(obj); ++ spin_unlock_irqrestore(&cache_lock, flags); ++} ++ ++void object_get(struct object *obj) ++{ ++ unsigned long flags; ++ ++ spin_lock_irqsave(&cache_lock, flags); ++ __object_get(obj); ++ spin_unlock_irqrestore(&cache_lock, flags); ++} ++ + /* Must be holding cache_lock */ + static struct object *__cache_find(int id) + { +@@ -35,6 +65,7 @@ + { + BUG_ON(!obj); + list_del(&obj->list); ++ __object_put(obj); + cache_num--; + } + +@@ -63,6 +94,7 @@ + strlcpy(obj->name, name, sizeof(obj->name)); + obj->id = id; + obj->popularity = 0; ++ obj->refcnt = 1; /* The cache holds a reference */ + + spin_lock_irqsave(&cache_lock, flags); + __cache_add(obj); +@@ -79,18 +111,15 @@ + spin_unlock_irqrestore(&cache_lock, flags); + } + +-int cache_find(int id, char *name) ++struct object *cache_find(int id) + { + struct object *obj; +- int ret = -ENOENT; + unsigned long flags; + + spin_lock_irqsave(&cache_lock, flags); + obj = __cache_find(id); +- if (obj) { +- ret = 0; +- strcpy(name, obj->name); +- } ++ if (obj) ++ __object_get(obj); + spin_unlock_irqrestore(&cache_lock, flags); +- return ret; ++ return obj; + } +</programlisting> + +<para> +We encapsulate the reference counting in the standard 'get' and 'put' +functions. Now we can return the object itself from +<function>cache_find</function> which has the advantage that the user +can now sleep holding the object (eg. to +<function>copy_to_user</function> to name to userspace). +</para> +<para> +The other point to note is that I said a reference should be held for +every pointer to the object: thus the reference count is 1 when first +inserted into the cache. In some versions the framework does not hold +a reference count, but they are more complicated. +</para> + + <sect2 id="examples-refcnt-atomic"> + <title>Using Atomic Operations For The Reference Count</title> +<para> +In practice, <type>atomic_t</type> would usually be used for +<structfield>refcnt</structfield>. There are a number of atomic +operations defined in + +<filename class="headerfile">include/asm/atomic.h</filename>: these are +guaranteed to be seen atomically from all CPUs in the system, so no +lock is required. In this case, it is simpler than using spinlocks, +although for anything non-trivial using spinlocks is clearer. The +<function>atomic_inc</function> and +<function>atomic_dec_and_test</function> are used instead of the +standard increment and decrement operators, and the lock is no longer +used to protect the reference count itself. 
+</para> + +<programlisting> +--- cache.c.refcnt 2003-12-09 15:00:35.000000000 +1100 ++++ cache.c.refcnt-atomic 2003-12-11 15:49:42.000000000 +1100 +@@ -7,7 +7,7 @@ + struct object + { + struct list_head list; +- unsigned int refcnt; ++ atomic_t refcnt; + int id; + char name[32]; + int popularity; +@@ -18,33 +18,15 @@ + static unsigned int cache_num = 0; + #define MAX_CACHE_SIZE 10 + +-static void __object_put(struct object *obj) +-{ +- if (--obj->refcnt == 0) +- kfree(obj); +-} +- +-static void __object_get(struct object *obj) +-{ +- obj->refcnt++; +-} +- + void object_put(struct object *obj) + { +- unsigned long flags; +- +- spin_lock_irqsave(&cache_lock, flags); +- __object_put(obj); +- spin_unlock_irqrestore(&cache_lock, flags); ++ if (atomic_dec_and_test(&obj->refcnt)) ++ kfree(obj); + } + + void object_get(struct object *obj) + { +- unsigned long flags; +- +- spin_lock_irqsave(&cache_lock, flags); +- __object_get(obj); +- spin_unlock_irqrestore(&cache_lock, flags); ++ atomic_inc(&obj->refcnt); + } + + /* Must be holding cache_lock */ +@@ -65,7 +47,7 @@ + { + BUG_ON(!obj); + list_del(&obj->list); +- __object_put(obj); ++ object_put(obj); + cache_num--; + } + +@@ -94,7 +76,7 @@ + strlcpy(obj->name, name, sizeof(obj->name)); + obj->id = id; + obj->popularity = 0; +- obj->refcnt = 1; /* The cache holds a reference */ ++ atomic_set(&obj->refcnt, 1); /* The cache holds a reference */ + + spin_lock_irqsave(&cache_lock, flags); + __cache_add(obj); +@@ -119,7 +101,7 @@ + spin_lock_irqsave(&cache_lock, flags); + obj = __cache_find(id); + if (obj) +- __object_get(obj); ++ object_get(obj); + spin_unlock_irqrestore(&cache_lock, flags); + return obj; + } +</programlisting> +</sect2> +</sect1> + + <sect1 id="examples-lock-per-obj"> + <title>Protecting The Objects Themselves</title> + <para> +In these examples, we assumed that the objects (except the reference +counts) never changed once they are created. If we wanted to allow +the name to change, there are three possibilities: + </para> + <itemizedlist> + <listitem> + <para> +You can make <symbol>cache_lock</symbol> non-static, and tell people +to grab that lock before changing the name in any object. + </para> + </listitem> + <listitem> + <para> +You can provide a <function>cache_obj_rename</function> which grabs +this lock and changes the name for the caller, and tell everyone to +use that function. + </para> + </listitem> + <listitem> + <para> +You can make the <symbol>cache_lock</symbol> protect only the cache +itself, and use another lock to protect the name. + </para> + </listitem> + </itemizedlist> + + <para> +Theoretically, you can make the locks as fine-grained as one lock for +every field, for every object. In practice, the most common variants +are: +</para> + <itemizedlist> + <listitem> + <para> +One lock which protects the infrastructure (the <symbol>cache</symbol> +list in this example) and all the objects. This is what we have done +so far. + </para> + </listitem> + <listitem> + <para> +One lock which protects the infrastructure (including the list +pointers inside the objects), and one lock inside the object which +protects the rest of that object. + </para> + </listitem> + <listitem> + <para> +Multiple locks to protect the infrastructure (eg. one lock per hash +chain), possibly with a separate per-object lock. 
+ </para> + </listitem> + </itemizedlist> + +<para> +Here is the "lock-per-object" implementation: +</para> +<programlisting> +--- cache.c.refcnt-atomic 2003-12-11 15:50:54.000000000 +1100 ++++ cache.c.perobjectlock 2003-12-11 17:15:03.000000000 +1100 +@@ -6,11 +6,17 @@ + + struct object + { ++ /* These two protected by cache_lock. */ + struct list_head list; ++ int popularity; ++ + atomic_t refcnt; ++ ++ /* Doesn't change once created. */ + int id; ++ ++ spinlock_t lock; /* Protects the name */ + char name[32]; +- int popularity; + }; + + static spinlock_t cache_lock = SPIN_LOCK_UNLOCKED; +@@ -77,6 +84,7 @@ + obj->id = id; + obj->popularity = 0; + atomic_set(&obj->refcnt, 1); /* The cache holds a reference */ ++ spin_lock_init(&obj->lock); + + spin_lock_irqsave(&cache_lock, flags); + __cache_add(obj); +</programlisting> + +<para> +Note that I decide that the <structfield>popularity</structfield> +count should be protected by the <symbol>cache_lock</symbol> rather +than the per-object lock: this is because it (like the +<structname>struct list_head</structname> inside the object) is +logically part of the infrastructure. This way, I don't need to grab +the lock of every object in <function>__cache_add</function> when +seeking the least popular. +</para> + +<para> +I also decided that the <structfield>id</structfield> member is +unchangeable, so I don't need to grab each object lock in +<function>__cache_find()</function> to examine the +<structfield>id</structfield>: the object lock is only used by a +caller who wants to read or write the <structfield>name</structfield> +field. +</para> + +<para> +Note also that I added a comment describing what data was protected by +which locks. This is extremely important, as it describes the runtime +behavior of the code, and can be hard to gain from just reading. And +as Alan Cox says, <quote>Lock data, not code</quote>. +</para> +</sect1> +</chapter> + + <chapter id="common-problems"> + <title>Common Problems</title> + <sect1 id="deadlock"> + <title>Deadlock: Simple and Advanced</title> + + <para> + There is a coding bug where a piece of code tries to grab a + spinlock twice: it will spin forever, waiting for the lock to + be released (spinlocks, rwlocks and semaphores are not + recursive in Linux). This is trivial to diagnose: not a + stay-up-five-nights-talk-to-fluffy-code-bunnies kind of + problem. + </para> + + <para> + For a slightly more complex case, imagine you have a region + shared by a softirq and user context. If you use a + <function>spin_lock()</function> call to protect it, it is + possible that the user context will be interrupted by the softirq + while it holds the lock, and the softirq will then spin + forever trying to get the same lock. + </para> + + <para> + Both of these are called deadlock, and as shown above, it can + occur even with a single CPU (although not on UP compiles, + since spinlocks vanish on kernel compiles with + <symbol>CONFIG_SMP</symbol>=n. You'll still get data corruption + in the second example). + </para> + + <para> + This complete lockup is easy to diagnose: on SMP boxes the + watchdog timer or compiling with <symbol>DEBUG_SPINLOCKS</symbol> set + (<filename>include/linux/spinlock.h</filename>) will show this up + immediately when it happens. + </para> + + <para> + A more complex problem is the so-called 'deadly embrace', + involving two or more locks. Say you have a hash table: each + entry in the table is a spinlock, and a chain of hashed + objects. 
Inside a softirq handler, you sometimes want to + alter an object from one place in the hash to another: you + grab the spinlock of the old hash chain and the spinlock of + the new hash chain, and delete the object from the old one, + and insert it in the new one. + </para> + + <para> + There are two problems here. First, if your code ever + tries to move the object to the same chain, it will deadlock + with itself as it tries to lock it twice. Secondly, if the + same softirq on another CPU is trying to move another object + in the reverse direction, the following could happen: + </para> + + <table> + <title>Consequences</title> + + <tgroup cols="2" align="left"> + + <thead> + <row> + <entry>CPU 1</entry> + <entry>CPU 2</entry> + </row> + </thead> + + <tbody> + <row> + <entry>Grab lock A -> OK</entry> + <entry>Grab lock B -> OK</entry> + </row> + <row> + <entry>Grab lock B -> spin</entry> + <entry>Grab lock A -> spin</entry> + </row> + </tbody> + </tgroup> + </table> + + <para> + The two CPUs will spin forever, waiting for the other to give up + their lock. It will look, smell, and feel like a crash. + </para> + </sect1> + + <sect1 id="techs-deadlock-prevent"> + <title>Preventing Deadlock</title> + + <para> + Textbooks will tell you that if you always lock in the same + order, you will never get this kind of deadlock. Practice + will tell you that this approach doesn't scale: when I + create a new lock, I don't understand enough of the kernel + to figure out where in the 5000 lock hierarchy it will fit. + </para> + + <para> + The best locks are encapsulated: they never get exposed in + headers, and are never held around calls to non-trivial + functions outside the same file. You can read through this + code and see that it will never deadlock, because it never + tries to grab another lock while it has that one. People + using your code don't even need to know you are using a + lock. + </para> + + <para> + A classic problem here is when you provide callbacks or + hooks: if you call these with the lock held, you risk simple + deadlock, or a deadly embrace (who knows what the callback + will do?). Remember, the other programmers are out to get + you, so don't do this. + </para> + + <sect2 id="techs-deadlock-overprevent"> + <title>Overzealous Prevention Of Deadlocks</title> + + <para> + Deadlocks are problematic, but not as bad as data + corruption. Code which grabs a read lock, searches a list, + fails to find what it wants, drops the read lock, grabs a + write lock and inserts the object has a race condition. + </para> + + <para> + If you don't see why, please stay the fuck away from my code. + </para> + </sect2> + </sect1> + + <sect1 id="racing-timers"> + <title>Racing Timers: A Kernel Pastime</title> + + <para> + Timers can produce their own special problems with races. + Consider a collection of objects (list, hash, etc) where each + object has a timer which is due to destroy it. 
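+ For concreteness, such an object might look like this (a sketch:
+ <function>foo_expire()</function> and <symbol>FOO_TIMEOUT</symbol>
+ are made up, but the fields match the listings below):
+ <programlisting>
+struct foo {
+        struct foo *next;              /* protected by list_lock */
+        struct timer_list timer;       /* destroys this object when it fires */
+        /* ... */
+};
+
+/* Timer callback: unlinks and frees the object (defined elsewhere) */
+static void foo_expire(unsigned long data);
+
+static void foo_arm_timer(struct foo *f)
+{
+        init_timer(&f->timer);
+        f->timer.function = foo_expire;
+        f->timer.data = (unsigned long)f;
+        f->timer.expires = jiffies + FOO_TIMEOUT;
+        add_timer(&f->timer);
+}
+ </programlisting>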
+ </para> + + <para> + If you want to destroy the entire collection (say on module + removal), you might do the following: + </para> + + <programlisting> + /* THIS CODE BAD BAD BAD BAD: IF IT WAS ANY WORSE IT WOULD USE + HUNGARIAN NOTATION */ + spin_lock_bh(&list_lock); + + while (list) { + struct foo *next = list->next; + del_timer(&list->timer); + kfree(list); + list = next; + } + + spin_unlock_bh(&list_lock); + </programlisting> + + <para> + Sooner or later, this will crash on SMP, because a timer can + have just gone off before the <function>spin_lock_bh()</function>, + and it will only get the lock after we + <function>spin_unlock_bh()</function>, and then try to free + the element (which has already been freed!). + </para> + + <para> + This can be avoided by checking the result of + <function>del_timer()</function>: if it returns + <returnvalue>1</returnvalue>, the timer has been deleted. + If <returnvalue>0</returnvalue>, it means (in this + case) that it is currently running, so we can do: + </para> + + <programlisting> + retry: + spin_lock_bh(&list_lock); + + while (list) { + struct foo *next = list->next; + if (!del_timer(&list->timer)) { + /* Give timer a chance to delete this */ + spin_unlock_bh(&list_lock); + goto retry; + } + kfree(list); + list = next; + } + + spin_unlock_bh(&list_lock); + </programlisting> + + <para> + Another common problem is deleting timers which restart + themselves (by calling <function>add_timer()</function> at the end + of their timer function). Because this is a fairly common case + which is prone to races, you should use <function>del_timer_sync()</function> + (<filename class="headerfile">include/linux/timer.h</filename>) + to handle this case. It returns the number of times the timer + had to be deleted before we finally stopped it from adding itself back + in. + </para> + </sect1> + + </chapter> + + <chapter id="Efficiency"> + <title>Locking Speed</title> + + <para> +There are three main things to worry about when considering speed of +some code which does locking. First is concurrency: how many things +are going to be waiting while someone else is holding a lock. Second +is the time taken to actually acquire and release an uncontended lock. +Third is using fewer, or smarter locks. I'm assuming that the lock is +used fairly often: otherwise, you wouldn't be concerned about +efficiency. +</para> + <para> +Concurrency depends on how long the lock is usually held: you should +hold the lock for as long as needed, but no longer. In the cache +example, we always create the object without the lock held, and then +grab the lock only when we are ready to insert it in the list. +</para> + <para> +Acquisition times depend on how much damage the lock operations do to +the pipeline (pipeline stalls) and how likely it is that this CPU was +the last one to grab the lock (ie. is the lock cache-hot for this +CPU): on a machine with more CPUs, this likelihood drops fast. +Consider a 700MHz Intel Pentium III: an instruction takes about 0.7ns, +an atomic increment takes about 58ns, a lock which is cache-hot on +this CPU takes 160ns, and a cacheline transfer from another CPU takes +an additional 170 to 360ns. (These figures from Paul McKenney's +<ulink url="http://www.linuxjournal.com/article.php?sid=6993"> Linux +Journal RCU article</ulink>). 
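+ To put those figures together: taking a lock whose cacheline has to
+ come from another CPU costs roughly 330 to 520ns (160ns plus the 170
+ to 360ns transfer), which at 0.7ns per instruction is time for
+ several hundred ordinary instructions.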
+</para> + <para> +These two aims conflict: holding a lock for a short time might be done +by splitting locks into parts (such as in our final per-object-lock +example), but this increases the number of lock acquisitions, and the +results are often slower than having a single lock. This is another +reason to advocate locking simplicity. +</para> + <para> +The third concern is addressed below: there are some methods to reduce +the amount of locking which needs to be done. +</para> + + <sect1 id="efficiency-rwlocks"> + <title>Read/Write Lock Variants</title> + + <para> + Both spinlocks and semaphores have read/write variants: + <type>rwlock_t</type> and <structname>struct rw_semaphore</structname>. + These divide users into two classes: the readers and the writers. If + you are only reading the data, you can get a read lock, but to write to + the data you need the write lock. Many people can hold a read lock, + but a writer must be sole holder. + </para> + + <para> + If your code divides neatly along reader/writer lines (as our + cache code does), and the lock is held by readers for + significant lengths of time, using these locks can help. They + are slightly slower than the normal locks though, so in practice + <type>rwlock_t</type> is not usually worthwhile. + </para> + </sect1> + + <sect1 id="efficiency-read-copy-update"> + <title>Avoiding Locks: Read Copy Update</title> + + <para> + There is a special method of read/write locking called Read Copy + Update. Using RCU, the readers can avoid taking a lock + altogether: as we expect our cache to be read more often than + updated (otherwise the cache is a waste of time), it is a + candidate for this optimization. + </para> + + <para> + How do we get rid of read locks? Getting rid of read locks + means that writers may be changing the list underneath the + readers. That is actually quite simple: we can read a linked + list while an element is being added if the writer adds the + element very carefully. For example, adding + <symbol>new</symbol> to a single linked list called + <symbol>list</symbol>: + </para> + + <programlisting> + new->next = list->next; + wmb(); + list->next = new; + </programlisting> + + <para> + The <function>wmb()</function> is a write memory barrier. It + ensures that the first operation (setting the new element's + <symbol>next</symbol> pointer) is complete and will be seen by + all CPUs, before the second operation is (putting the new + element into the list). This is important, since modern + compilers and modern CPUs can both reorder instructions unless + told otherwise: we want a reader to either not see the new + element at all, or see the new element with the + <symbol>next</symbol> pointer correctly pointing at the rest of + the list. + </para> + <para> + Fortunately, there is a function to do this for standard + <structname>struct list_head</structname> lists: + <function>list_add_rcu()</function> + (<filename>include/linux/list.h</filename>). + </para> + <para> + Removing an element from the list is even simpler: we replace + the pointer to the old element with a pointer to its successor, + and readers will either see it, or skip over it. + </para> + <programlisting> + list->next = old->next; + </programlisting> + <para> + There is <function>list_del_rcu()</function> + (<filename>include/linux/list.h</filename>) which does this (the + normal version poisons the old object, which we don't want). 
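+ Putting the writer side together (a sketch with made-up names; in
+ our cache the lock is still <symbol>cache_lock</symbol>):
+ <programlisting>
+struct my_node {
+        struct list_head list;
+        /* ... */
+};
+
+/* Writers still serialize against each other with an ordinary lock;
+ * only the readers go lockless. */
+static spinlock_t my_lock = SPIN_LOCK_UNLOCKED;
+static LIST_HEAD(my_list);
+
+static void my_replace(struct my_node *old, struct my_node *new)
+{
+        spin_lock(&my_lock);
+        list_add_rcu(&new->list, &my_list);
+        list_del_rcu(&old->list);
+        spin_unlock(&my_lock);
+        /* "old" cannot be freed yet: see call_rcu() below. */
+}
+ </programlisting>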
+ </para> + <para> + The reader must also be careful: some CPUs can look through the + <symbol>next</symbol> pointer to start reading the contents of + the next element early, but don't realize that the pre-fetched + contents is wrong when the <symbol>next</symbol> pointer changes + underneath them. Once again, there is a + <function>list_for_each_entry_rcu()</function> + (<filename>include/linux/list.h</filename>) to help you. Of + course, writers can just use + <function>list_for_each_entry()</function>, since there cannot + be two simultaneous writers. + </para> + <para> + Our final dilemma is this: when can we actually destroy the + removed element? Remember, a reader might be stepping through + this element in the list right now: it we free this element and + the <symbol>next</symbol> pointer changes, the reader will jump + off into garbage and crash. We need to wait until we know that + all the readers who were traversing the list when we deleted the + element are finished. We use <function>call_rcu()</function> to + register a callback which will actually destroy the object once + the readers are finished. + </para> + <para> + But how does Read Copy Update know when the readers are + finished? The method is this: firstly, the readers always + traverse the list inside + <function>rcu_read_lock()</function>/<function>rcu_read_unlock()</function> + pairs: these simply disable preemption so the reader won't go to + sleep while reading the list. + </para> + <para> + RCU then waits until every other CPU has slept at least once: + since readers cannot sleep, we know that any readers which were + traversing the list during the deletion are finished, and the + callback is triggered. The real Read Copy Update code is a + little more optimized than this, but this is the fundamental + idea. + </para> + +<programlisting> +--- cache.c.perobjectlock 2003-12-11 17:15:03.000000000 +1100 ++++ cache.c.rcupdate 2003-12-11 17:55:14.000000000 +1100 +@@ -1,15 +1,18 @@ + #include <linux/list.h> + #include <linux/slab.h> + #include <linux/string.h> ++#include <linux/rcupdate.h> + #include <asm/semaphore.h> + #include <asm/errno.h> + + struct object + { +- /* These two protected by cache_lock. */ ++ /* This is protected by RCU */ + struct list_head list; + int popularity; + ++ struct rcu_head rcu; ++ + atomic_t refcnt; + + /* Doesn't change once created. */ +@@ -40,7 +43,7 @@ + { + struct object *i; + +- list_for_each_entry(i, &cache, list) { ++ list_for_each_entry_rcu(i, &cache, list) { + if (i->id == id) { + i->popularity++; + return i; +@@ -49,19 +52,25 @@ + return NULL; + } + ++/* Final discard done once we know no readers are looking. 
*/ ++static void cache_delete_rcu(void *arg) ++{ ++ object_put(arg); ++} ++ + /* Must be holding cache_lock */ + static void __cache_delete(struct object *obj) + { + BUG_ON(!obj); +- list_del(&obj->list); +- object_put(obj); ++ list_del_rcu(&obj->list); + cache_num--; ++ call_rcu(&obj->rcu, cache_delete_rcu, obj); + } + + /* Must be holding cache_lock */ + static void __cache_add(struct object *obj) + { +- list_add(&obj->list, &cache); ++ list_add_rcu(&obj->list, &cache); + if (++cache_num > MAX_CACHE_SIZE) { + struct object *i, *outcast = NULL; + list_for_each_entry(i, &cache, list) { +@@ -85,6 +94,7 @@ + obj->popularity = 0; + atomic_set(&obj->refcnt, 1); /* The cache holds a reference */ + spin_lock_init(&obj->lock); ++ INIT_RCU_HEAD(&obj->rcu); + + spin_lock_irqsave(&cache_lock, flags); + __cache_add(obj); +@@ -104,12 +114,11 @@ + struct object *cache_find(int id) + { + struct object *obj; +- unsigned long flags; + +- spin_lock_irqsave(&cache_lock, flags); ++ rcu_read_lock(); + obj = __cache_find(id); + if (obj) + object_get(obj); +- spin_unlock_irqrestore(&cache_lock, flags); ++ rcu_read_unlock(); + return obj; + } +</programlisting> + +<para> +Note that the reader will alter the +<structfield>popularity</structfield> member in +<function>__cache_find()</function>, and now it doesn't hold a lock. +One solution would be to make it an <type>atomic_t</type>, but for +this usage, we don't really care about races: an approximate result is +good enough, so I didn't change it. +</para> + +<para> +The result is that <function>cache_find()</function> requires no +synchronization with any other functions, so is almost as fast on SMP +as it would be on UP. +</para> + +<para> +There is a furthur optimization possible here: remember our original +cache code, where there were no reference counts and the caller simply +held the lock whenever using the object? This is still possible: if +you hold the lock, noone can delete the object, so you don't need to +get and put the reference count. +</para> + +<para> +Now, because the 'read lock' in RCU is simply disabling preemption, a +caller which always has preemption disabled between calling +<function>cache_find()</function> and +<function>object_put()</function> does not need to actually get and +put the reference count: we could expose +<function>__cache_find()</function> by making it non-static, and +such callers could simply call that. +</para> +<para> +The benefit here is that the reference count is not written to: the +object is not altered in any way, which is much faster on SMP +machines due to caching. +</para> + </sect1> + + <sect1 id="per-cpu"> + <title>Per-CPU Data</title> + + <para> + Another technique for avoiding locking which is used fairly + widely is to duplicate information for each CPU. For example, + if you wanted to keep a count of a common condition, you could + use a spin lock and a single counter. Nice and simple. + </para> + + <para> + If that was too slow (it's usually not, but if you've got a + really big machine to test on and can show that it is), you + could instead use a counter for each CPU, then none of them need + an exclusive lock. See <function>DEFINE_PER_CPU()</function>, + <function>get_cpu_var()</function> and + <function>put_cpu_var()</function> + (<filename class="headerfile">include/linux/percpu.h</filename>). 
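+ A minimal sketch of such a counter (the name is made up):
+ <programlisting>
+#include <linux/percpu.h>
+
+static DEFINE_PER_CPU(unsigned long, my_count);
+
+static inline void count_event(void)
+{
+        /* get_cpu_var() disables preemption, so we stay on this CPU */
+        get_cpu_var(my_count)++;
+        put_cpu_var(my_count);
+}
+ </programlisting>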
+ </para> + + <para> + Of particular use for simple per-cpu counters is the + <type>local_t</type> type, and the + <function>cpu_local_inc()</function> and related functions, + which are more efficient than simple code on some architectures + (<filename class="headerfile">include/asm/local.h</filename>). + </para> + + <para> + Note that there is no simple, reliable way of getting an exact + value of such a counter, without introducing more locks. This + is not a problem for some uses. + </para> + </sect1> + + <sect1 id="mostly-hardirq"> + <title>Data Which Mostly Used By An IRQ Handler</title> + + <para> + If data is always accessed from within the same IRQ handler, you + don't need a lock at all: the kernel already guarantees that the + irq handler will not run simultaneously on multiple CPUs. + </para> + <para> + Manfred Spraul points out that you can still do this, even if + the data is very occasionally accessed in user context or + softirqs/tasklets. The irq handler doesn't use a lock, and + all other accesses are done as so: + </para> + +<programlisting> + spin_lock(&lock); + disable_irq(irq); + ... + enable_irq(irq); + spin_unlock(&lock); +</programlisting> + <para> + The <function>disable_irq()</function> prevents the irq handler + from running (and waits for it to finish if it's currently + running on other CPUs). The spinlock prevents any other + accesses happening at the same time. Naturally, this is slower + than just a <function>spin_lock_irq()</function> call, so it + only makes sense if this type of access happens extremely + rarely. + </para> + </sect1> + </chapter> + + <chapter id="sleeping-things"> + <title>What Functions Are Safe To Call From Interrupts?</title> + + <para> + Many functions in the kernel sleep (ie. call schedule()) + directly or indirectly: you can never call them while holding a + spinlock, or with preemption disabled. This also means you need + to be in user context: calling them from an interrupt is illegal. + </para> + + <sect1 id="sleeping"> + <title>Some Functions Which Sleep</title> + + <para> + The most common ones are listed below, but you usually have to + read the code to find out if other calls are safe. If everyone + else who calls it can sleep, you probably need to be able to + sleep, too. In particular, registration and deregistration + functions usually expect to be called from user context, and can + sleep. + </para> + + <itemizedlist> + <listitem> + <para> + Accesses to + <firstterm linkend="gloss-userspace">userspace</firstterm>: + </para> + <itemizedlist> + <listitem> + <para> + <function>copy_from_user()</function> + </para> + </listitem> + <listitem> + <para> + <function>copy_to_user()</function> + </para> + </listitem> + <listitem> + <para> + <function>get_user()</function> + </para> + </listitem> + <listitem> + <para> + <function> put_user()</function> + </para> + </listitem> + </itemizedlist> + </listitem> + + <listitem> + <para> + <function>kmalloc(GFP_KERNEL)</function> + </para> + </listitem> + + <listitem> + <para> + <function>down_interruptible()</function> and + <function>down()</function> + </para> + <para> + There is a <function>down_trylock()</function> which can be + used inside interrupt context, as it will not sleep. + <function>up()</function> will also never sleep. + </para> + </listitem> + </itemizedlist> + </sect1> + + <sect1 id="dont-sleep"> + <title>Some Functions Which Don't Sleep</title> + + <para> + Some functions are safe to call from any context, or holding + almost any lock. 
+ </para> + + <itemizedlist> + <listitem> + <para> + <function>printk()</function> + </para> + </listitem> + <listitem> + <para> + <function>kfree()</function> + </para> + </listitem> + <listitem> + <para> + <function>add_timer()</function> and <function>del_timer()</function> + </para> + </listitem> + </itemizedlist> + </sect1> + </chapter> + + <chapter id="references"> + <title>Further reading</title> + + <itemizedlist> + <listitem> + <para> + <filename>Documentation/spinlocks.txt</filename>: + Linus Torvalds' spinlocking tutorial in the kernel sources. + </para> + </listitem> + + <listitem> + <para> + Unix Systems for Modern Architectures: Symmetric + Multiprocessing and Caching for Kernel Programmers: + </para> + + <para> + Curt Schimmel's very good introduction to kernel level + locking (not written for Linux, but nearly everything + applies). The book is expensive, but really worth every + penny to understand SMP locking. [ISBN: 0201633388] + </para> + </listitem> + </itemizedlist> + </chapter> + + <chapter id="thanks"> + <title>Thanks</title> + + <para> + Thanks to Telsa Gwynne for DocBooking, neatening and adding + style. + </para> + + <para> + Thanks to Martin Pool, Philipp Rumpf, Stephen Rothwell, Paul + Mackerras, Ruedi Aschwanden, Alan Cox, Manfred Spraul, Tim + Waugh, Pete Zaitcev, James Morris, Robert Love, Paul McKenney, + John Ashby for proofreading, correcting, flaming, commenting. + </para> + + <para> + Thanks to the cabal for having no influence on this document. + </para> + </chapter> + + <glossary id="glossary"> + <title>Glossary</title> + + <glossentry id="gloss-preemption"> + <glossterm>preemption</glossterm> + <glossdef> + <para> + Prior to 2.5, or when <symbol>CONFIG_PREEMPT</symbol> is + unset, processes in user context inside the kernel would not + preempt each other (ie. you had that CPU until you have it up, + except for interrupts). With the addition of + <symbol>CONFIG_PREEMPT</symbol> in 2.5.4, this changed: when + in user context, higher priority tasks can "cut in": spinlocks + were changed to disable preemption, even on UP. + </para> + </glossdef> + </glossentry> + + <glossentry id="gloss-bh"> + <glossterm>bh</glossterm> + <glossdef> + <para> + Bottom Half: for historical reasons, functions with + '_bh' in them often now refer to any software interrupt, e.g. + <function>spin_lock_bh()</function> blocks any software interrupt + on the current CPU. Bottom halves are deprecated, and will + eventually be replaced by tasklets. Only one bottom half will be + running at any time. + </para> + </glossdef> + </glossentry> + + <glossentry id="gloss-hwinterrupt"> + <glossterm>Hardware Interrupt / Hardware IRQ</glossterm> + <glossdef> + <para> + Hardware interrupt request. <function>in_irq()</function> returns + <returnvalue>true</returnvalue> in a hardware interrupt handler. + </para> + </glossdef> + </glossentry> + + <glossentry id="gloss-interruptcontext"> + <glossterm>Interrupt Context</glossterm> + <glossdef> + <para> + Not user context: processing a hardware irq or software irq. + Indicated by the <function>in_interrupt()</function> macro + returning <returnvalue>true</returnvalue>. + </para> + </glossdef> + </glossentry> + + <glossentry id="gloss-smp"> + <glossterm><acronym>SMP</acronym></glossterm> + <glossdef> + <para> + Symmetric Multi-Processor: kernels compiled for multiple-CPU + machines. (CONFIG_SMP=y). 
+ </para> + </glossdef> + </glossentry> + + <glossentry id="gloss-softirq"> + <glossterm>Software Interrupt / softirq</glossterm> + <glossdef> + <para> + Software interrupt handler. <function>in_irq()</function> returns + <returnvalue>false</returnvalue>; <function>in_softirq()</function> + returns <returnvalue>true</returnvalue>. Tasklets and softirqs + both fall into the category of 'software interrupts'. + </para> + <para> + Strictly speaking a softirq is one of up to 32 enumerated software + interrupts which can run on multiple CPUs at once. + Sometimes used to refer to tasklets as + well (ie. all software interrupts). + </para> + </glossdef> + </glossentry> + + <glossentry id="gloss-tasklet"> + <glossterm>tasklet</glossterm> + <glossdef> + <para> + A dynamically-registrable software interrupt, + which is guaranteed to only run on one CPU at a time. + </para> + </glossdef> + </glossentry> + + <glossentry id="gloss-timers"> + <glossterm>timer</glossterm> + <glossdef> + <para> + A dynamically-registrable software interrupt, which is run at + (or close to) a given time. When running, it is just like a + tasklet (in fact, they are called from the TIMER_SOFTIRQ). + </para> + </glossdef> + </glossentry> + + <glossentry id="gloss-up"> + <glossterm><acronym>UP</acronym></glossterm> + <glossdef> + <para> + Uni-Processor: Non-SMP. (CONFIG_SMP=n). + </para> + </glossdef> + </glossentry> + + <glossentry id="gloss-usercontext"> + <glossterm>User Context</glossterm> + <glossdef> + <para> + The kernel executing on behalf of a particular process (ie. a + system call or trap) or kernel thread. You can tell which + process with the <symbol>current</symbol> macro.) Not to + be confused with userspace. Can be interrupted by software or + hardware interrupts. + </para> + </glossdef> + </glossentry> + + <glossentry id="gloss-userspace"> + <glossterm>Userspace</glossterm> + <glossdef> + <para> + A process executing its own code outside the kernel. + </para> + </glossdef> + </glossentry> + + </glossary> +</book> + diff --git a/Documentation/DocBook/libata.tmpl b/Documentation/DocBook/libata.tmpl new file mode 100644 index 000000000000..cf2fce7707da --- /dev/null +++ b/Documentation/DocBook/libata.tmpl @@ -0,0 +1,282 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" + "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []> + +<book id="libataDevGuide"> + <bookinfo> + <title>libATA Developer's Guide</title> + + <authorgroup> + <author> + <firstname>Jeff</firstname> + <surname>Garzik</surname> + </author> + </authorgroup> + + <copyright> + <year>2003</year> + <holder>Jeff Garzik</holder> + </copyright> + + <legalnotice> + <para> + The contents of this file are subject to the Open + Software License version 1.1 that can be found at + <ulink url="http://www.opensource.org/licenses/osl-1.1.txt">http://www.opensource.org/licenses/osl-1.1.txt</ulink> and is included herein + by reference. + </para> + + <para> + Alternatively, the contents of this file may be used under the terms + of the GNU General Public License version 2 (the "GPL") as distributed + in the kernel source COPYING file, in which case the provisions of + the GPL are applicable instead of the above. 
If you wish to allow + the use of your version of this file only under the terms of the + GPL and not to allow others to use your version of this file under + the OSL, indicate your decision by deleting the provisions above and + replace them with the notice and other provisions required by the GPL. + If you do not delete the provisions above, a recipient may use your + version of this file under either the OSL or the GPL. + </para> + + </legalnotice> + </bookinfo> + +<toc></toc> + + <chapter id="libataThanks"> + <title>Thanks</title> + <para> + The bulk of the ATA knowledge comes thanks to long conversations with + Andre Hedrick (www.linux-ide.org). + </para> + <para> + Thanks to Alan Cox for pointing out similarities + between SATA and SCSI, and in general for motivation to hack on + libata. + </para> + <para> + libata's device detection + method, ata_pio_devchk, and in general all the early probing was + based on extensive study of Hale Landis's probe/reset code in his + ATADRVR driver (www.ata-atapi.com). + </para> + </chapter> + + <chapter id="libataDriverApi"> + <title>libata Driver API</title> + <sect1> + <title>struct ata_port_operations</title> + + <programlisting> +void (*port_disable) (struct ata_port *); + </programlisting> + + <para> + Called from ata_bus_probe() and ata_bus_reset() error paths, + as well as when unregistering from the SCSI module (rmmod, hot + unplug). + </para> + + <programlisting> +void (*dev_config) (struct ata_port *, struct ata_device *); + </programlisting> + + <para> + Called after IDENTIFY [PACKET] DEVICE is issued to each device + found. Typically used to apply device-specific fixups prior to + issue of SET FEATURES - XFER MODE, and prior to operation. + </para> + + <programlisting> +void (*set_piomode) (struct ata_port *, struct ata_device *); +void (*set_dmamode) (struct ata_port *, struct ata_device *); +void (*post_set_mode) (struct ata_port *ap); + </programlisting> + + <para> + Hooks called prior to the issue of SET FEATURES - XFER MODE + command. dev->pio_mode is guaranteed to be valid when + ->set_piomode() is called, and dev->dma_mode is guaranteed to be + valid when ->set_dmamode() is called. ->post_set_mode() is + called unconditionally, after the SET FEATURES - XFER MODE + command completes successfully. + </para> + + <para> + ->set_piomode() is always called (if present), but + ->set_dma_mode() is only called if DMA is possible. + </para> + + <programlisting> +void (*tf_load) (struct ata_port *ap, struct ata_taskfile *tf); +void (*tf_read) (struct ata_port *ap, struct ata_taskfile *tf); + </programlisting> + + <para> + ->tf_load() is called to load the given taskfile into hardware + registers / DMA buffers. ->tf_read() is called to read the + hardware registers / DMA buffers, to obtain the current set of + taskfile register values. + </para> + + <programlisting> +void (*exec_command)(struct ata_port *ap, struct ata_taskfile *tf); + </programlisting> + + <para> + causes an ATA command, previously loaded with + ->tf_load(), to be initiated in hardware. + </para> + + <programlisting> +u8 (*check_status)(struct ata_port *ap); +void (*dev_select)(struct ata_port *ap, unsigned int device); + </programlisting> + + <para> + Reads the Status ATA shadow register from hardware. On some + hardware, this has the side effect of clearing the interrupt + condition. 
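+ As an illustration only, a PIO-style implementation might look like
+ the sketch below; the register offset and the use of
+ ap->private_data as an ioremap'ed register base are assumptions of
+ this example, not part of the libata API.
+ <programlisting>
+/* Hypothetical sketch: read the ATA Status register of a memory-mapped
+ * port. BOARD_ATA_STATUS is a made-up register offset; the driver is
+ * assumed to have stored its ioremap'ed base in ap->private_data. */
+static u8 board_check_status (struct ata_port *ap)
+{
+	void __iomem *base = ap->private_data;
+
+	return readb (base + BOARD_ATA_STATUS);
+}
+ </programlisting>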
+ </para> + + <programlisting> +void (*dev_select)(struct ata_port *ap, unsigned int device); + </programlisting> + + <para> + Issues the low-level hardware command(s) that causes one of N + hardware devices to be considered 'selected' (active and + available for use) on the ATA bus. + </para> + + <programlisting> +void (*phy_reset) (struct ata_port *ap); + </programlisting> + + <para> + The very first step in the probe phase. Actions vary depending + on the bus type, typically. After waking up the device and probing + for device presence (PATA and SATA), typically a soft reset + (SRST) will be performed. Drivers typically use the helper + functions ata_bus_reset() or sata_phy_reset() for this hook. + </para> + + <programlisting> +void (*bmdma_setup) (struct ata_queued_cmd *qc); +void (*bmdma_start) (struct ata_queued_cmd *qc); + </programlisting> + + <para> + When setting up an IDE BMDMA transaction, these hooks arm + (->bmdma_setup) and fire (->bmdma_start) the hardware's DMA + engine. + </para> + + <programlisting> +void (*qc_prep) (struct ata_queued_cmd *qc); +int (*qc_issue) (struct ata_queued_cmd *qc); + </programlisting> + + <para> + Higher-level hooks, these two hooks can potentially supercede + several of the above taskfile/DMA engine hooks. ->qc_prep is + called after the buffers have been DMA-mapped, and is typically + used to populate the hardware's DMA scatter-gather table. + Most drivers use the standard ata_qc_prep() helper function, but + more advanced drivers roll their own. + </para> + <para> + ->qc_issue is used to make a command active, once the hardware + and S/G tables have been prepared. IDE BMDMA drivers use the + helper function ata_qc_issue_prot() for taskfile protocol-based + dispatch. More advanced drivers roll their own ->qc_issue + implementation, using this as the "issue new ATA command to + hardware" hook. + </para> + + <programlisting> +void (*eng_timeout) (struct ata_port *ap); + </programlisting> + + <para> + This is a high level error handling function, called from the + error handling thread, when a command times out. + </para> + + <programlisting> +irqreturn_t (*irq_handler)(int, void *, struct pt_regs *); +void (*irq_clear) (struct ata_port *); + </programlisting> + + <para> + ->irq_handler is the interrupt handling routine registered with + the system, by libata. ->irq_clear is called during probe just + before the interrupt handler is registered, to be sure hardware + is quiet. + </para> + + <programlisting> +u32 (*scr_read) (struct ata_port *ap, unsigned int sc_reg); +void (*scr_write) (struct ata_port *ap, unsigned int sc_reg, + u32 val); + </programlisting> + + <para> + Read and write standard SATA phy registers. Currently only used + if ->phy_reset hook called the sata_phy_reset() helper function. + </para> + + <programlisting> +int (*port_start) (struct ata_port *ap); +void (*port_stop) (struct ata_port *ap); +void (*host_stop) (struct ata_host_set *host_set); + </programlisting> + + <para> + ->port_start() is called just after the data structures for each + port are initialized. Typically this is used to alloc per-port + DMA buffers / tables / rings, enable DMA engines, and similar + tasks. + </para> + <para> + ->host_stop() is called when the rmmod or hot unplug process + begins. The hook must stop all hardware interrupts, DMA + engines, etc. + </para> + <para> + ->port_stop() is called after ->host_stop(). It's sole function + is to release DMA/memory resources, now that they are no longer + actively being used. 
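+ A minimal sketch, assuming that ->port_start() allocated a single DMA
+ table with dma_alloc_coherent() and stored the pointers in a private
+ per-port structure, might look like this; the structure and constant
+ names are illustrative, not part of the libata API:
+ <programlisting>
+/* Hypothetical private data, kmalloc'ed and filled in ->port_start() */
+struct board_port_priv {
+	void		*tbl;		/* CPU address of the DMA table */
+	dma_addr_t	tbl_dma;	/* bus address of the DMA table */
+};
+
+static void board_port_stop (struct ata_port *ap)
+{
+	struct board_port_priv *pp = ap->private_data;
+
+	dma_free_coherent(ap->host_set->dev, BOARD_DMA_TBL_SZ,
+			  pp->tbl, pp->tbl_dma);
+	kfree(pp);
+}
+ </programlisting>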
+ </para> + + </sect1> + </chapter> + + <chapter id="libataExt"> + <title>libata Library</title> +!Edrivers/scsi/libata-core.c + </chapter> + + <chapter id="libataInt"> + <title>libata Core Internals</title> +!Idrivers/scsi/libata-core.c + </chapter> + + <chapter id="libataScsiInt"> + <title>libata SCSI translation/emulation</title> +!Edrivers/scsi/libata-scsi.c +!Idrivers/scsi/libata-scsi.c + </chapter> + + <chapter id="PiixInt"> + <title>ata_piix Internals</title> +!Idrivers/scsi/ata_piix.c + </chapter> + + <chapter id="SILInt"> + <title>sata_sil Internals</title> +!Idrivers/scsi/sata_sil.c + </chapter> + +</book> diff --git a/Documentation/DocBook/librs.tmpl b/Documentation/DocBook/librs.tmpl new file mode 100644 index 000000000000..3ff39bafc00e --- /dev/null +++ b/Documentation/DocBook/librs.tmpl @@ -0,0 +1,289 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" + "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []> + +<book id="Reed-Solomon-Library-Guide"> + <bookinfo> + <title>Reed-Solomon Library Programming Interface</title> + + <authorgroup> + <author> + <firstname>Thomas</firstname> + <surname>Gleixner</surname> + <affiliation> + <address> + <email>tglx@linutronix.de</email> + </address> + </affiliation> + </author> + </authorgroup> + + <copyright> + <year>2004</year> + <holder>Thomas Gleixner</holder> + </copyright> + + <legalnotice> + <para> + This documentation is free software; you can redistribute + it and/or modify it under the terms of the GNU General Public + License version 2 as published by the Free Software Foundation. + </para> + + <para> + This program is distributed in the hope that it will be + useful, but WITHOUT ANY WARRANTY; without even the implied + warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + See the GNU General Public License for more details. + </para> + + <para> + You should have received a copy of the GNU General Public + License along with this program; if not, write to the Free + Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, + MA 02111-1307 USA + </para> + + <para> + For more details see the file COPYING in the source + distribution of Linux. + </para> + </legalnotice> + </bookinfo> + +<toc></toc> + + <chapter id="intro"> + <title>Introduction</title> + <para> + The generic Reed-Solomon Library provides encoding, decoding + and error correction functions. + </para> + <para> + Reed-Solomon codes are used in communication and storage + applications to ensure data integrity. + </para> + <para> + This documentation is provided for developers who want to utilize + the functions provided by the library. + </para> + </chapter> + + <chapter id="bugs"> + <title>Known Bugs And Assumptions</title> + <para> + None. + </para> + </chapter> + + <chapter id="usage"> + <title>Usage</title> + <para> + This chapter provides examples how to use the library. + </para> + <sect1> + <title>Initializing</title> + <para> + The init function init_rs returns a pointer to a + rs decoder structure, which holds the necessary + information for encoding, decoding and error correction + with the given polynomial. It either uses an existing + matching decoder or creates a new one. On creation all + the lookup tables for fast en/decoding are created. + The function may take a while, so make sure not to + call it in critical code paths. 
+ </para> + <programlisting> +/* the Reed Solomon control structure */ +static struct rs_control *rs_decoder; + +/* Symbolsize is 10 (bits) + * Primitve polynomial is x^10+x^3+1 + * first consecutive root is 0 + * primitve element to generate roots = 1 + * generator polinomial degree (number of roots) = 6 + */ +rs_decoder = init_rs (10, 0x409, 0, 1, 6); + </programlisting> + </sect1> + <sect1> + <title>Encoding</title> + <para> + The encoder calculates the Reed-Solomon code over + the given data length and stores the result in + the parity buffer. Note that the parity buffer must + be initialized before calling the encoder. + </para> + <para> + The expanded data can be inverted on the fly by + providing a non zero inversion mask. The expanded data is + XOR'ed with the mask. This is used e.g. for FLASH + ECC, where the all 0xFF is inverted to an all 0x00. + The Reed-Solomon code for all 0x00 is all 0x00. The + code is inverted before storing to FLASH so it is 0xFF + too. This prevent's that reading from an erased FLASH + results in ECC errors. + </para> + <para> + The databytes are expanded to the given symbol size + on the fly. There is no support for encoding continuous + bitstreams with a symbol size != 8 at the moment. If + it is necessary it should be not a big deal to implement + such functionality. + </para> + <programlisting> +/* Parity buffer. Size = number of roots */ +uint16_t par[6]; +/* Initialize the parity buffer */ +memset(par, 0, sizeof(par)); +/* Encode 512 byte in data8. Store parity in buffer par */ +encode_rs8 (rs_decoder, data8, 512, par, 0); + </programlisting> + </sect1> + <sect1> + <title>Decoding</title> + <para> + The decoder calculates the syndrome over + the given data length and the received parity symbols + and corrects errors in the data. + </para> + <para> + If a syndrome is available from a hardware decoder + then the syndrome calculation is skipped. + </para> + <para> + The correction of the data buffer can be suppressed + by providing a correction pattern buffer and an error + location buffer to the decoder. The decoder stores the + calculated error location and the correction bitmask + in the given buffers. This is useful for hardware + decoders which use a weird bit ordering scheme. + </para> + <para> + The databytes are expanded to the given symbol size + on the fly. There is no support for decoding continuous + bitstreams with a symbolsize != 8 at the moment. If + it is necessary it should be not a big deal to implement + such functionality. + </para> + + <sect2> + <title> + Decoding with syndrome calculation, direct data correction + </title> + <programlisting> +/* Parity buffer. Size = number of roots */ +uint16_t par[6]; +uint8_t data[512]; +int numerr; +/* Receive data */ +..... +/* Receive parity */ +..... +/* Decode 512 byte in data8.*/ +numerr = decode_rs8 (rs_decoder, data8, par, 512, NULL, 0, NULL, 0, NULL); + </programlisting> + </sect2> + + <sect2> + <title> + Decoding with syndrome given by hardware decoder, direct data correction + </title> + <programlisting> +/* Parity buffer. Size = number of roots */ +uint16_t par[6], syn[6]; +uint8_t data[512]; +int numerr; +/* Receive data */ +..... +/* Receive parity */ +..... +/* Get syndrome from hardware decoder */ +..... +/* Decode 512 byte in data8.*/ +numerr = decode_rs8 (rs_decoder, data8, par, 512, syn, 0, NULL, 0, NULL); + </programlisting> + </sect2> + + <sect2> + <title> + Decoding with syndrome given by hardware decoder, no direct data correction. 
+ </title> + <para> + Note: It's not necessary to give data and received parity to the decoder. + </para> + <programlisting> +/* Parity buffer. Size = number of roots */ +uint16_t par[6], syn[6], corr[8]; +uint8_t data[512]; +int numerr, errpos[8]; +/* Receive data */ +..... +/* Receive parity */ +..... +/* Get syndrome from hardware decoder */ +..... +/* Decode 512 byte in data8.*/ +numerr = decode_rs8 (rs_decoder, NULL, NULL, 512, syn, 0, errpos, 0, corr); +for (i = 0; i < numerr; i++) { + do_error_correction_in_your_buffer(errpos[i], corr[i]); +} + </programlisting> + </sect2> + </sect1> + <sect1> + <title>Cleanup</title> + <para> + The function free_rs frees the allocated resources, + if the caller is the last user of the decoder. + </para> + <programlisting> +/* Release resources */ +free_rs(rs_decoder); + </programlisting> + </sect1> + + </chapter> + + <chapter id="structs"> + <title>Structures</title> + <para> + This chapter contains the autogenerated documentation of the structures which are + used in the Reed-Solomon Library and are relevant for a developer. + </para> +!Iinclude/linux/rslib.h + </chapter> + + <chapter id="pubfunctions"> + <title>Public Functions Provided</title> + <para> + This chapter contains the autogenerated documentation of the Reed-Solomon functions + which are exported. + </para> +!Elib/reed_solomon/reed_solomon.c + </chapter> + + <chapter id="credits"> + <title>Credits</title> + <para> + The library code for encoding and decoding was written by Phil Karn. + </para> + <programlisting> + Copyright 2002, Phil Karn, KA9Q + May be used under the terms of the GNU General Public License (GPL) + </programlisting> + <para> + The wrapper functions and interfaces are written by Thomas Gleixner + </para> + <para> + Many users have provided bugfixes, improvements and helping hands for testing. + Thanks a lot. + </para> + <para> + The following people have contributed to this document: + </para> + <para> + Thomas Gleixner<email>tglx@linutronix.de</email> + </para> + </chapter> +</book> diff --git a/Documentation/DocBook/lsm.tmpl b/Documentation/DocBook/lsm.tmpl new file mode 100644 index 000000000000..f63822195871 --- /dev/null +++ b/Documentation/DocBook/lsm.tmpl @@ -0,0 +1,265 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" + "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []> + +<article class="whitepaper" id="LinuxSecurityModule" lang="en"> + <articleinfo> + <title>Linux Security Modules: General Security Hooks for Linux</title> + <authorgroup> + <author> + <firstname>Stephen</firstname> + <surname>Smalley</surname> + <affiliation> + <orgname>NAI Labs</orgname> + <address><email>ssmalley@nai.com</email></address> + </affiliation> + </author> + <author> + <firstname>Timothy</firstname> + <surname>Fraser</surname> + <affiliation> + <orgname>NAI Labs</orgname> + <address><email>tfraser@nai.com</email></address> + </affiliation> + </author> + <author> + <firstname>Chris</firstname> + <surname>Vance</surname> + <affiliation> + <orgname>NAI Labs</orgname> + <address><email>cvance@nai.com</email></address> + </affiliation> + </author> + </authorgroup> + </articleinfo> + +<sect1><title>Introduction</title> + +<para> +In March 2001, the National Security Agency (NSA) gave a presentation +about Security-Enhanced Linux (SELinux) at the 2.5 Linux Kernel +Summit. 
SELinux is an implementation of flexible and fine-grained +nondiscretionary access controls in the Linux kernel, originally +implemented as its own particular kernel patch. Several other +security projects (e.g. RSBAC, Medusa) have also developed flexible +access control architectures for the Linux kernel, and various +projects have developed particular access control models for Linux +(e.g. LIDS, DTE, SubDomain). Each project has developed and +maintained its own kernel patch to support its security needs. +</para> + +<para> +In response to the NSA presentation, Linus Torvalds made a set of +remarks that described a security framework he would be willing to +consider for inclusion in the mainstream Linux kernel. He described a +general framework that would provide a set of security hooks to +control operations on kernel objects and a set of opaque security +fields in kernel data structures for maintaining security attributes. +This framework could then be used by loadable kernel modules to +implement any desired model of security. Linus also suggested the +possibility of migrating the Linux capabilities code into such a +module. +</para> + +<para> +The Linux Security Modules (LSM) project was started by WireX to +develop such a framework. LSM is a joint development effort by +several security projects, including Immunix, SELinux, SGI and Janus, +and several individuals, including Greg Kroah-Hartman and James +Morris, to develop a Linux kernel patch that implements this +framework. The patch is currently tracking the 2.4 series and is +targeted for integration into the 2.5 development series. This +technical report provides an overview of the framework and the example +capabilities security module provided by the LSM kernel patch. +</para> + +</sect1> + +<sect1 id="framework"><title>LSM Framework</title> + +<para> +The LSM kernel patch provides a general kernel framework to support +security modules. In particular, the LSM framework is primarily +focused on supporting access control modules, although future +development is likely to address other security needs such as +auditing. By itself, the framework does not provide any additional +security; it merely provides the infrastructure to support security +modules. The LSM kernel patch also moves most of the capabilities +logic into an optional security module, with the system defaulting +to the traditional superuser logic. This capabilities module +is discussed further in <xref linkend="cap"/>. +</para> + +<para> +The LSM kernel patch adds security fields to kernel data structures +and inserts calls to hook functions at critical points in the kernel +code to manage the security fields and to perform access control. It +also adds functions for registering and unregistering security +modules, and adds a general <function>security</function> system call +to support new system calls for security-aware applications. +</para> + +<para> +The LSM security fields are simply <type>void*</type> pointers. For +process and program execution security information, security fields +were added to <structname>struct task_struct</structname> and +<structname>struct linux_binprm</structname>. For filesystem security +information, a security field was added to +<structname>struct super_block</structname>. For pipe, file, and socket +security information, security fields were added to +<structname>struct inode</structname> and +<structname>struct file</structname>. 
For packet and network device security +information, security fields were added to +<structname>struct sk_buff</structname> and +<structname>struct net_device</structname>. For System V IPC security +information, security fields were added to +<structname>struct kern_ipc_perm</structname> and +<structname>struct msg_msg</structname>; additionally, the definitions +for <structname>struct msg_msg</structname>, <structname>struct +msg_queue</structname>, and <structname>struct +shmid_kernel</structname> were moved to header files +(<filename>include/linux/msg.h</filename> and +<filename>include/linux/shm.h</filename> as appropriate) to allow +the security modules to use these definitions. +</para> + +<para> +Each LSM hook is a function pointer in a global table, +security_ops. This table is a +<structname>security_operations</structname> structure as defined by +<filename>include/linux/security.h</filename>. Detailed documentation +for each hook is included in this header file. At present, this +structure consists of a collection of substructures that group related +hooks based on the kernel object (e.g. task, inode, file, sk_buff, +etc) as well as some top-level hook function pointers for system +operations. This structure is likely to be flattened in the future +for performance. The placement of the hook calls in the kernel code +is described by the "called:" lines in the per-hook documentation in +the header file. The hook calls can also be easily found in the +kernel code by looking for the string "security_ops->". + +</para> + +<para> +Linus mentioned per-process security hooks in his original remarks as a +possible alternative to global security hooks. However, if LSM were +to start from the perspective of per-process hooks, then the base +framework would have to deal with how to handle operations that +involve multiple processes (e.g. kill), since each process might have +its own hook for controlling the operation. This would require a +general mechanism for composing hooks in the base framework. +Additionally, LSM would still need global hooks for operations that +have no process context (e.g. network input operations). +Consequently, LSM provides global security hooks, but a security +module is free to implement per-process hooks (where that makes sense) +by storing a security_ops table in each process' security field and +then invoking these per-process hooks from the global hooks. +The problem of composition is thus deferred to the module. +</para> + +<para> +The global security_ops table is initialized to a set of hook +functions provided by a dummy security module that provides +traditional superuser logic. A <function>register_security</function> +function (in <filename>security/security.c</filename>) is provided to +allow a security module to set security_ops to refer to its own hook +functions, and an <function>unregister_security</function> function is +provided to revert security_ops to the dummy module hooks. This +mechanism is used to set the primary security module, which is +responsible for making the final decision for each hook. +</para> + +<para> +LSM also provides a simple mechanism for stacking additional security +modules with the primary security module. 
It defines +<function>register_security</function> and +<function>unregister_security</function> hooks in the +<structname>security_operations</structname> structure and provides +<function>mod_reg_security</function> and +<function>mod_unreg_security</function> functions that invoke these +hooks after performing some sanity checking. A security module can +call these functions in order to stack with other modules. However, +the actual details of how this stacking is handled are deferred to the +module, which can implement these hooks in any way it wishes +(including always returning an error if it does not wish to support +stacking). In this manner, LSM again defers the problem of +composition to the module. +</para> + +<para> +Although the LSM hooks are organized into substructures based on +kernel object, all of the hooks can be viewed as falling into two +major categories: hooks that are used to manage the security fields +and hooks that are used to perform access control. Examples of the +first category of hooks include the +<function>alloc_security</function> and +<function>free_security</function> hooks defined for each kernel data +structure that has a security field. These hooks are used to allocate +and free security structures for kernel objects. The first category +of hooks also includes hooks that set information in the security +field after allocation, such as the <function>post_lookup</function> +hook in <structname>struct inode_security_ops</structname>. This hook +is used to set security information for inodes after successful lookup +operations. An example of the second category of hooks is the +<function>permission</function> hook in +<structname>struct inode_security_ops</structname>. This hook checks +permission when accessing an inode. +</para> + +</sect1> + +<sect1 id="cap"><title>LSM Capabilities Module</title> + +<para> +The LSM kernel patch moves most of the existing POSIX.1e capabilities +logic into an optional security module stored in the file +<filename>security/capability.c</filename>. This change allows +users who do not want to use capabilities to omit this code entirely +from their kernel, instead using the dummy module for traditional +superuser logic or any other module that they desire. This change +also allows the developers of the capabilities logic to maintain and +enhance their code more freely, without needing to integrate patches +back into the base kernel. +</para> + +<para> +In addition to moving the capabilities logic, the LSM kernel patch +could move the capability-related fields from the kernel data +structures into the new security fields managed by the security +modules. However, at present, the LSM kernel patch leaves the +capability fields in the kernel data structures. In his original +remarks, Linus suggested that this might be preferable so that other +security modules can be easily stacked with the capabilities module +without needing to chain multiple security structures on the security field. +It also avoids imposing extra overhead on the capabilities module +to manage the security fields. However, the LSM framework could +certainly support such a move if it is determined to be desirable, +with only a few additional changes described below. 
+</para> + +<para> +At present, the capabilities logic for computing process capabilities +on <function>execve</function> and <function>set*uid</function>, +checking capabilities for a particular process, saving and checking +capabilities for netlink messages, and handling the +<function>capget</function> and <function>capset</function> system +calls have been moved into the capabilities module. There are still a +few locations in the base kernel where capability-related fields are +directly examined or modified, but the current version of the LSM +patch does allow a security module to completely replace the +assignment and testing of capabilities. These few locations would +need to be changed if the capability-related fields were moved into +the security field. The following is a list of known locations that +still perform such direct examination or modification of +capability-related fields: +<itemizedlist> +<listitem><para><filename>fs/open.c</filename>:<function>sys_access</function></para></listitem> +<listitem><para><filename>fs/lockd/host.c</filename>:<function>nlm_bind_host</function></para></listitem> +<listitem><para><filename>fs/nfsd/auth.c</filename>:<function>nfsd_setuser</function></para></listitem> +<listitem><para><filename>fs/proc/array.c</filename>:<function>task_cap</function></para></listitem> +</itemizedlist> +</para> + +</sect1> + +</article> diff --git a/Documentation/DocBook/man/Makefile b/Documentation/DocBook/man/Makefile new file mode 100644 index 000000000000..4fb7ea0f7ac8 --- /dev/null +++ b/Documentation/DocBook/man/Makefile @@ -0,0 +1,3 @@ +# Rules are put in Documentation/DocBook + +clean-files := *.9.gz *.sgml manpage.links manpage.refs diff --git a/Documentation/DocBook/mcabook.tmpl b/Documentation/DocBook/mcabook.tmpl new file mode 100644 index 000000000000..4367f4642f3d --- /dev/null +++ b/Documentation/DocBook/mcabook.tmpl @@ -0,0 +1,107 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" + "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []> + +<book id="MCAGuide"> + <bookinfo> + <title>MCA Driver Programming Interface</title> + + <authorgroup> + <author> + <firstname>Alan</firstname> + <surname>Cox</surname> + <affiliation> + <address> + <email>alan@redhat.com</email> + </address> + </affiliation> + </author> + <author> + <firstname>David</firstname> + <surname>Weinehall</surname> + </author> + <author> + <firstname>Chris</firstname> + <surname>Beauregard</surname> + </author> + </authorgroup> + + <copyright> + <year>2000</year> + <holder>Alan Cox</holder> + <holder>David Weinehall</holder> + <holder>Chris Beauregard</holder> + </copyright> + + <legalnotice> + <para> + This documentation is free software; you can redistribute + it and/or modify it under the terms of the GNU General Public + License as published by the Free Software Foundation; either + version 2 of the License, or (at your option) any later + version. + </para> + + <para> + This program is distributed in the hope that it will be + useful, but WITHOUT ANY WARRANTY; without even the implied + warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + See the GNU General Public License for more details. 
+ </para> + + <para> + You should have received a copy of the GNU General Public + License along with this program; if not, write to the Free + Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, + MA 02111-1307 USA + </para> + + <para> + For more details see the file COPYING in the source + distribution of Linux. + </para> + </legalnotice> + </bookinfo> + +<toc></toc> + + <chapter id="intro"> + <title>Introduction</title> + <para> + The MCA bus functions provide a generalised interface to find MCA + bus cards, to claim them for a driver, and to read and manipulate POS + registers without being aware of the motherboard internals or + certain deep magic specific to onboard devices. + </para> + <para> + The basic interface to the MCA bus devices is the slot. Each slot + is numbered and virtual slot numbers are assigned to the internal + devices. Using a pci_dev as other busses do does not really make + sense in the MCA context as the MCA bus resources require card + specific interpretation. + </para> + <para> + Finally the MCA bus functions provide a parallel set of DMA + functions mimicing the ISA bus DMA functions as closely as possible, + although also supporting the additional DMA functionality on the + MCA bus controllers. + </para> + </chapter> + <chapter id="bugs"> + <title>Known Bugs And Assumptions</title> + <para> + None. + </para> + </chapter> + + <chapter id="pubfunctions"> + <title>Public Functions Provided</title> +!Earch/i386/kernel/mca.c + </chapter> + + <chapter id="dmafunctions"> + <title>DMA Functions Provided</title> +!Iinclude/asm-i386/mca_dma.h + </chapter> + +</book> diff --git a/Documentation/DocBook/mtdnand.tmpl b/Documentation/DocBook/mtdnand.tmpl new file mode 100644 index 000000000000..6e463d0db266 --- /dev/null +++ b/Documentation/DocBook/mtdnand.tmpl @@ -0,0 +1,1320 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" + "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []> + +<book id="MTD-NAND-Guide"> + <bookinfo> + <title>MTD NAND Driver Programming Interface</title> + + <authorgroup> + <author> + <firstname>Thomas</firstname> + <surname>Gleixner</surname> + <affiliation> + <address> + <email>tglx@linutronix.de</email> + </address> + </affiliation> + </author> + </authorgroup> + + <copyright> + <year>2004</year> + <holder>Thomas Gleixner</holder> + </copyright> + + <legalnotice> + <para> + This documentation is free software; you can redistribute + it and/or modify it under the terms of the GNU General Public + License version 2 as published by the Free Software Foundation. + </para> + + <para> + This program is distributed in the hope that it will be + useful, but WITHOUT ANY WARRANTY; without even the implied + warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + See the GNU General Public License for more details. + </para> + + <para> + You should have received a copy of the GNU General Public + License along with this program; if not, write to the Free + Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, + MA 02111-1307 USA + </para> + + <para> + For more details see the file COPYING in the source + distribution of Linux. + </para> + </legalnotice> + </bookinfo> + +<toc></toc> + + <chapter id="intro"> + <title>Introduction</title> + <para> + The generic NAND driver supports almost all NAND and AG-AND based + chips and connects them to the Memory Technology Devices (MTD) + subsystem of the Linux Kernel. 
+ </para> + <para> + This documentation is provided for developers who want to implement + board drivers or filesystem drivers suitable for NAND devices. + </para> + </chapter> + + <chapter id="bugs"> + <title>Known Bugs And Assumptions</title> + <para> + None. + </para> + </chapter> + + <chapter id="dochints"> + <title>Documentation hints</title> + <para> + The function and structure docs are autogenerated. Each function and + struct member has a short description which is marked with an [XXX] identifier. + The following chapters explain the meaning of those identifiers. + </para> + <sect1> + <title>Function identifiers [XXX]</title> + <para> + The functions are marked with [XXX] identifiers in the short + comment. The identifiers explain the usage and scope of the + functions. Following identifiers are used: + </para> + <itemizedlist> + <listitem><para> + [MTD Interface]</para><para> + These functions provide the interface to the MTD kernel API. + They are not replacable and provide functionality + which is complete hardware independent. + </para></listitem> + <listitem><para> + [NAND Interface]</para><para> + These functions are exported and provide the interface to the NAND kernel API. + </para></listitem> + <listitem><para> + [GENERIC]</para><para> + Generic functions are not replacable and provide functionality + which is complete hardware independent. + </para></listitem> + <listitem><para> + [DEFAULT]</para><para> + Default functions provide hardware related functionality which is suitable + for most of the implementations. These functions can be replaced by the + board driver if neccecary. Those functions are called via pointers in the + NAND chip description structure. The board driver can set the functions which + should be replaced by board dependend functions before calling nand_scan(). + If the function pointer is NULL on entry to nand_scan() then the pointer + is set to the default function which is suitable for the detected chip type. + </para></listitem> + </itemizedlist> + </sect1> + <sect1> + <title>Struct member identifiers [XXX]</title> + <para> + The struct members are marked with [XXX] identifiers in the + comment. The identifiers explain the usage and scope of the + members. Following identifiers are used: + </para> + <itemizedlist> + <listitem><para> + [INTERN]</para><para> + These members are for NAND driver internal use only and must not be + modified. Most of these values are calculated from the chip geometry + information which is evaluated during nand_scan(). + </para></listitem> + <listitem><para> + [REPLACEABLE]</para><para> + Replaceable members hold hardware related functions which can be + provided by the board driver. The board driver can set the functions which + should be replaced by board dependend functions before calling nand_scan(). + If the function pointer is NULL on entry to nand_scan() then the pointer + is set to the default function which is suitable for the detected chip type. + </para></listitem> + <listitem><para> + [BOARDSPECIFIC]</para><para> + Board specific members hold hardware related information which must + be provided by the board driver. The board driver must set the function + pointers and datafields before calling nand_scan(). + </para></listitem> + <listitem><para> + [OPTIONAL]</para><para> + Optional members can hold information relevant for the board driver. The + generic NAND driver code does not use this information. 
+ </para></listitem> + </itemizedlist> + </sect1> + </chapter> + + <chapter id="basicboarddriver"> + <title>Basic board driver</title> + <para> + For most boards it will be sufficient to provide just the + basic functions and fill out some really board dependend + members in the nand chip description structure. + See drivers/mtd/nand/skeleton for reference. + </para> + <sect1> + <title>Basic defines</title> + <para> + At least you have to provide a mtd structure and + a storage for the ioremap'ed chip address. + You can allocate the mtd structure using kmalloc + or you can allocate it statically. + In case of static allocation you have to allocate + a nand_chip structure too. + </para> + <para> + Kmalloc based example + </para> + <programlisting> +static struct mtd_info *board_mtd; +static unsigned long baseaddr; + </programlisting> + <para> + Static example + </para> + <programlisting> +static struct mtd_info board_mtd; +static struct nand_chip board_chip; +static unsigned long baseaddr; + </programlisting> + </sect1> + <sect1> + <title>Partition defines</title> + <para> + If you want to divide your device into parititions, then + enable the configuration switch CONFIG_MTD_PARITIONS and define + a paritioning scheme suitable to your board. + </para> + <programlisting> +#define NUM_PARTITIONS 2 +static struct mtd_partition partition_info[] = { + { .name = "Flash partition 1", + .offset = 0, + .size = 8 * 1024 * 1024 }, + { .name = "Flash partition 2", + .offset = MTDPART_OFS_NEXT, + .size = MTDPART_SIZ_FULL }, +}; + </programlisting> + </sect1> + <sect1> + <title>Hardware control function</title> + <para> + The hardware control function provides access to the + control pins of the NAND chip(s). + The access can be done by GPIO pins or by address lines. + If you use address lines, make sure that the timing + requirements are met. + </para> + <para> + <emphasis>GPIO based example</emphasis> + </para> + <programlisting> +static void board_hwcontrol(struct mtd_info *mtd, int cmd) +{ + switch(cmd){ + case NAND_CTL_SETCLE: /* Set CLE pin high */ break; + case NAND_CTL_CLRCLE: /* Set CLE pin low */ break; + case NAND_CTL_SETALE: /* Set ALE pin high */ break; + case NAND_CTL_CLRALE: /* Set ALE pin low */ break; + case NAND_CTL_SETNCE: /* Set nCE pin low */ break; + case NAND_CTL_CLRNCE: /* Set nCE pin high */ break; + } +} + </programlisting> + <para> + <emphasis>Address lines based example.</emphasis> It's assumed that the + nCE pin is driven by a chip select decoder. + </para> + <programlisting> +static void board_hwcontrol(struct mtd_info *mtd, int cmd) +{ + struct nand_chip *this = (struct nand_chip *) mtd->priv; + switch(cmd){ + case NAND_CTL_SETCLE: this->IO_ADDR_W |= CLE_ADRR_BIT; break; + case NAND_CTL_CLRCLE: this->IO_ADDR_W &= ~CLE_ADRR_BIT; break; + case NAND_CTL_SETALE: this->IO_ADDR_W |= ALE_ADRR_BIT; break; + case NAND_CTL_CLRALE: this->IO_ADDR_W &= ~ALE_ADRR_BIT; break; + } +} + </programlisting> + </sect1> + <sect1> + <title>Device ready function</title> + <para> + If the hardware interface has the ready busy pin of the NAND chip connected to a + GPIO or other accesible I/O pin, this function is used to read back the state of the + pin. The function has no arguments and should return 0, if the device is busy (R/B pin + is low) and 1, if the device is ready (R/B pin is high). + If the hardware interface does not give access to the ready busy pin, then + the function must not be defined and the function pointer this->dev_ready is set to NULL. 
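+ A minimal sketch, following the GPIO style used by the other examples
+ in this document (GPIO() and BOARD_NAND_RB are placeholders for the
+ real board specific accessors), could look like this:
+ <programlisting>
+static int board_dev_ready (struct mtd_info *mtd)
+{
+	/* Return 0 while the R/B line is low (busy), 1 when it is high */
+	return GPIO(BOARD_NAND_RB) ? 1 : 0;
+}
+ </programlisting>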
+ </para>
+ </sect1>
+ <sect1>
+ <title>Init function</title>
+ <para>
+ The init function allocates memory and sets up all the board
+ specific parameters and function pointers. When everything
+ is set up, nand_scan() is called. This function tries to
+ detect and identify the chip. If a chip is found, all the
+ internal data fields are initialized accordingly.
+ The structure(s) have to be zeroed out first and then filled with the
+ necessary information about the device.
+ </para>
+ <programlisting>
+int __init board_init (void)
+{
+	struct nand_chip *this;
+	int err = 0;
+
+	/* Allocate memory for MTD device structure and private data */
+	board_mtd = kmalloc (sizeof(struct mtd_info) + sizeof (struct nand_chip), GFP_KERNEL);
+	if (!board_mtd) {
+		printk ("Unable to allocate NAND MTD device structure.\n");
+		err = -ENOMEM;
+		goto out;
+	}
+
+	/* Initialize structures */
+	memset ((char *) board_mtd, 0, sizeof(struct mtd_info) + sizeof(struct nand_chip));
+
+	/* map physical address */
+	baseaddr = (unsigned long)ioremap(CHIP_PHYSICAL_ADDRESS, 1024);
+	if (!baseaddr) {
+		printk("Ioremap to access NAND chip failed\n");
+		err = -EIO;
+		goto out_mtd;
+	}
+
+	/* Get pointer to private data */
+	this = (struct nand_chip *) (&board_mtd[1]);
+	/* Link the private data with the MTD structure */
+	board_mtd->priv = this;
+
+	/* Set address of NAND IO lines */
+	this->IO_ADDR_R = baseaddr;
+	this->IO_ADDR_W = baseaddr;
+	/* Reference hardware control function */
+	this->hwcontrol = board_hwcontrol;
+	/* Set command delay time, see datasheet for correct value */
+	this->chip_delay = CHIP_DEPENDEND_COMMAND_DELAY;
+	/* Assign the device ready function, if available */
+	this->dev_ready = board_dev_ready;
+	this->eccmode = NAND_ECC_SOFT;
+
+	/* Scan to find existence of the device */
+	if (nand_scan (board_mtd, 1)) {
+		err = -ENXIO;
+		goto out_ior;
+	}
+
+	add_mtd_partitions(board_mtd, partition_info, NUM_PARTITIONS);
+	goto out;
+
+out_ior:
+	iounmap((void *)baseaddr);
+out_mtd:
+	kfree (board_mtd);
+out:
+	return err;
+}
+module_init(board_init);
+ </programlisting>
+ </sect1>
+ <sect1>
+ <title>Exit function</title>
+ <para>
+ The exit function is only necessary if the driver is
+ compiled as a module. It releases all resources which
+ are held by the chip driver and unregisters the partitions
+ in the MTD layer.
+ </para>
+ <programlisting>
+#ifdef MODULE
+static void __exit board_cleanup (void)
+{
+	/* Release resources, unregister device */
+	nand_release (board_mtd);
+
+	/* unmap physical address */
+	iounmap((void *)baseaddr);
+
+	/* Free the MTD device structure */
+	kfree (board_mtd);
+}
+module_exit(board_cleanup);
+#endif
+ </programlisting>
+ </sect1>
+ </chapter>
+
+ <chapter id="boarddriversadvanced">
+ <title>Advanced board driver functions</title>
+ <para>
+ This chapter describes the advanced functionality of the NAND
+ driver. For a list of functions which can be overridden by the board
+ driver see the documentation of the nand_chip structure.
+ </para>
+ <sect1>
+ <title>Multiple chip control</title>
+ <para>
+ The nand driver can control chip arrays. Therefore the
+ board driver must provide its own select_chip function. This
+ function must (de)select the requested chip.
+ The function pointer in the nand_chip structure must
+ be set before calling nand_scan(). The maxchip parameter
+ of nand_scan() defines the maximum number of chips to
+ scan for. Make sure that the select_chip function can
+ handle the requested number of chips.
+ </para>
+ <para>
+ The nand driver concatenates the chips into one virtual
+ chip and provides this virtual chip to the MTD layer.
+ </para>
+ <para>
+ <emphasis>Note: The driver can only handle linear chip arrays
+ of equally sized chips. There is no support for
+ parallel arrays which extend the buswidth.</emphasis>
+ </para>
+ <para>
+ <emphasis>GPIO based example</emphasis>
+ </para>
+ <programlisting>
+static void board_select_chip (struct mtd_info *mtd, int chip)
+{
+	/* Deselect all chips, set all nCE pins high */
+	GPIO(BOARD_NAND_NCE) |= 0xff;
+	if (chip >= 0)
+		GPIO(BOARD_NAND_NCE) &= ~ (1 << chip);
+}
+ </programlisting>
+ <para>
+ <emphasis>Address lines based example.</emphasis>
+ It is assumed that the nCE pins are connected to an
+ address decoder.
+ </para>
+ <programlisting>
+static void board_select_chip (struct mtd_info *mtd, int chip)
+{
+	struct nand_chip *this = (struct nand_chip *) mtd->priv;
+
+	/* Deselect all chips */
+	this->IO_ADDR_R &= ~BOARD_NAND_ADDR_MASK;
+	this->IO_ADDR_W &= ~BOARD_NAND_ADDR_MASK;
+	switch (chip) {
+	case 0:
+		this->IO_ADDR_R |= BOARD_NAND_ADDR_CHIP0;
+		this->IO_ADDR_W |= BOARD_NAND_ADDR_CHIP0;
+		break;
+	....
+	case n:
+		this->IO_ADDR_R |= BOARD_NAND_ADDR_CHIPn;
+		this->IO_ADDR_W |= BOARD_NAND_ADDR_CHIPn;
+		break;
+	}
+}
+ </programlisting>
+ </sect1>
+ <sect1>
+ <title>Hardware ECC support</title>
+ <sect2>
+ <title>Functions and constants</title>
+ <para>
+ The nand driver supports the following types of
+ hardware ECC:
+ <itemizedlist>
+ <listitem><para>NAND_ECC_HW3_256</para><para>
+ Hardware ECC generator providing 3 bytes ECC per
+ 256 bytes.
+ </para> </listitem>
+ <listitem><para>NAND_ECC_HW3_512</para><para>
+ Hardware ECC generator providing 3 bytes ECC per
+ 512 bytes.
+ </para> </listitem>
+ <listitem><para>NAND_ECC_HW6_512</para><para>
+ Hardware ECC generator providing 6 bytes ECC per
+ 512 bytes.
+ </para> </listitem>
+ <listitem><para>NAND_ECC_HW8_512</para><para>
+ Hardware ECC generator providing 8 bytes ECC per
+ 512 bytes.
+ </para> </listitem>
+ </itemizedlist>
+ If your hardware generator has a different functionality,
+ add it at the appropriate place in nand_base.c.
+ </para>
+ <para>
+ The board driver must provide the following functions:
+ <itemizedlist>
+ <listitem><para>enable_hwecc</para><para>
+ This function is called before reading / writing to
+ the chip. Reset or initialize the hardware generator
+ in this function. The function is called with an
+ argument which lets you distinguish between read
+ and write operations.
+ </para> </listitem>
+ <listitem><para>calculate_ecc</para><para>
+ This function is called after read / write from / to
+ the chip. Transfer the ECC from the hardware to
+ the buffer. If the option NAND_HWECC_SYNDROME is set
+ then the function is only called on write. See below.
+ </para> </listitem>
+ <listitem><para>correct_data</para><para>
+ In case of an ECC error this function is called for
+ error detection and correction. Return 1 or 2
+ if the error could be corrected. If the error is
+ not correctable, return -1. If your hardware generator
+ matches the default algorithm of the nand_ecc software
+ generator then use the correction function provided
+ by nand_ecc instead of implementing duplicated code.
+ </para> </listitem>
+ </itemizedlist>
+ </para>
+ </sect2>
+ <sect2>
+ <title>Hardware ECC with syndrome calculation</title>
+ <para>
+ Many hardware ECC implementations provide Reed-Solomon
+ codes and calculate an error syndrome on read.
The syndrome + must be converted to a standard Reed-Solomon syndrome + before calling the error correction code in the generic + Reed-Solomon library. + </para> + <para> + The ECC bytes must be placed immidiately after the data + bytes in order to make the syndrome generator work. This + is contrary to the usual layout used by software ECC. The + seperation of data and out of band area is not longer + possible. The nand driver code handles this layout and + the remaining free bytes in the oob area are managed by + the autoplacement code. Provide a matching oob-layout + in this case. See rts_from4.c and diskonchip.c for + implementation reference. In those cases we must also + use bad block tables on FLASH, because the ECC layout is + interferring with the bad block marker positions. + See bad block table support for details. + </para> + </sect2> + </sect1> + <sect1> + <title>Bad block table support</title> + <para> + Most NAND chips mark the bad blocks at a defined + position in the spare area. Those blocks must + not be erased under any circumstances as the bad + block information would be lost. + It is possible to check the bad block mark each + time when the blocks are accessed by reading the + spare area of the first page in the block. This + is time consuming so a bad block table is used. + </para> + <para> + The nand driver supports various types of bad block + tables. + <itemizedlist> + <listitem><para>Per device</para><para> + The bad block table contains all bad block information + of the device which can consist of multiple chips. + </para> </listitem> + <listitem><para>Per chip</para><para> + A bad block table is used per chip and contains the + bad block information for this particular chip. + </para> </listitem> + <listitem><para>Fixed offset</para><para> + The bad block table is located at a fixed offset + in the chip (device). This applies to various + DiskOnChip devices. + </para> </listitem> + <listitem><para>Automatic placed</para><para> + The bad block table is automatically placed and + detected either at the end or at the beginning + of a chip (device) + </para> </listitem> + <listitem><para>Mirrored tables</para><para> + The bad block table is mirrored on the chip (device) to + allow updates of the bad block table without data loss. + </para> </listitem> + </itemizedlist> + </para> + <para> + nand_scan() calls the function nand_default_bbt(). + nand_default_bbt() selects appropriate default + bad block table desriptors depending on the chip information + which was retrieved by nand_scan(). + </para> + <para> + The standard policy is scanning the device for bad + blocks and build a ram based bad block table which + allows faster access than always checking the + bad block information on the flash chip itself. + </para> + <sect2> + <title>Flash based tables</title> + <para> + It may be desired or neccecary to keep a bad block table in FLASH. + For AG-AND chips this is mandatory, as they have no factory marked + bad blocks. They have factory marked good blocks. The marker pattern + is erased when the block is erased to be reused. So in case of + powerloss before writing the pattern back to the chip this block + would be lost and added to the bad blocks. Therefor we scan the + chip(s) when we detect them the first time for good blocks and + store this information in a bad block table before erasing any + of the blocks. 
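+ As described in the following paragraphs, a board driver switches this
+ support on through the options field of the nand_chip structure before
+ nand_scan() is called; a minimal sketch, reusing the init example from
+ the basic board driver chapter, is shown here:
+ <programlisting>
+	/* In board_init(), before nand_scan() is called */
+	this->options |= NAND_USE_FLASH_BBT;
+
+	if (nand_scan (board_mtd, 1)) {
+		err = -ENXIO;
+		goto out_ior;
+	}
+ </programlisting>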
+ </para> + <para> + The blocks in which the tables are stored are procteted against + accidental access by marking them bad in the memory bad block + table. The bad block table managment functions are allowed + to circumvernt this protection. + </para> + <para> + The simplest way to activate the FLASH based bad block table support + is to set the option NAND_USE_FLASH_BBT in the option field of + the nand chip structure before calling nand_scan(). For AG-AND + chips is this done by default. + This activates the default FLASH based bad block table functionality + of the NAND driver. The default bad block table options are + <itemizedlist> + <listitem><para>Store bad block table per chip</para></listitem> + <listitem><para>Use 2 bits per block</para></listitem> + <listitem><para>Automatic placement at the end of the chip</para></listitem> + <listitem><para>Use mirrored tables with version numbers</para></listitem> + <listitem><para>Reserve 4 blocks at the end of the chip</para></listitem> + </itemizedlist> + </para> + </sect2> + <sect2> + <title>User defined tables</title> + <para> + User defined tables are created by filling out a + nand_bbt_descr structure and storing the pointer in the + nand_chip structure member bbt_td before calling nand_scan(). + If a mirror table is neccecary a second structure must be + created and a pointer to this structure must be stored + in bbt_md inside the nand_chip structure. If the bbt_md + member is set to NULL then only the main table is used + and no scan for the mirrored table is performed. + </para> + <para> + The most important field in the nand_bbt_descr structure + is the options field. The options define most of the + table properties. Use the predefined constants from + nand.h to define the options. + <itemizedlist> + <listitem><para>Number of bits per block</para> + <para>The supported number of bits is 1, 2, 4, 8.</para></listitem> + <listitem><para>Table per chip</para> + <para>Setting the constant NAND_BBT_PERCHIP selects that + a bad block table is managed for each chip in a chip array. + If this option is not set then a per device bad block table + is used.</para></listitem> + <listitem><para>Table location is absolute</para> + <para>Use the option constant NAND_BBT_ABSPAGE and + define the absolute page number where the bad block + table starts in the field pages. If you have selected bad block + tables per chip and you have a multi chip array then the start page + must be given for each chip in the chip array. Note: there is no scan + for a table ident pattern performed, so the fields + pattern, veroffs, offs, len can be left uninitialized</para></listitem> + <listitem><para>Table location is automatically detected</para> + <para>The table can either be located in the first or the last good + blocks of the chip (device). Set NAND_BBT_LASTBLOCK to place + the bad block table at the end of the chip (device). The + bad block tables are marked and identified by a pattern which + is stored in the spare area of the first page in the block which + holds the bad block table. Store a pointer to the pattern + in the pattern field. Further the length of the pattern has to be + stored in len and the offset in the spare area must be given + in the offs member of the nand_bbt_descr stucture. For mirrored + bad block tables different patterns are mandatory.</para></listitem> + <listitem><para>Table creation</para> + <para>Set the option NAND_BBT_CREATE to enable the table creation + if no table can be found during the scan. 
Usually this is done only + once if a new chip is found. </para></listitem> + <listitem><para>Table write support</para> + <para>Set the option NAND_BBT_WRITE to enable the table write support. + This allows the update of the bad block table(s) in case a block has + to be marked bad due to wear. The MTD interface function block_markbad + is calling the update function of the bad block table. If the write + support is enabled then the table is updated on FLASH.</para> + <para> + Note: Write support should only be enabled for mirrored tables with + version control. + </para></listitem> + <listitem><para>Table version control</para> + <para>Set the option NAND_BBT_VERSION to enable the table version control. + It's highly recommended to enable this for mirrored tables with write + support. It makes sure that the risk of loosing the bad block + table information is reduced to the loss of the information about the + one worn out block which should be marked bad. The version is stored in + 4 consecutive bytes in the spare area of the device. The position of + the version number is defined by the member veroffs in the bad block table + descriptor.</para></listitem> + <listitem><para>Save block contents on write</para> + <para> + In case that the block which holds the bad block table does contain + other useful information, set the option NAND_BBT_SAVECONTENT. When + the bad block table is written then the whole block is read the bad + block table is updated and the block is erased and everything is + written back. If this option is not set only the bad block table + is written and everything else in the block is ignored and erased. + </para></listitem> + <listitem><para>Number of reserved blocks</para> + <para> + For automatic placement some blocks must be reserved for + bad block table storage. The number of reserved blocks is defined + in the maxblocks member of the babd block table description structure. + Reserving 4 blocks for mirrored tables should be a reasonable number. + This also limits the number of blocks which are scanned for the bad + block table ident pattern. + </para></listitem> + </itemizedlist> + </para> + </sect2> + </sect1> + <sect1> + <title>Spare area (auto)placement</title> + <para> + The nand driver implements different possibilities for + placement of filesystem data in the spare area, + <itemizedlist> + <listitem><para>Placement defined by fs driver</para></listitem> + <listitem><para>Automatic placement</para></listitem> + </itemizedlist> + The default placement function is automatic placement. The + nand driver has built in default placement schemes for the + various chiptypes. If due to hardware ECC functionality the + default placement does not fit then the board driver can + provide a own placement scheme. + </para> + <para> + File system drivers can provide a own placement scheme which + is used instead of the default placement scheme. + </para> + <para> + Placement schemes are defined by a nand_oobinfo structure + <programlisting> +struct nand_oobinfo { + int useecc; + int eccbytes; + int eccpos[24]; + int oobfree[8][2]; +}; + </programlisting> + <itemizedlist> + <listitem><para>useecc</para><para> + The useecc member controls the ecc and placement function. The header + file include/mtd/mtd-abi.h contains constants to select ecc and + placement. MTD_NANDECC_OFF switches off the ecc complete. This is + not recommended and available for testing and diagnosis only. 
+  </sect2>
+ </sect1>
+ <sect1>
+  <title>Spare area (auto)placement</title>
+  <para>
+   The nand driver implements different possibilities for
+   placement of filesystem data in the spare area:
+   <itemizedlist>
+   <listitem><para>Placement defined by fs driver</para></listitem>
+   <listitem><para>Automatic placement</para></listitem>
+   </itemizedlist>
+   The default placement function is automatic placement. The
+   nand driver has built-in default placement schemes for the
+   various chip types. If the default placement does not fit,
+   due to hardware ECC functionality for example, then the board
+   driver can provide its own placement scheme.
+  </para>
+  <para>
+   File system drivers can provide their own placement scheme which
+   is used instead of the default placement scheme.
+  </para>
+  <para>
+   Placement schemes are defined by a nand_oobinfo structure
+   <programlisting>
+struct nand_oobinfo {
+	int	useecc;
+	int	eccbytes;
+	int	eccpos[24];
+	int	oobfree[8][2];
+};
+   </programlisting>
+   <itemizedlist>
+   <listitem><para>useecc</para><para>
+   The useecc member controls the ecc and placement function. The header
+   file include/mtd/mtd-abi.h contains constants to select ecc and
+   placement. MTD_NANDECC_OFF switches off the ecc completely. This is
+   not recommended; it is available for testing and diagnosis only.
+   MTD_NANDECC_PLACE selects caller defined placement, MTD_NANDECC_AUTOPLACE
+   selects automatic placement.
+   </para></listitem>
+   <listitem><para>eccbytes</para><para>
+   The eccbytes member defines the number of ecc bytes per page.
+   </para></listitem>
+   <listitem><para>eccpos</para><para>
+   The eccpos array holds the byte offsets in the spare area where
+   the ecc codes are placed.
+   </para></listitem>
+   <listitem><para>oobfree</para><para>
+   The oobfree array defines the areas in the spare area which can be
+   used for automatic placement. The information is given in the format
+   {offset, size}. offset defines the start of the usable area, size the
+   length in bytes. More than one area can be defined. The list is terminated
+   by a {0, 0} entry.
+   </para></listitem>
+   </itemizedlist>
+  </para>
+  <sect2>
+   <title>Placement defined by fs driver</title>
+   <para>
+    The calling function provides a pointer to a nand_oobinfo
+    structure which defines the ecc placement. For writes the
+    caller must provide a spare area buffer along with the
+    data buffer. The spare area buffer size is (number of pages) *
+    (size of spare area). For reads the buffer size is
+    (number of pages) * ((size of spare area) + (number of ecc
+    steps per page) * sizeof (int)). The driver stores the
+    result of the ecc check for each tuple in the spare buffer.
+    The storage sequence is
+   </para>
+   <para>
+    <spare data page 0><ecc result 0>...<ecc result n>
+   </para>
+   <para>
+    ...
+   </para>
+   <para>
+    <spare data page n><ecc result 0>...<ecc result n>
+   </para>
+   <para>
+    This is a legacy mode used by YAFFS1.
+   </para>
+   <para>
+    If the spare area buffer is NULL then only the ECC placement is
+    done according to the given scheme in the nand_oobinfo structure.
+   </para>
+  </sect2>
+  <sect2>
+   <title>Automatic placement</title>
+   <para>
+    Automatic placement uses the built-in defaults to place the
+    ecc bytes in the spare area. If filesystem data has to be stored in
+    or read from the spare area, then the calling function must provide a
+    buffer. The buffer size per page is determined by the oobfree array in
+    the nand_oobinfo structure.
+   </para>
+   <para>
+    If the spare area buffer is NULL then only the ECC placement is
+    done according to the default builtin scheme.
+   </para>
+  </sect2>
+  <sect2>
+   <title>User space placement selection</title>
+   <para>
+    All non-ecc functions like mtd->read and mtd->write use an internal
+    structure, which can be set by an ioctl. This structure is preset
+    to the autoplacement default.
+   <programlisting>
+	ioctl (fd, MEMSETOOBSEL, oobsel);
+   </programlisting>
+    oobsel is a pointer to a user supplied structure of type
+    nand_oobinfo. The contents of this structure must match the
+    criteria of the filesystem which will be used. See an example in utils/nandwrite.c.
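+   </para>
+   <para>
+    A minimal user space sketch, which selects the automatic placement
+    scheme for a device node, could look like the following. The device
+    path, the header names and the error handling are illustrative only;
+    MEMSETOOBSEL, struct nand_oobinfo and MTD_NANDECC_AUTOPLACE come from
+    include/mtd/mtd-abi.h as described above.
+   <programlisting>
+#include <fcntl.h>
+#include <stdio.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/ioctl.h>
+#include <mtd/mtd-user.h>
+
+int main(void)
+{
+	struct nand_oobinfo oobsel;
+	int fd = open("/dev/mtd0", O_RDWR);
+
+	if (fd < 0) {
+		perror("open");
+		return 1;
+	}
+
+	memset(&oobsel, 0, sizeof(oobsel));
+	oobsel.useecc = MTD_NANDECC_AUTOPLACE;	/* use the builtin autoplacement scheme */
+
+	if (ioctl(fd, MEMSETOOBSEL, &oobsel) < 0)
+		perror("MEMSETOOBSEL");
+
+	close(fd);
+	return 0;
+}
+   </programlisting>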
+ </para> + </sect2> + </sect1> + <sect1> + <title>Spare area autoplacement default schemes</title> + <sect2> + <title>256 byte pagesize</title> +<informaltable><tgroup cols="3"><tbody> +<row> +<entry>Offset</entry> +<entry>Content</entry> +<entry>Comment</entry> +</row> +<row> +<entry>0x00</entry> +<entry>ECC byte 0</entry> +<entry>Error correction code byte 0</entry> +</row> +<row> +<entry>0x01</entry> +<entry>ECC byte 1</entry> +<entry>Error correction code byte 1</entry> +</row> +<row> +<entry>0x02</entry> +<entry>ECC byte 2</entry> +<entry>Error correction code byte 2</entry> +</row> +<row> +<entry>0x03</entry> +<entry>Autoplace 0</entry> +<entry></entry> +</row> +<row> +<entry>0x04</entry> +<entry>Autoplace 1</entry> +<entry></entry> +</row> +<row> +<entry>0x05</entry> +<entry>Bad block marker</entry> +<entry>If any bit in this byte is zero, then this block is bad. +This applies only to the first page in a block. In the remaining +pages this byte is reserved</entry> +</row> +<row> +<entry>0x06</entry> +<entry>Autoplace 2</entry> +<entry></entry> +</row> +<row> +<entry>0x07</entry> +<entry>Autoplace 3</entry> +<entry></entry> +</row> +</tbody></tgroup></informaltable> + </sect2> + <sect2> + <title>512 byte pagesize</title> +<informaltable><tgroup cols="3"><tbody> +<row> +<entry>Offset</entry> +<entry>Content</entry> +<entry>Comment</entry> +</row> +<row> +<entry>0x00</entry> +<entry>ECC byte 0</entry> +<entry>Error correction code byte 0 of the lower 256 Byte data in +this page</entry> +</row> +<row> +<entry>0x01</entry> +<entry>ECC byte 1</entry> +<entry>Error correction code byte 1 of the lower 256 Bytes of data +in this page</entry> +</row> +<row> +<entry>0x02</entry> +<entry>ECC byte 2</entry> +<entry>Error correction code byte 2 of the lower 256 Bytes of data +in this page</entry> +</row> +<row> +<entry>0x03</entry> +<entry>ECC byte 3</entry> +<entry>Error correction code byte 0 of the upper 256 Bytes of data +in this page</entry> +</row> +<row> +<entry>0x04</entry> +<entry>reserved</entry> +<entry>reserved</entry> +</row> +<row> +<entry>0x05</entry> +<entry>Bad block marker</entry> +<entry>If any bit in this byte is zero, then this block is bad. +This applies only to the first page in a block. In the remaining +pages this byte is reserved</entry> +</row> +<row> +<entry>0x06</entry> +<entry>ECC byte 4</entry> +<entry>Error correction code byte 1 of the upper 256 Bytes of data +in this page</entry> +</row> +<row> +<entry>0x07</entry> +<entry>ECC byte 5</entry> +<entry>Error correction code byte 2 of the upper 256 Bytes of data +in this page</entry> +</row> +<row> +<entry>0x08 - 0x0F</entry> +<entry>Autoplace 0 - 7</entry> +<entry></entry> +</row> +</tbody></tgroup></informaltable> + </sect2> + <sect2> + <title>2048 byte pagesize</title> +<informaltable><tgroup cols="3"><tbody> +<row> +<entry>Offset</entry> +<entry>Content</entry> +<entry>Comment</entry> +</row> +<row> +<entry>0x00</entry> +<entry>Bad block marker</entry> +<entry>If any bit in this byte is zero, then this block is bad. +This applies only to the first page in a block. 
+In the remaining
+pages this byte is reserved</entry>
+</row>
+<row>
+<entry>0x01</entry>
+<entry>Reserved</entry>
+<entry>Reserved</entry>
+</row>
+<row>
+<entry>0x02-0x27</entry>
+<entry>Autoplace 0 - 37</entry>
+<entry></entry>
+</row>
+<row>
+<entry>0x28</entry>
+<entry>ECC byte 0</entry>
+<entry>Error correction code byte 0 of the first 256 Bytes of data in
+this page</entry>
+</row>
+<row>
+<entry>0x29</entry>
+<entry>ECC byte 1</entry>
+<entry>Error correction code byte 1 of the first 256 Bytes of data
+in this page</entry>
+</row>
+<row>
+<entry>0x2A</entry>
+<entry>ECC byte 2</entry>
+<entry>Error correction code byte 2 of the first 256 Bytes of data in
+this page</entry>
+</row>
+<row>
+<entry>0x2B</entry>
+<entry>ECC byte 3</entry>
+<entry>Error correction code byte 0 of the second 256 Bytes of data
+in this page</entry>
+</row>
+<row>
+<entry>0x2C</entry>
+<entry>ECC byte 4</entry>
+<entry>Error correction code byte 1 of the second 256 Bytes of data
+in this page</entry>
+</row>
+<row>
+<entry>0x2D</entry>
+<entry>ECC byte 5</entry>
+<entry>Error correction code byte 2 of the second 256 Bytes of data
+in this page</entry>
+</row>
+<row>
+<entry>0x2E</entry>
+<entry>ECC byte 6</entry>
+<entry>Error correction code byte 0 of the third 256 Bytes of data
+in this page</entry>
+</row>
+<row>
+<entry>0x2F</entry>
+<entry>ECC byte 7</entry>
+<entry>Error correction code byte 1 of the third 256 Bytes of data
+in this page</entry>
+</row>
+<row>
+<entry>0x30</entry>
+<entry>ECC byte 8</entry>
+<entry>Error correction code byte 2 of the third 256 Bytes of data
+in this page</entry>
+</row>
+<row>
+<entry>0x31</entry>
+<entry>ECC byte 9</entry>
+<entry>Error correction code byte 0 of the fourth 256 Bytes of data
+in this page</entry>
+</row>
+<row>
+<entry>0x32</entry>
+<entry>ECC byte 10</entry>
+<entry>Error correction code byte 1 of the fourth 256 Bytes of data
+in this page</entry>
+</row>
+<row>
+<entry>0x33</entry>
+<entry>ECC byte 11</entry>
+<entry>Error correction code byte 2 of the fourth 256 Bytes of data
+in this page</entry>
+</row>
+<row>
+<entry>0x34</entry>
+<entry>ECC byte 12</entry>
+<entry>Error correction code byte 0 of the fifth 256 Bytes of data
+in this page</entry>
+</row>
+<row>
+<entry>0x35</entry>
+<entry>ECC byte 13</entry>
+<entry>Error correction code byte 1 of the fifth 256 Bytes of data
+in this page</entry>
+</row>
+<row>
+<entry>0x36</entry>
+<entry>ECC byte 14</entry>
+<entry>Error correction code byte 2 of the fifth 256 Bytes of data
+in this page</entry>
+</row>
+<row>
+<entry>0x37</entry>
+<entry>ECC byte 15</entry>
+<entry>Error correction code byte 0 of the sixth 256 Bytes of data
+in this page</entry>
+</row>
+<row>
+<entry>0x38</entry>
+<entry>ECC byte 16</entry>
+<entry>Error correction code byte 1 of the sixth 256 Bytes of data
+in this page</entry>
+</row>
+<row>
+<entry>0x39</entry>
+<entry>ECC byte 17</entry>
+<entry>Error correction code byte 2 of the sixth 256 Bytes of data
+in this page</entry>
+</row>
+<row>
+<entry>0x3A</entry>
+<entry>ECC byte 18</entry>
+<entry>Error correction code byte 0 of the seventh 256 Bytes of
+data in this page</entry>
+</row>
+<row>
+<entry>0x3B</entry>
+<entry>ECC byte 19</entry>
+<entry>Error correction code byte 1 of the seventh 256 Bytes of
+data in this page</entry>
+</row>
+<row>
+<entry>0x3C</entry>
+<entry>ECC byte 20</entry>
+<entry>Error correction code byte 2 of the seventh 256 Bytes of
+data in this page</entry>
+</row>
+<row>
+<entry>0x3D</entry>
+<entry>ECC byte 21</entry>
+<entry>Error correction code byte 0 of the eighth 256 Bytes of data
+in this page</entry>
+</row>
+<row>
+<entry>0x3E</entry>
+<entry>ECC byte 22</entry>
+<entry>Error correction code byte 1 of the eighth 256 Bytes of data
+in this page</entry>
+</row>
+<row>
+<entry>0x3F</entry>
+<entry>ECC byte 23</entry>
+<entry>Error correction code byte 2 of the eighth 256 Bytes of data
+in this page</entry>
+</row>
+</tbody></tgroup></informaltable>
+  </sect2>
+ </sect1>
+</chapter>
+
+<chapter id="filesystems">
+ <title>Filesystem support</title>
+ <para>
+  The NAND driver provides all necessary functions for a
+  filesystem via the MTD interface.
+ </para>
+ <para>
+  Filesystems must be aware of the NAND peculiarities and
+  restrictions. One major restriction of NAND Flash is that you cannot
+  write as often as you want to a page. The consecutive writes to a page,
+  before erasing it again, are restricted to 1-3 writes, depending on the
+  manufacturer's specifications. This applies similarly to the spare area.
+ </para>
+ <para>
+  Therefore NAND aware filesystems must either write in page size chunks
+  or hold a writebuffer to collect smaller writes until they sum up to
+  pagesize. Available NAND aware filesystems: JFFS2, YAFFS.
+ </para>
+ <para>
+  The spare area usage to store filesystem data is controlled by
+  the spare area placement functionality which is described in one
+  of the earlier chapters.
+ </para>
+</chapter>
+<chapter id="tools">
+ <title>Tools</title>
+ <para>
+  The MTD project provides a couple of helpful tools to handle NAND Flash.
+  <itemizedlist>
+  <listitem><para>flasherase, flasheraseall: Erase and format FLASH partitions</para></listitem>
+  <listitem><para>nandwrite: write filesystem images to NAND FLASH</para></listitem>
+  <listitem><para>nanddump: dump the contents of a NAND FLASH partition</para></listitem>
+  </itemizedlist>
+ </para>
+ <para>
+  These tools are aware of the NAND restrictions. Please use these tools
+  instead of complaining about errors which are caused by non NAND aware
+  access methods.
+ </para>
+</chapter>
+
+<chapter id="defines">
+ <title>Constants</title>
+ <para>
+  This chapter describes the constants which might be relevant for a driver developer.
+ </para>
+ <sect1>
+  <title>Chip option constants</title>
+  <sect2>
+   <title>Constants for chip id table</title>
+   <para>
+   These constants are defined in nand.h. They are ored together to describe
+   the chip functionality.
+   <programlisting>
+/* Chip can not auto increment pages */
+#define NAND_NO_AUTOINCR	0x00000001
+/* Buswidth is 16 bit */
+#define NAND_BUSWIDTH_16	0x00000002
+/* Device supports partial programming without padding */
+#define NAND_NO_PADDING		0x00000004
+/* Chip has cache program function */
+#define NAND_CACHEPRG		0x00000008
+/* Chip has copy back function */
+#define NAND_COPYBACK		0x00000010
+/* AND Chip which has 4 banks and a confusing page / block
+ * assignment. See Renesas datasheet for further information */
+#define NAND_IS_AND		0x00000020
+/* Chip has an array of 4 pages which can be read without
+ * additional ready /busy waits */
+#define NAND_4PAGE_ARRAY	0x00000040
+   </programlisting>
+   </para>
+  </sect2>
+  <sect2>
+   <title>Constants for runtime options</title>
+   <para>
+   These constants are defined in nand.h. They are ored together to describe
+   the functionality.
+   <programlisting>
+/* Use a flash based bad block table. This option is parsed by the
+ * default bad block table function (nand_default_bbt). */
+#define NAND_USE_FLASH_BBT	0x00010000
+/* The hw ecc generator provides a syndrome instead of an ecc value on read.
+ * This can only work if we have the ecc bytes directly behind the
+ * data bytes. Applies for DOC and AG-AND Renesas HW Reed Solomon generators */
+#define NAND_HWECC_SYNDROME	0x00020000
+   </programlisting>
+   </para>
+  </sect2>
+ </sect1>
+
+ <sect1>
+  <title>ECC selection constants</title>
+  <para>
+  Use these constants to select the ECC algorithm.
+  <programlisting>
+/* No ECC. Usage is not recommended ! */
+#define NAND_ECC_NONE		0
+/* Software ECC 3 byte ECC per 256 Byte data */
+#define NAND_ECC_SOFT		1
+/* Hardware ECC 3 byte ECC per 256 Byte data */
+#define NAND_ECC_HW3_256	2
+/* Hardware ECC 3 byte ECC per 512 Byte data */
+#define NAND_ECC_HW3_512	3
+/* Hardware ECC 6 byte ECC per 512 Byte data */
+#define NAND_ECC_HW6_512	4
+/* Hardware ECC 8 byte ECC per 512 Byte data */
+#define NAND_ECC_HW8_512	6
+  </programlisting>
+  </para>
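+  <para>
+  As a short sketch (assuming the selected constant is stored in the eccmode
+  member of the nand_chip structure and that this points to the board
+  specific nand_chip), a board driver without ECC hardware would typically
+  select software ECC before calling nand_scan():
+  <programlisting>
+/* 3 byte ECC per 256 bytes of data, calculated in software */
+this->eccmode = NAND_ECC_SOFT;
+  </programlisting>
+  </para>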
+ </sect1>
+
+ <sect1>
+  <title>Hardware control related constants</title>
+  <para>
+  These constants describe the requested hardware access function when
+  the board specific hardware control function is called.
+  <programlisting>
+/* Select the chip by setting nCE to low */
+#define NAND_CTL_SETNCE		1
+/* Deselect the chip by setting nCE to high */
+#define NAND_CTL_CLRNCE		2
+/* Select the command latch by setting CLE to high */
+#define NAND_CTL_SETCLE		3
+/* Deselect the command latch by setting CLE to low */
+#define NAND_CTL_CLRCLE		4
+/* Select the address latch by setting ALE to high */
+#define NAND_CTL_SETALE		5
+/* Deselect the address latch by setting ALE to low */
+#define NAND_CTL_CLRALE		6
+/* Set write protection by setting WP to high. Not used! */
+#define NAND_CTL_SETWP		7
+/* Clear write protection by setting WP to low. Not used! */
+#define NAND_CTL_CLRWP		8
+  </programlisting>
+  </para>
+ </sect1>
+
+ <sect1>
+  <title>Bad block table related constants</title>
+  <para>
+  These constants describe the options used for bad block
+  table descriptors.
+  <programlisting>
+/* Options for the bad block table descriptors */
+
+/* The number of bits used per block in the bbt on the device */
+#define NAND_BBT_NRBITS_MSK	0x0000000F
+#define NAND_BBT_1BIT		0x00000001
+#define NAND_BBT_2BIT		0x00000002
+#define NAND_BBT_4BIT		0x00000004
+#define NAND_BBT_8BIT		0x00000008
+/* The bad block table is in the last good block of the device */
+#define NAND_BBT_LASTBLOCK	0x00000010
+/* The bbt is at the given page, else we must scan for the bbt */
+#define NAND_BBT_ABSPAGE	0x00000020
+/* The bbt is not at a fixed page; scan for the bbt ident pattern */
+#define NAND_BBT_SEARCH		0x00000040
+/* bbt is stored per chip on multichip devices */
+#define NAND_BBT_PERCHIP	0x00000080
+/* bbt has a version counter at offset veroffs */
+#define NAND_BBT_VERSION	0x00000100
+/* Create a bbt if none exists */
+#define NAND_BBT_CREATE		0x00000200
+/* Search good / bad pattern through all pages of a block */
+#define NAND_BBT_SCANALLPAGES	0x00000400
+/* Scan block empty during good / bad block scan */
+#define NAND_BBT_SCANEMPTY	0x00000800
+/* Write bbt if necessary */
+#define NAND_BBT_WRITE		0x00001000
+/* Read and write back block contents when writing bbt */
+#define NAND_BBT_SAVECONTENT	0x00002000
+  </programlisting>
+  </para>
+ </sect1>
+
+</chapter>
+
+<chapter id="structs">
+ <title>Structures</title>
+ <para>
+ This chapter contains the autogenerated documentation of the structures which are
+ used in the NAND driver and might be relevant for a driver developer.
Each + struct member has a short description which is marked with an [XXX] identifier. + See the chapter "Documentation hints" for an explanation. + </para> +!Iinclude/linux/mtd/nand.h + </chapter> + + <chapter id="pubfunctions"> + <title>Public Functions Provided</title> + <para> + This chapter contains the autogenerated documentation of the NAND kernel API functions + which are exported. Each function has a short description which is marked with an [XXX] identifier. + See the chapter "Documentation hints" for an explanation. + </para> +!Edrivers/mtd/nand/nand_base.c +!Edrivers/mtd/nand/nand_bbt.c +!Edrivers/mtd/nand/nand_ecc.c + </chapter> + + <chapter id="intfunctions"> + <title>Internal Functions Provided</title> + <para> + This chapter contains the autogenerated documentation of the NAND driver internal functions. + Each function has a short description which is marked with an [XXX] identifier. + See the chapter "Documentation hints" for an explanation. + The functions marked with [DEFAULT] might be relevant for a board driver developer. + </para> +!Idrivers/mtd/nand/nand_base.c +!Idrivers/mtd/nand/nand_bbt.c +!Idrivers/mtd/nand/nand_ecc.c + </chapter> + + <chapter id="credits"> + <title>Credits</title> + <para> + The following people have contributed to the NAND driver: + <orderedlist> + <listitem><para>Steven J. Hill<email>sjhill@realitydiluted.com</email></para></listitem> + <listitem><para>David Woodhouse<email>dwmw2@infradead.org</email></para></listitem> + <listitem><para>Thomas Gleixner<email>tglx@linutronix.de</email></para></listitem> + </orderedlist> + A lot of users have provided bugfixes, improvements and helping hands for testing. + Thanks a lot. + </para> + <para> + The following people have contributed to this document: + <orderedlist> + <listitem><para>Thomas Gleixner<email>tglx@linutronix.de</email></para></listitem> + </orderedlist> + </para> + </chapter> +</book> diff --git a/Documentation/DocBook/procfs-guide.tmpl b/Documentation/DocBook/procfs-guide.tmpl new file mode 100644 index 000000000000..45cad23efefa --- /dev/null +++ b/Documentation/DocBook/procfs-guide.tmpl @@ -0,0 +1,591 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" + "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" [ +<!ENTITY procfsexample SYSTEM "procfs_example.xml"> +]> + +<book id="LKProcfsGuide"> + <bookinfo> + <title>Linux Kernel Procfs Guide</title> + + <authorgroup> + <author> + <firstname>Erik</firstname> + <othername>(J.A.K.)</othername> + <surname>Mouw</surname> + <affiliation> + <orgname>Delft University of Technology</orgname> + <orgdiv>Faculty of Information Technology and Systems</orgdiv> + <address> + <email>J.A.K.Mouw@its.tudelft.nl</email> + <pob>PO BOX 5031</pob> + <postcode>2600 GA</postcode> + <city>Delft</city> + <country>The Netherlands</country> + </address> + </affiliation> + </author> + </authorgroup> + + <revhistory> + <revision> + <revnumber>1.0 </revnumber> + <date>May 30, 2001</date> + <revremark>Initial revision posted to linux-kernel</revremark> + </revision> + <revision> + <revnumber>1.1 </revnumber> + <date>June 3, 2001</date> + <revremark>Revised after comments from linux-kernel</revremark> + </revision> + </revhistory> + + <copyright> + <year>2001</year> + <holder>Erik Mouw</holder> + </copyright> + + + <legalnotice> + <para> + This documentation is free software; you can redistribute it + and/or modify it under the terms of the GNU General Public + License as published by the Free Software 
Foundation; either + version 2 of the License, or (at your option) any later + version. + </para> + + <para> + This documentation is distributed in the hope that it will be + useful, but WITHOUT ANY WARRANTY; without even the implied + warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR + PURPOSE. See the GNU General Public License for more details. + </para> + + <para> + You should have received a copy of the GNU General Public + License along with this program; if not, write to the Free + Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, + MA 02111-1307 USA + </para> + + <para> + For more details see the file COPYING in the source + distribution of Linux. + </para> + </legalnotice> + </bookinfo> + + + + + <toc> + </toc> + + + + + <preface> + <title>Preface</title> + + <para> + This guide describes the use of the procfs file system from + within the Linux kernel. The idea to write this guide came up on + the #kernelnewbies IRC channel (see <ulink + url="http://www.kernelnewbies.org/">http://www.kernelnewbies.org/</ulink>), + when Jeff Garzik explained the use of procfs and forwarded me a + message Alexander Viro wrote to the linux-kernel mailing list. I + agreed to write it up nicely, so here it is. + </para> + + <para> + I'd like to thank Jeff Garzik + <email>jgarzik@pobox.com</email> and Alexander Viro + <email>viro@parcelfarce.linux.theplanet.co.uk</email> for their input, + Tim Waugh <email>twaugh@redhat.com</email> for his <ulink + url="http://people.redhat.com/twaugh/docbook/selfdocbook/">Selfdocbook</ulink>, + and Marc Joosen <email>marcj@historia.et.tudelft.nl</email> for + proofreading. + </para> + + <para> + This documentation was written while working on the LART + computing board (<ulink + url="http://www.lart.tudelft.nl/">http://www.lart.tudelft.nl/</ulink>), + which is sponsored by the Mobile Multi-media Communications + (<ulink + url="http://www.mmc.tudelft.nl/">http://www.mmc.tudelft.nl/</ulink>) + and Ubiquitous Communications (<ulink + url="http://www.ubicom.tudelft.nl/">http://www.ubicom.tudelft.nl/</ulink>) + projects. + </para> + + <para> + Erik + </para> + </preface> + + + + + <chapter id="intro"> + <title>Introduction</title> + + <para> + The <filename class="directory">/proc</filename> file system + (procfs) is a special file system in the linux kernel. It's a + virtual file system: it is not associated with a block device + but exists only in memory. The files in the procfs are there to + allow userland programs access to certain information from the + kernel (like process information in <filename + class="directory">/proc/[0-9]+/</filename>), but also for debug + purposes (like <filename>/proc/ksyms</filename>). + </para> + + <para> + This guide describes the use of the procfs file system from + within the Linux kernel. It starts by introducing all relevant + functions to manage the files within the file system. After that + it shows how to communicate with userland, and some tips and + tricks will be pointed out. Finally a complete example will be + shown. + </para> + + <para> + Note that the files in <filename + class="directory">/proc/sys</filename> are sysctl files: they + don't belong to procfs and are governed by a completely + different API described in the Kernel API book. + </para> + </chapter> + + + + + <chapter id="managing"> + <title>Managing procfs entries</title> + + <para> + This chapter describes the functions that various kernel + components use to populate the procfs with files, symlinks, + device nodes, and directories. 
+ </para> + + <para> + A minor note before we start: if you want to use any of the + procfs functions, be sure to include the correct header file! + This should be one of the first lines in your code: + </para> + + <programlisting> +#include <linux/proc_fs.h> + </programlisting> + + + + + <sect1 id="regularfile"> + <title>Creating a regular file</title> + + <funcsynopsis> + <funcprototype> + <funcdef>struct proc_dir_entry* <function>create_proc_entry</function></funcdef> + <paramdef>const char* <parameter>name</parameter></paramdef> + <paramdef>mode_t <parameter>mode</parameter></paramdef> + <paramdef>struct proc_dir_entry* <parameter>parent</parameter></paramdef> + </funcprototype> + </funcsynopsis> + + <para> + This function creates a regular file with the name + <parameter>name</parameter>, file mode + <parameter>mode</parameter> in the directory + <parameter>parent</parameter>. To create a file in the root of + the procfs, use <constant>NULL</constant> as + <parameter>parent</parameter> parameter. When successful, the + function will return a pointer to the freshly created + <structname>struct proc_dir_entry</structname>; otherwise it + will return <constant>NULL</constant>. <xref + linkend="userland"/> describes how to do something useful with + regular files. + </para> + + <para> + Note that it is specifically supported that you can pass a + path that spans multiple directories. For example + <function>create_proc_entry</function>(<parameter>"drivers/via0/info"</parameter>) + will create the <filename class="directory">via0</filename> + directory if necessary, with standard + <constant>0755</constant> permissions. + </para> + + <para> + If you only want to be able to read the file, the function + <function>create_proc_read_entry</function> described in <xref + linkend="convenience"/> may be used to create and initialise + the procfs entry in one single call. + </para> + </sect1> + + + + + <sect1> + <title>Creating a symlink</title> + + <funcsynopsis> + <funcprototype> + <funcdef>struct proc_dir_entry* + <function>proc_symlink</function></funcdef> <paramdef>const + char* <parameter>name</parameter></paramdef> + <paramdef>struct proc_dir_entry* + <parameter>parent</parameter></paramdef> <paramdef>const + char* <parameter>dest</parameter></paramdef> + </funcprototype> + </funcsynopsis> + + <para> + This creates a symlink in the procfs directory + <parameter>parent</parameter> that points from + <parameter>name</parameter> to + <parameter>dest</parameter>. This translates in userland to + <literal>ln -s</literal> <parameter>dest</parameter> + <parameter>name</parameter>. + </para> + </sect1> + + <sect1> + <title>Creating a directory</title> + + <funcsynopsis> + <funcprototype> + <funcdef>struct proc_dir_entry* <function>proc_mkdir</function></funcdef> + <paramdef>const char* <parameter>name</parameter></paramdef> + <paramdef>struct proc_dir_entry* <parameter>parent</parameter></paramdef> + </funcprototype> + </funcsynopsis> + + <para> + Create a directory <parameter>name</parameter> in the procfs + directory <parameter>parent</parameter>. 
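+      </para>
+      <para>
+        As a short sketch, the three calls described above can be combined
+        in a driver's init code. All of the names used below (the directory
+        <filename>driver_example</filename>, the file
+        <filename>info</filename>, and so on) are made up for this example;
+        only the procfs functions themselves are real. Cleanup with
+        <function>remove_proc_entry</function> is described in the next
+        section.
+      <programlisting>
+#include <linux/errno.h>
+#include <linux/init.h>
+#include <linux/proc_fs.h>
+
+static struct proc_dir_entry *example_dir;
+static struct proc_dir_entry *example_file;
+
+static int __init example_procfs_init(void)
+{
+        /* directory /proc/driver_example */
+        example_dir = proc_mkdir("driver_example", NULL);
+        if (example_dir == NULL)
+                return -ENOMEM;
+
+        /* regular file /proc/driver_example/info */
+        example_file = create_proc_entry("info", 0644, example_dir);
+        if (example_file == NULL) {
+                remove_proc_entry("driver_example", NULL);
+                return -ENOMEM;
+        }
+
+        /* symlink /proc/example_link pointing to driver_example/info */
+        proc_symlink("example_link", NULL, "driver_example/info");
+
+        return 0;
+}
+      </programlisting>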
+      </para>
+    </sect1>
+
+
+
+
+    <sect1>
+      <title>Removing an entry</title>
+
+      <funcsynopsis>
+	<funcprototype>
+	  <funcdef>void <function>remove_proc_entry</function></funcdef>
+	  <paramdef>const char* <parameter>name</parameter></paramdef>
+	  <paramdef>struct proc_dir_entry* <parameter>parent</parameter></paramdef>
+	</funcprototype>
+      </funcsynopsis>
+
+      <para>
+        Removes the entry <parameter>name</parameter> in the directory
+        <parameter>parent</parameter> from the procfs. Entries are
+        removed by their <emphasis>name</emphasis>, not by the
+        <structname>struct proc_dir_entry</structname> returned by the
+        various create functions. Note that this function doesn't
+        recursively remove entries.
+      </para>
+
+      <para>
+        Be sure to free the <structfield>data</structfield> entry from
+        the <structname>struct proc_dir_entry</structname> before
+        <function>remove_proc_entry</function> is called (that is: if
+        there was some <structfield>data</structfield> allocated, of
+        course). See <xref linkend="usingdata"/> for more information
+        on using the <structfield>data</structfield> entry.
+      </para>
+    </sect1>
+  </chapter>
+
+
+
+
+  <chapter id="userland">
+    <title>Communicating with userland</title>
+
+    <para>
+      Instead of reading (or writing) information directly from
+      kernel memory, procfs works with <emphasis>call back
+      functions</emphasis> for files: functions that are called when
+      a specific file is being read or written. Such functions have
+      to be initialised after the procfs file is created by setting
+      the <structfield>read_proc</structfield> and/or
+      <structfield>write_proc</structfield> fields in the
+      <structname>struct proc_dir_entry*</structname> that the
+      function <function>create_proc_entry</function> returned:
+    </para>
+
+    <programlisting>
+struct proc_dir_entry* entry;
+
+entry->read_proc = read_proc_foo;
+entry->write_proc = write_proc_foo;
+    </programlisting>
+
+    <para>
+      If you only want to use the
+      <structfield>read_proc</structfield>, the function
+      <function>create_proc_read_entry</function> described in <xref
+      linkend="convenience"/> may be used to create and initialise the
+      procfs entry in one single call.
+    </para>
+
+
+
+    <sect1>
+      <title>Reading data</title>
+
+      <para>
+        The read function is a call back function that allows userland
+        processes to read data from the kernel. The read function
+        should have the following format:
+      </para>
+
+      <funcsynopsis>
+	<funcprototype>
+	  <funcdef>int <function>read_func</function></funcdef>
+	  <paramdef>char* <parameter>page</parameter></paramdef>
+	  <paramdef>char** <parameter>start</parameter></paramdef>
+	  <paramdef>off_t <parameter>off</parameter></paramdef>
+	  <paramdef>int <parameter>count</parameter></paramdef>
+	  <paramdef>int* <parameter>eof</parameter></paramdef>
+	  <paramdef>void* <parameter>data</parameter></paramdef>
+	</funcprototype>
+      </funcsynopsis>
+
+      <para>
+        The read function should write its information into the
+        <parameter>page</parameter>. For proper use, the function
+        should start writing at an offset of
+        <parameter>off</parameter> in <parameter>page</parameter> and
+        write at most <parameter>count</parameter> bytes, but because
+        most read functions are quite simple and only return a small
+        amount of information, these two parameters are usually
+        ignored (it breaks pagers like <literal>more</literal> and
+        <literal>less</literal>, but <literal>cat</literal> still
+        works).
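+      </para>
+      <para>
+        A minimal sketch of such a read function, which ignores
+        <parameter>off</parameter> and <parameter>count</parameter> in the
+        way just described and returns a fixed text, could look like this.
+        The function name and the output are made up for this example; the
+        <parameter>eof</parameter> and return value conventions are
+        explained below.
+      <programlisting>
+#include <linux/kernel.h>
+#include <linux/proc_fs.h>
+
+static int example_read_proc(char *page, char **start, off_t off,
+                             int count, int *eof, void *data)
+{
+        int len;
+
+        /* write the complete output at the start of page */
+        len = sprintf(page, "hello from example_read_proc\n");
+
+        /* the whole output fits into one page, so signal end of file */
+        *eof = 1;
+
+        return len;
+}
+      </programlisting>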
+ </para> + + <para> + If the <parameter>off</parameter> and + <parameter>count</parameter> parameters are properly used, + <parameter>eof</parameter> should be used to signal that the + end of the file has been reached by writing + <literal>1</literal> to the memory location + <parameter>eof</parameter> points to. + </para> + + <para> + The parameter <parameter>start</parameter> doesn't seem to be + used anywhere in the kernel. The <parameter>data</parameter> + parameter can be used to create a single call back function for + several files, see <xref linkend="usingdata"/>. + </para> + + <para> + The <function>read_func</function> function must return the + number of bytes written into the <parameter>page</parameter>. + </para> + + <para> + <xref linkend="example"/> shows how to use a read call back + function. + </para> + </sect1> + + + + + <sect1> + <title>Writing data</title> + + <para> + The write call back function allows a userland process to write + data to the kernel, so it has some kind of control over the + kernel. The write function should have the following format: + </para> + + <funcsynopsis> + <funcprototype> + <funcdef>int <function>write_func</function></funcdef> + <paramdef>struct file* <parameter>file</parameter></paramdef> + <paramdef>const char* <parameter>buffer</parameter></paramdef> + <paramdef>unsigned long <parameter>count</parameter></paramdef> + <paramdef>void* <parameter>data</parameter></paramdef> + </funcprototype> + </funcsynopsis> + + <para> + The write function should read <parameter>count</parameter> + bytes at maximum from the <parameter>buffer</parameter>. Note + that the <parameter>buffer</parameter> doesn't live in the + kernel's memory space, so it should first be copied to kernel + space with <function>copy_from_user</function>. The + <parameter>file</parameter> parameter is usually + ignored. <xref linkend="usingdata"/> shows how to use the + <parameter>data</parameter> parameter. + </para> + + <para> + Again, <xref linkend="example"/> shows how to use this call back + function. + </para> + </sect1> + + + + + <sect1 id="usingdata"> + <title>A single call back for many files</title> + + <para> + When a large number of almost identical files is used, it's + quite inconvenient to use a separate call back function for + each file. A better approach is to have a single call back + function that distinguishes between the files by using the + <structfield>data</structfield> field in <structname>struct + proc_dir_entry</structname>. First of all, the + <structfield>data</structfield> field has to be initialised: + </para> + + <programlisting> +struct proc_dir_entry* entry; +struct my_file_data *file_data; + +file_data = kmalloc(sizeof(struct my_file_data), GFP_KERNEL); +entry->data = file_data; + </programlisting> + + <para> + The <structfield>data</structfield> field is a <type>void + *</type>, so it can be initialised with anything. + </para> + + &l |