When you run awk, you specify an awk program which tells awk what to do. The program consists of a series of rules. (It may also contain function definitions, but that is an advanced feature, so let's ignore it for now. See section User-defined Functions.) Each rule specifies one pattern to search for, and one action to perform when that pattern is found.
Syntactically, a rule consists of a pattern followed by an action. The action is enclosed in curly braces to separate it from the pattern. Rules are usually separated by newlines. Therefore, an awk program looks like this:
pattern { action }
pattern { action }
...
The following command runs a simple awk program that searches the input file `BBS-list' for the string of characters: `foo'. (A string of characters is usually called, quite simply, a string. The term string is perhaps based on similar usage in English, such as "a string of pearls," or, "a string of cars in a train.")
awk '/foo/ { print $0 }' BBS-list
When lines containing `foo' are found, they are printed, because `print $0' means print the current line. (Just `print' by itself also means the same thing, so we could have written that instead.)
You will notice that slashes, `/', surround the string `foo' in the actual awk program. The slashes indicate that `foo' is a pattern to search for. This type of pattern is called a regular expression, and is covered in more detail later (see section Regular Expressions as Patterns). There are single-quotes around the awk program so that the shell won't interpret any of it as special shell characters.
Here is what this program prints:
fooey 555-1234 2400/1200/300 B
foot 555-6699 1200/300 B
macfoo 555-6480 1200/300 A
sabafoo 555-2127 1200/300 C
In an awk rule, either the pattern or the action can be omitted, but not both. If the pattern is omitted, then the action is performed for every input line. If the action is omitted, the default action is to print all lines that match the pattern.
Thus, we could leave out the action (the print statement and the curly braces) in the above example, and the result would be the same: all lines matching the pattern `foo' would be printed. By comparison, omitting the print statement but retaining the curly braces makes an empty action that does nothing; then no lines would be printed.
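The cases described above can be tried on a small stand-in for `BBS-list'. The sketch below is an illustration only: it pipes three records (copied from the sample output in this section) into awk instead of reading the real file, and the variable name records is made up for the example.

```shell
# Three sample records; only the last two contain "foo".
records='aardvark 555-5553 1200/300 B
fooey 555-1234 2400/1200/300 B
foot 555-6699 1200/300 B'

# Pattern and action both present.
printf '%s\n' "$records" | awk '/foo/ { print $0 }'

# Action omitted: matching lines are printed by default.
printf '%s\n' "$records" | awk '/foo/'

# Pattern omitted: the action runs for every line.
printf '%s\n' "$records" | awk '{ print $1 }'

# Empty action: nothing is printed.
printf '%s\n' "$records" | awk '/foo/ { }'
```

The first two commands print the same two lines, the third prints the first field of all three records, and the last prints nothing.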
After processing all the rules (perhaps none) that match the line, awk reads the next line (however, see section The next Statement). This continues until the end of the file is reached.
For example, the awk program:
/12/ { print $0 }
/21/ { print $0 }
contains two rules. The first rule has the string `12' as the pattern and `print $0' as the action. The second rule has the string `21' as the pattern and also has `print $0' as the action. Each rule's action is enclosed in its own pair of braces.
This awk program prints every line that contains the string `12' or the string `21'. If a line contains both strings, it is printed twice, once by each rule.
If we run this program on our two sample data files, `BBS-list' and `inventory-shipped', as shown here:
awk '/12/ { print $0 }
     /21/ { print $0 }' BBS-list inventory-shipped
we get the following output:
aardvark 555-5553 1200/300 B
alpo-net 555-3412 2400/1200/300 A
barfly 555-7685 1200/300 A
bites 555-1675 2400/1200/300 A
core 555-2912 1200/300 C
fooey 555-1234 2400/1200/300 B
foot 555-6699 1200/300 B
macfoo 555-6480 1200/300 A
sdace 555-3430 2400/1200/300 A
sabafoo 555-2127 1200/300 C
sabafoo 555-2127 1200/300 C
Jan 21 36 64 620
Apr 21 70 74 514
Note how the line in `BBS-list' beginning with `sabafoo' was printed twice, once for each rule.
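The double printing can be reproduced with a single record. The sketch below also shows one possible way to print such a line only once, by combining the two regular expressions in a single pattern with `||'. That operator is not introduced in this section, so treat the second command as a preview rather than as the method used here.

```shell
line='sabafoo 555-2127 1200/300 C'

# Two rules: the record matches both, so it is printed twice.
printf '%s\n' "$line" | awk '/12/ { print $0 }
                             /21/ { print $0 }'

# One rule with a combined pattern: the record is printed once.
printf '%s\n' "$line" | awk '/12/ || /21/ { print $0 }'
```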
ls -l | awk '$5 == "Nov" { sum += $4 }
             END { print sum }'
This command prints the total number of bytes in all the files in the current directory that were last modified in November (of any year). (In the C shell you would need to type a semicolon and then a backslash at the end of the first line; in the Bourne shell or the Bourne-Again shell, you can type the example as shown.)
The `ls -l' part of this example is a command that gives you a full listing of all the files in a directory, including file size and date. Its output looks like this:
-rw-r--r-- 1 close 1933 Nov 7 13:05 Makefile
-rw-r--r-- 1 close 10809 Nov 7 13:03 gawk.h
-rw-r--r-- 1 close 983 Apr 13 12:14 gawk.tab.h
-rw-r--r-- 1 close 31869 Jun 15 12:20 gawk.y
-rw-r--r-- 1 close 22414 Nov 7 13:03 gawk1.c
-rw-r--r-- 1 close 37455 Nov 7 13:03 gawk2.c
-rw-r--r-- 1 close 27511 Dec 9 13:07 gawk3.c
-rw-r--r-- 1 close 7989 Nov 7 13:03 gawk4.c
The first field contains read-write permissions, the second field contains the number of links to the file, and the third field identifies the owner of the file. The fourth field contains the size of the file in bytes. The fifth, sixth, and seventh fields contain the month, day, and time, respectively, that the file was last modified. Finally, the eighth field contains the name of the file.
The $5 == "Nov" in our awk program is an expression that tests whether the fifth field of the output from `ls -l' matches the string `Nov'. Each time a line has the string `Nov' in its fifth field, the action `{ sum += $4 }' is performed. This adds the fourth field (the file size) to the variable sum. As a result, when awk has finished reading all the input lines, sum is the sum of the sizes of files whose lines matched the pattern.
After the last line of output from ls has been processed, the END rule is executed, and the value of sum is printed. In this example, the value of sum would be 80600.
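The output of `ls -l' differs between systems; many versions print a group column after the owner, which would move the size to the fifth field and the month to the sixth. For that reason the sketch below does not run ls at all. It feeds awk the listing shown above from a here-document, so the field numbers match this section and not necessarily your own system.

```shell
# Sum the sizes (field 4) of the files last modified in November (field 5).
awk '$5 == "Nov" { sum += $4 }
     END { print sum }' <<'EOF'
-rw-r--r-- 1 close 1933 Nov 7 13:05 Makefile
-rw-r--r-- 1 close 10809 Nov 7 13:03 gawk.h
-rw-r--r-- 1 close 983 Apr 13 12:14 gawk.tab.h
-rw-r--r-- 1 close 31869 Jun 15 12:20 gawk.y
-rw-r--r-- 1 close 22414 Nov 7 13:03 gawk1.c
-rw-r--r-- 1 close 37455 Nov 7 13:03 gawk2.c
-rw-r--r-- 1 close 27511 Dec 9 13:07 gawk3.c
-rw-r--r-- 1 close 7989 Nov 7 13:03 gawk4.c
EOF
```

Five of the eight lines match, and 1933 + 10809 + 22414 + 37455 + 7989 gives the 80600 quoted above.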
These more advanced awk techniques are covered in later sections (see section Actions: Overview). Before you can move on to more advanced awk programming, you have to know how awk interprets your input and displays your output. By manipulating fields and using print statements, you can produce some very useful and spectacular looking reports.
There are several ways to run an awk program. If the program is short, it is easiest to include it in the command that runs awk, like this:
awk 'program' input-file1 input-file2 ...
where program consists of a series of patterns and actions, as described earlier.
When the program is long, you would probably prefer to put it in a file and run it with a command like this:
awk -f program-file input-file1 input-file2 ...
awk 'program' input-file1 input-file2 ...
where program consists of a series of patterns and actions, as described earlier.
This command format tells the shell to start awk and use the program to process records in the input file(s). There are single quotes around the program so that the shell doesn't interpret any awk characters as special shell characters. They cause the shell to treat all of program as a single argument for awk. They also allow program to be more than one line long.
This format is also useful for running short or medium-sized awk programs from shell scripts, because it avoids the need for a separate file for the awk program. A self-contained shell script is more reliable since there are no other files to misplace.
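As a minimal sketch of this format, the fragment below could be the body of a shell script. The program spans several lines inside one pair of single quotes. The two input records are taken from the `inventory-shipped' output shown earlier and are supplied through a pipe in place of input files; that substitution is an assumption made to keep the example self-contained.

```shell
printf 'Jan 21 36 64 620\nApr 21 70 74 514\n' |
awk '
    # Add up the second field of every record.
    { total += $2 }
    # Print the result after the last record.
    END { print total }
'
```

With these two records the script prints 42.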
You can also use awk without any input files. If you type the command line:
awk 'program'
then awk applies the program to the standard input, which usually means whatever you type on the terminal. This continues until you indicate end-of-file by typing Control-d.
For example, if you execute this command:
awk '/th/'
whatever you type next is taken as data for that awk program. If you go on to type the following data:
Kathy
Ben
Tom
Beth
Seth
Karen
Thomas
Control-d
then awk prints this output:
Kathy
Beth
Seth
as matching the pattern `th'. Notice that it did not recognize `Thomas' as matching the pattern. The awk language is case sensitive, and matches patterns exactly. (However, in gawk you can override this with the variable IGNORECASE. See section Case-sensitivity in Matching.)
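The same session can be reproduced without typing at the terminal by piping the names into awk. The second command is a sketch of a portable way to ignore case, using the standard tolower function instead of IGNORECASE; it is an alternative chosen for this illustration, not the mechanism referred to above.

```shell
names='Kathy
Ben
Tom
Beth
Seth
Karen
Thomas'

# Case-sensitive match: prints Kathy, Beth and Seth.
printf '%s\n' "$names" | awk '/th/'

# Lower-case each record before matching: also prints Thomas.
printf '%s\n' "$names" | awk 'tolower($0) ~ /th/'
```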
Sometimes your awk programs can be very long. In this case it is more convenient to put the program into a separate file. To tell awk to use that file for its program, you type:
awk -f source-file input-file1 input-file2 ...
The `-f' tells the awk utility to get the awk program from the file source-file. Any file name can be used for source-file. For example, you could put the program:
/th/
into the file `th-prog'. Then this command:
awk -f th-prog
does the same thing as this one:
awk '/th/'
which was explained earlier (see section Running awk without Input Files). Note that you don't usually need single quotes around the file name that you specify with `-f', because most file names don't contain any of the shell's special characters.
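A short sketch of the whole sequence follows. It writes `th-prog' into a scratch directory created with mktemp (an assumption made so the example does not leave files behind) and supplies the data on standard input.

```shell
dir=$(mktemp -d)    # scratch directory for the illustration

# Put the one-line program into a file named th-prog.
printf '%s\n' '/th/' > "$dir/th-prog"

# Run the program from the file; the data arrives on standard input.
printf 'Kathy\nBen\nBeth\n' | awk -f "$dir/th-prog"

rm -r "$dir"
```

This prints Kathy and Beth, just as `awk '/th/'' would for the same input.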
If you want to identify your awk program files clearly as such, you can add the extension `.awk' to the file name. This doesn't affect the execution of the awk program, but it does make "housekeeping" easier.
For example, you could create a text file named `hello', containing the following (where `BEGIN' is a feature we have not yet discussed):
#! /bin/awk -f
# a sample awk program
BEGIN { print "hello, world" }
After making this file executable (with the chmod command), you can simply type:
hello
at the shell, and the system will arrange to run awk as if you had typed:
awk -f hello
Self-contained awk scripts are useful when you want to write a program which users can invoke without knowing that the program is written in awk.
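The location of awk varies from system to system, so `/bin/awk' may not exist on yours. The sketch below works around that by asking the shell where awk lives (with `command -v') and writing that path into the first line of the script. It assumes a system that supports `#!' and a scratch directory from which programs may be executed; the variable names are made up for the example.

```shell
awk_path=$(command -v awk)    # e.g. /usr/bin/awk; differs between systems
dir=$(mktemp -d)              # scratch directory for the illustration

# Write the script, using the discovered path on the #! line.
cat > "$dir/hello" <<EOF
#!$awk_path -f
# a sample awk program
BEGIN { print "hello, world" }
EOF

chmod +x "$dir/hello"    # make the file executable
"$dir/hello"             # run it like any other command

rm -r "$dir"
```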
If your system does not support the `#!' mechanism, you can get a similar effect using a regular shell script. It would look something like this:
: The colon makes sure this script is executed by the Bourne shell.
awk 'program' "$@"
Using this technique, it is vital to enclose the program in single quotes to protect it from interpretation by the shell. If you omit the quotes, only a shell wizard can predict the result.
The `"$@"' causes the shell to forward all the command line arguments to the awk program, without interpretation. The first line, which starts with a colon, is used so that this shell script will work even if invoked by a user who uses the C shell.
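To see the argument forwarding at work, the sketch below builds such a wrapper around the `/th/' program and passes it a data file whose name contains a space. The file names th-wrapper and `name list' are hypothetical, and the wrapper is run with sh explicitly so the example does not depend on how your system starts scripts.

```shell
dir=$(mktemp -d)    # scratch directory for the illustration

# The wrapper: a colon line, then awk with its program in single quotes.
cat > "$dir/th-wrapper" <<'EOF'
: run by the Bourne shell
awk '/th/' "$@"
EOF

# A data file whose name contains a space.
printf 'Kathy\nBen\nSeth\n' > "$dir/name list"

# "$@" hands the file name to awk as a single argument.
sh "$dir/th-wrapper" "$dir/name list"

rm -r "$dir"
```

The wrapper prints Kathy and Seth. Had the script used an unquoted $* instead, the shell would have split the file name at the space and awk would have looked for two files.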
In the awk language, a comment starts with the sharp sign character, `#', and continues to the end of the line. The awk language ignores the rest of a line following a sharp sign. For example, we could have put the following into `th-prog':
# This program finds records containing the pattern `th'. This is how
# you continue comments on additional lines.
/th/
You can put comment lines into keyboard-composed throw-away awk programs also, but this usually isn't very useful; the purpose of a comment is to help you or another person understand the program at another time.
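A comment may also follow a rule on the same line. The sketch below creates a commented program file with a here-document and runs it; the quoted `EOF' delimiter is there so that the shell copies the text unchanged, and the scratch directory is an assumption made for the illustration.

```shell
dir=$(mktemp -d)    # scratch directory for the illustration

cat > "$dir/th-prog" <<'EOF'
# Print records that contain "th".
/th/    # no action, so matching records are printed
EOF

printf 'Kathy\nTom\n' | awk -f "$dir/th-prog"

rm -r "$dir"
```

Only Kathy is printed; awk ignores both comments.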
Most often, each line in an awk program is a separate statement or a separate rule, like this:
awk '/12/ { print $0 }
     /21/ { print $0 }' BBS-list inventory-shipped
But sometimes statements can be more than one line, and lines can contain several statements. You can split a statement into multiple lines by inserting a newline after any of the following:
, { ? : || && do else
A newline at any other point is considered the end of the statement.
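As an illustration, the rule below is split after `&&', after `{', and after a comma. The three input records are taken from the `BBS-list' output shown earlier; the comparison against `1200/300' is made up for the example.

```shell
printf '%s\n' 'aardvark 555-5553 1200/300 B' \
              'barfly 555-7685 1200/300 A' \
              'fooey 555-1234 2400/1200/300 B' |
awk '$4 == "B" &&
     $3 == "1200/300" {
         print $1,
               $2
     }'
```

Only the first record satisfies both comparisons, so the output is `aardvark 555-5553'.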
If you would like to split a single statement into two lines at a point where a newline would terminate it, you can continue it by ending the first line with a backslash character, `\'. This is allowed absolutely anywhere in the statement, even in the middle of a string or regular expression. For example:
awk '/This program is too long, so continue it\
on the next line/ { print $1 }'
We have generally not used backslash continuation in the sample programs in this manual. Since there is no limit on the length of a line, it is never strictly necessary; it just makes programs more readable. We have preferred to improve readability further by keeping the statements short. Backslash continuation is most useful when your awk program is in a separate source file, instead of typed in on the command line.
Warning: backslash continuation does not work as described above with the C shell. Continuation with backslash works for awk programs in files, and also for one-shot programs provided you are using the Bourne shell or the Bourne-again shell. But the C shell used on Berkeley Unix behaves differently! There, you must use two backslashes in a row, followed by a newline.
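Here is a small sketch of backslash continuation in a one-shot program, assuming the Bourne shell or a compatible one. The assignment is broken before the `+', a point where a bare newline would end the statement, so the backslash is what joins the two lines.

```shell
# The backslash joins the two lines into one statement: total = $1 + $2
printf '3 4\n' | awk '{ total = $1 \
                                + $2
                        print total }'
```

For the input line `3 4' this prints 7.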
When awk statements within one rule are short, you might want to put more than one of them on a line. You do this by separating the statements with semicolons, `;'. This also applies to the rules themselves. Thus, the above example program could have been written:
/12/ { print $0 } ; /21/ { print $0 }
Note: the requirement that rules on the same line must be separated with a semicolon is a recent change in the awk language; it was done for consistency with the treatment of statements within an action.
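Both uses of the semicolon appear in the sketch below: one separates two statements inside an action, the other separates two rules on the same line. The variable names count and total are made up for the example, and the input records come from the `inventory-shipped' output shown earlier.

```shell
printf 'Jan 21 36 64 620\nApr 21 70 74 514\n' |
awk '{ count += 1; total += $2 } ; END { print count, total }'
```

This prints `2 42': two records were read, and their second fields add up to 42.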
What use is all of this to me, you might ask? Using additional utility programs, more advanced patterns, field separators, arithmetic statements, and other selection criteria, you can produce much more complex output. The awk language is very useful for producing reports from large amounts of raw data, such as summarizing information from the output of other utility programs such as ls. (See section A More Complex Example.)
Programs written with awk are usually much smaller than they would be in other languages. This makes awk programs easy to compose and use. Often awk programs can be quickly composed at your terminal, used once, and thrown away. Since awk programs are interpreted, you can avoid the usually lengthy edit-compile-test-debug cycle of software development.
Complex programs have been written in awk, including a complete retargetable assembler for 8-bit microprocessors (see section Glossary, for more information) and a microcode assembler for a special purpose Prolog computer. However, awk's capabilities are strained by tasks of such complexity.
If you find yourself writing awk scripts of more than, say, a few hundred lines, you might consider using a different programming language. Emacs Lisp is a good choice if you need sophisticated string or pattern matching capabilities. The shell is also good at string and pattern matching; in addition, it allows powerful use of the system utilities. More conventional languages, such as C, C++, and Lisp, offer better facilities for system programming and for managing the complexity of large programs. Programs in these languages may require more lines of source code than the equivalent awk programs, but they are easier to maintain and usually run more efficiently.