Scripting
Introduction
This page discusses some techniques that a programmer can use to avoid excessive handwork -- either in the case of one time tasks or in the automation of repeated tasks. The best advice anyone can give is this: practice, practice, practice, and more practice! Rest a while, then practice some more!
The following paragraphs are based on the standard Unix text processing tools -- but since these programs are available on almost all platforms these days, that fact shouldn't prevent the suggestions here from being useful in many environments.
This page discusses techniques for writing scripts that work well when developed from the command line. That is, instead of initially bringing up a text editor, the scripts described here are easy to build and experiment with from the command line (sh, bash, ksh, cmd.exe, etc.).
Note that playing around with the command line is essential to good script development. You have to practice, practice, practice! The easiest way to do that is to just type in a command and see what it does. In many cases, command line scripts are both undocumented and undocumentable -- you just let your juices flow and produce a sequence of commands that works. If you are having to spend a lot of time designing, you should be using some other programming language that will allow all this work to be expressed at the full speed of a compiled language.
Of course, you'll eventually have to copy the command lines into a file and save the script for later use, but experimentation does not require it -- and you can learn a lot from these experiments. (In saying this, of course, I'm assuming that your command line interpreter has a way of recalling previously typed lines and editing before re-running them.)
Filter programming
There's an old joke that goes something like this: "If all you have is a hammer, the whole world looks like a nail." This joke makes fun of people who limit themselves to just one way of thinking. However, mathematicians do this all the time: they reduce really complex problems into a collection of little problems that they already know how to solve -- then they quit, because those problems are already solved. The speed and ease of developing programs will benefit in this same way: reduce the big problem into a series of well understood steps, and you'll get it done much faster.
The principle technique that will be discussed in this page is called "filter programming." This basically means that programs are constructed as a series of filters that:
- select only interesting parts of their standard input file and
- transform those parts in some way before
- writing to standard out.
Filters, when combined with command line "for" loops, can solve many real world problems quickly and easily.
The easiest way to use filter programs is to string program invocations together using the command line interpreter's pipe operator, "|". The pipe symbol universally means to take the standard output of one program and use it as the standard input to another. Consider:
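For example (a sketch -- the file's contents don't matter here, only its name does):
echo "hello" >fred.c
ls *.c | grep fred | sed -e 's/fred/RED/'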
This particular pair of command lines should print something like "RED.c". The first command, above, creates a file named "fred.c" in the current directory. The second command has 2 pipe symbols. The first pipe forces the output of the ls (list directory) command to be the input of the grep (string finding) program. The second pipe symbol forces the output of grep to be the standard input of the sed (string replacing) command, which performs the fred-to-RED substitution.
Now, if there were more .c files with "fred" in their names, the output of the above pipe would also contain those names with the RED substitution.
A very, very, common piping sequence used in sh, bash, or ksh looks like this:
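Its general shape is something like this (a sketch -- the pattern, log file name, and variable name are all placeholders):
#
# sh, bash, and ksh sketch
#
grep 'Created By:' describe.log |
sed -e 's/.*Created By: *//' |
while read user
do
    echo "found user: $user"
done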
Here's how the above should be interpreted:
- grep is used to select interesting lines from some file
- sed is used to eliminate unnecessary text from the selected lines
- the remaining text is fed line by line to the "while" command line interpreter command which splits the words on the lines into named pieces (variables) and processes each piece using additional shell commands
Note that the cmd.exe interpreter does provide a way to run a program or command line pipe and process its output, line by line. Here's how:
for /F "usebackq" %line in( `grep "fred"` ) do {
echo %line%
}
Warning: Sadly, the geniuses at Microsoft who added this useful option to the for command neglected to put in support for the | operator -- thus you cannot write this:
for /F "usebackq" %line in( `ls *.c | grep "fred"` ) do {
echo %line%
}
Use "cmd.exe /c help for" for more details.
Note that filters don't necessarily reduce the amount of data that is being processed. For example, you could begin with a list of directory names fed to some filter and have it list all the files in each directory and pass that as the output -- thus greatly increasing the amount of data.
Don't get the idea that filter programming is only about text replacements -- although that is a key component. Many tasks that involve making lists and processing each item -- particularly in multiple steps -- are well suited to filter programming.
A key component of efficient script writing is understanding which of the programs in the toolset mentioned below can be efficiently used to accomplish needed tasks. Not that efficiency is all that high a priority. Any task that is only going to be performed once doesn't really need to be efficient. Tasks that can be run sometime between 2 and 3 am every night and will be done by 8 am are unlikely to need much efficiency either. Of course there are always exceptions -- just as there are other programming languages.
Example Script Fragments
Paragraphs in this section are devoted to many different kinds of script fragments and their explanations.
Processing all files with particular words in them
Let's say that you want to delete all the files in the current directory with the word "trash" in them. The fast way is to use grep to make a list of said files and process the files in that list -- but let's look at some other ways first.
The slow way -- invoke grep once for each file:
#
# sh, bash, and ksh example
#
for f in *
do
    grep -w trash "$f" >/dev/null    # exit status 0 means "trash" was found
    if [ $? = 0 ]
    then
        rm -f "$f"
    fi
done
Here's a faster way that only invokes grep 1 time:
#
# sh, bash, and ksh example
#
rm -f `grep -wl trash *`
Warning: This won't work properly if the files in the current directory have blanks in their names!
To solve the "blanks in the name" problem, you can write this script fragment:
#
# sh, bash, and ksh example
#
grep -wl trash * |
while read name
do
    rm -f "$name"
done
rem
rem windows example
rem
for /F "usebackq" %f in (`grep -wl trash *.*`) do del %f
Making lists from unwieldy log files
A common step in filter programming is to take the output of some program that has some useful data in it and then delete all the uninteresting stuff using grep and sed. You then either do further filtering and transformation, or apply some sort of for loop to the leftovers. For example, suppose you had a log file produced by a program designed to describe files in a way that is attractive for human beings, but you want to grab some of the data out of the log and use it as a list of objects to process. Maybe you are using a configuration management system, such as Rational's "clearcase." It has a command called "describe" that produces all kinds of information about a file sitting in your directory. One of the many lines of output it produces is the name of the user who actually created the file version that you have sitting in your directory. Suppose your goal is to list the names of the files in your directory along with the user that created them, and you only want to look at that specific bit of information.
Let's further say the output from describe looks something like this (in reality clearcase
produces a much different output, but we're just doing a thought experiment here.)
File: junk.cpp
Version: /main/branch/LATEST
Created By: userBob
Description:
Blah blah blah
Derived from: /main/branch/14
more blah
and still more blah
The goal then is to use filter programming techniques to process each file in the directory using the describe command and pipe the output thereof to a sequence of filters that produces output that looks like this:
userBob junk.cpp
frank main.h
sysadmin size.log
The basic filtering and transformation steps are:
- Call the "describe" program on all files in the directory -- either one at a time using a for loop or in a single invocation, the describe program happens to support it -- lets pretend that it does not.
- Extract from the output those lines associated with each file which
are of interest. In this case, we are interested in the lines containing
- File:
- Created By:
- We now have pairs of lines for each file and need to join them together to format the output data -- but the lines are in the wrong order! But how?
Even in sh, bash, or ksh, reading the lines in pairs and echoing them out would be very slow. Another way to accomplish the task would be to eliminate all breaks between lines and then use other clues -- such as the word "File:" at the beginning of each pair to help create the proper format. An easy way to get rid of the end of line markers is to use the tr command. It translates characters from its standard input file to its standard output file.
One translation that we could do would be to translate line breaks
into spaces (joining all lines in the file into one long line) -- like this:
tr '\n' ' ' <inputFile
Alternatively, the -d option deletes the line breaks outright, without putting anything in their place:
tr -d '\n' <inputFile
Let's say that we have a file named /tmp/one.txt and we want to eliminate the line breaks and store it in /tmp/two.txt. Here is how we'd write the tr invocation:
tr '\n' ' ' </tmp/one.txt >/tmp/two.txt
So, if /tmp/one.txt contained these lines:
File: f.cpp
Created By: Bob
File: m.h
Created By: hank
then, after running tr, /tmp/two.txt would contain this single long line:
File: f.cpp Created By: Bob File: m.h Created By: hank
Given this line, we can now construct the output format we like using sed and tr (again). We need to use tr again so that we can put line breaks back in where they go!
The sed program is needed of course to eliminate the words "File:" and "Created By:" and is also needed to re-order the text so that the name of the creating user appears before the created file. We'll also leave a marker in the text, a single character, so that we can use tr to translate the marker into a line break.
This is a relatively sophisticated use of sed, but the general idea here is to replace all patterns that look like this:
File: filename Created By: userName
with text that looks like this:
userName filename|
Here is an invocation of sed that will accomplish the above task:
sed -e "s/File: *\([^ ]\+\) *Created By: \([^ ]\+\)/\2 \1|/g"
See this page for resources on how to invoke sed. It takes practice.
So, invoking sed using the above options on the file currently stored in /tmp/two.txt and writing the output to /tmp/three.txt will give the following contents (in /tmp/three.txt):
Bob f.cpp|hank m.h|
Finally, tr is used once more, to translate the | markers into line breaks:
tr '|' '\n' </tmp/three.txt >/tmp/four.txt
so that /tmp/four.txt contains:
Bob f.cpp
hank m.h
Finally, if command line length were no object, we could put all these steps in one single long command line. In sh, bash, and ksh, the pipe operator (|) lets us combine many command lines into one big command. In cmd.exe you are still limited to total line length, so you might not be able to turn all these steps into one giant pipe command. In bash, you'd probably end up with a command script that looks like this:
for f in * ; do describe "$f" ; done |
grep -E '^ *File:|^ *Created By:' |
tr '\n' ' ' |
sed ... |
tr '|' '\n'
Filtering duplicates
Often when using grep to extract text from files -- so as to construct lists -- many instances of a given word or string will appear. Sometimes this is good or harmless, but other times the duplicates become a problem and need to be filtered out. The sort program provides this service -- assuming that you use the -u option to eliminate duplicates. For example, suppose file /tmp/s1.txt has the following contents:
line zero
one little endian
two little endians
three little endians
four little endians
last line
and suppose that extracting the "little endian" phrases from this and similar files has produced a list like this:
little endian
little endians
little endians
little endians
little endian
little endians
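Sorting a list like that with the -u option boils it down to the unique phrases. A minimal sketch, assuming the extracted phrases have been captured in a file (the name /tmp/phrases.txt is made up):
sort -u /tmp/phrases.txt
which would print just:
little endian
little endians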
Matching words from 2 lists
Suppose you have two lists of words -- either stored in files or in command line interpreter environment variables -- and you would like to know which members are common to both lists. Or, alternatively, you might want to find words NOT duplicated. This is easily accomplished with the uniq command. This command has an option that will suppress the output of lines that are NOT duplicated (or that are, if you use a different option). To see which words appear in both lists, merely combine the lists, sort the result, and feed the output to uniq -d. To see only the words that are not duplicated, use the -u option to uniq.
For example, suppose you have the following files containing word lists:
- File one contains the following
roopa
susan
fred
nagaraja
tom
hank
- File two contains the following
sridar
bill
fred
hank
roopa
hank
First remove any duplicates within each list (so that uniq -d reports only the words common to both lists), then merge the sorted lists and keep the duplicates:
sort -u one >one.sorted
sort -u two >two.sorted
sort one.sorted two.sorted | uniq -d
The output is:
fred
hank
roopa
Deleting specific lines from a file
Sometimes reports or lists have lines in them which are known to be unneeded. There are three basic approaches to discarding them:
- Use a while loop to iterate over the list and discard the unneeded items. This is likely to be very slow.
- Use grep -v to filter based on patterns on the lines.
- Use sed line number ranges to discard them.
Since grep is covered elsewhere, this section will discuss using sed to delete the lines. Given that a single sed command can be an intermingling of string replaces with deletes, learning to delete with sed can greatly speed up scripts.
Normally, the sed "string replace" command is the most commonly used. But sed has several other commands of interest:
- delete (d)
- print (p)
All sed commands, even string replace, can be restricted to certain lines in the file -- that is, certain ranges of lines.
For example, if you need to delete the first 10 lines in a file, you can write a sed command like this:
sed -e '1,10d' <file1 >file2
Or, to delete everything from line 20 through the end of the file:
sed -e '20,$d' <file1 >file2
In both these cases, the delete command is restricted to specific ranges of lines in the input file. In the first case, the lines deleted are in the range "from one to ten", and in the second case the range is "from 20 to the end of the file". The last line of the file is represented in a sed range by the character "$". This is not a regular expression; it is just a symbol for the end of the file.
But, sed also lets you process lines in a range defined by a beginning regular expression and an ending regular expression. For example:
sed -e '/fred/,/bill/d' <file1 >file2
This deletes every block of lines beginning at a line containing "fred" and ending at the next line containing "bill" (inclusive).
String substitutions can also be restricted to ranges: for example, suppose you want to replace numbers with #'s -- but only on lines containing frank. You could do this:
sed -e '/frank/s/[0-9]\+/####/g' <file1 >file2
It is also possible to delete all the lines in the sed input file. For example:
sed -e 'd'
By itself that is not very useful, but combine it with the print command and you can emulate grep: print the lines containing "fred", then delete everything (the d also suppresses the normal automatic printing of each line):
sed -e '/fred/p' -e 'd'
Sed is so important in script writing that its man page
deserves repeated viewing.
Iterating over Directories
Normally, when processing files, you want to skip over the directories -- the
Windows "for" command in cmd.exe does this for you automatically, but in
sh, bash, and
ksh, you must do this yourself. Here's how:
for f in *
do
    if [ ! -d "$f" ]
    then
        echo "$f" is not a directory
    fi
done
On Windows, here's how to operate on directories in cmd.exe:
FOR /D %variable IN (set) DO command [command-parameters]
Parsing lines of text in the interpreter
The Bourne shell family of interpreters -- bash, sh, and ksh -- perform text parsing in several situations. They maintain an environment variable, IFS (the Internal Field Separator), which controls how lines are split up into words. Parsing occurs whenever you invoke a command or shell function, whenever you execute a "for" statement, or whenever you execute the "set" statement. The set statement's purpose is primarily to let you override the script's command line options as the script executes, but it also allows you to parse strings as if they were command lines. It stores the parsed tokens in the standard variables: $1, $2, etc. On Windows, cmd.exe doesn't actually parse in this same manner -- but the for command does have an option, /F, that lets you parse and split the lines in a file according to user specified delimiters.
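For example, here is one way to split a colon separated record using IFS and set (a sketch -- the variable names and data are made up):
#
# sh, bash, and ksh example
#
record="bob:staff:/home/bob"
oldIFS=$IFS
IFS=:                   # split on colons instead of whitespace
set -- $record          # the pieces land in $1, $2, $3, ...
IFS=$oldIFS
echo "user=$1 group=$2 home=$3"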
Parsing using these builtin features is very clunky and requires a lot of practice
but understanding that such parsing is possible can eliminate the need to run
separate programs. This can greatly speed up a script that does a lot of tinkering
with text -- and it can make that script run fast enough that you don't feel it necessary
to rewrite the script as a program.
Toolset
The standard scripting helper programs are described below. Executables
and source code for these programs can be found at
sourceforge.net -- search for the programs individually or the package named gnuwin32
to get the whole bundle.
Microsoft provides a free package called Services For Unix (SFU) which contains these same commands (and does include ksh). The cygwin distribution contains bash as well as all the other programs mentioned here.
- sh, bash, ksh, cmd.exe
- (The Bourne shell, bash (Bourne again shell), Korn shell,
and the MS Windows command line interpreter)
These "shells" or command line interpreters exist primarily to allow programs to be executed with user defined arguments but each in its own way has some text processing features that eliminate the need to run extra programs to get string manipulations performed.
The "sh" variants, sh, bash, and ksh are actually very powerful programming environments in their own right -- although they are a bit sluggish compared to a regular program -- or even Perl.
Note that while cmd.exe is not as powerful a command interpreter as the above, it does have some builtin string substitutions, described here, that can replace the uses of basename and dirname. See sh man page, ksh man page, bash man page, or cmd.exe man page.
- ls, dir
- Lists the names of files in a directory in various formats. If you are on Windows, there is a "dir" command line option (/b) that will let you list just the names of the files in the directory with no other information. Use that in place of ls in the examples on this page. See ls man page.
- grep (Global Regular Expression Print)
- This program lets you select lines from files that match complex patterns. After the shell, this is the single most useful program for scripting purposes. See gnu grep man page.
- sed (Stream EDitor)
- This program lets you perform very complex string replacements on
lines in a file -- or on ranges of lines in a file. Sed documentation
can be found at The sed Home Page.
See gnu sed man page.
Note that the GNU version of sed can handle far longer lines in files than the standard unix version. You should get the GNU version from sourceforge.net.
Be advised that text files often have tabs in them and that it is rarely easy to write a sed expression that involves tabs. The best thing to do is to use the expand program, described below, to convert tabs into spaces.
- tr
- The tr command's primary function is to perform character by character substitutions on its input data then write the modified data to its standard output file. This command lets you specify character sets in the substitution logic. See gnu tr man page.
- echo
- Prints its command line arguments to stdout -- and, more importantly, can be used to echo the names of files matching a wildcard pattern (for example, echo *.c). This can be faster than ls but has simpler output capabilities. See echo man page.
- find (Find files)
- Prints the names of files in a directory tree that match a pattern.
See find man page.
Note that when you use cmd.exe, the "for" command has the ability to do the same things:
for /R %f in ( patterns ) do echo %f
- cut (Splits lines in a file into fields)
- The cut command lets you select fields from the lines in a file and print them to stdout. You can split based on character positions or based on delimiter characters. See cut man page.
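For instance, assuming a Unix-style /etc/passwd whose fields are separated by colons, this prints each account's login name and home directory:
cut -d: -f1,6 /etc/passwd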
- csplit (Context Split)
- Splits up a single file into many files based on sections defined by regular expressions. See csplit man page.
- sort
- Sorts, merges, and filters duplicate lines in a file. This command has many non-obvious uses. Can sort based on fields in the input data. Can treat fields as numbers and sort on them in numerical rather than character order. See sort man page.
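For instance, to sort a colon separated /etc/passwd numerically on its third field (the numeric user id):
sort -t: -k3,3n /etc/passwd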
- uniq
- Assumes its input file has been sorted and then filters either duplicates or non-duplicates depending on command line options. See man page.
- basename (remove the directory part of a pathname)
- Prints the file basename part of its command line argument. Basename can also be used to strip the file name extension off of a filename -- leaving only the root part of the filename. When using cmd.exe, use its builtin string substitutions instead. See basename man page.
- dirname (remove filename part of a pathname)
- Strips the filename part off of its command line parameter -- leaving only the directory name. When using cmd.exe, use its builtin string substitutions instead. See dirname man page.
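A quick sketch of both commands (the path here is made up):
basename /home/bob/report.txt          # prints: report.txt
basename /home/bob/report.txt .txt     # prints: report
dirname /home/bob/report.txt           # prints: /home/bob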
- expand (expand tabs into spaces)
- The expand program reads its standard input file, expands tabs in
each line, then writes the expanded text to its standard output.
It is often necessary to use the expand program as the first stage
of a long pipeline in order for subsequent stages to work
correctly -- tab characters are notoriously hard to pass on
command lines.
See expand man page.
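A typical use is as the first stage of a pipe, so that the later stages only have to match spaces (the file name is made up):
expand <report.txt | grep 'Created By:' | sed -e 's/.*Created By: *//'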
Unfortunately, most text editors don't use tabs as a data compression technique -- instead they use them as a formatting technique. Thus, expanding tabs may mis-align the text compared to that which is seen in a text editor.
Also note the unexpand program, described below, which is used to put tabs into the file instead of taking them out.
- unexpand (compress with tabs)
- The unexpand program replaces leading blanks with tabs in groups of eight. There's rarely a reason to do this, but should you want to, this is how. See unexpand man page.
- fold
- The fold program splits the lines in its standard input file into multiple smaller lines as needed to fit them into a fixed width format. For example, you might use this to limit line length to 60 characters, or 160. The -s option allows you to cause the line splitting to occur on word boundaries. See fold man page.
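For instance, to re-wrap a file into lines of at most 60 characters, breaking at word boundaries (the file name is made up):
fold -s -w 60 <notes.txt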
- xargs
- This program reads words (not lines!) from stdin and assumes that each is meant
to be used as a parameter to a program which is specified on the xargs
command line. It then formats invocations to the program using one or more
words per call so that eventually all words in the standard input file to
xargs get passed to the program.
See xargs man page.
Here's an example:
#
# sh, bash, and ksh example
#
find . -name '*' -print | xargs echo
xargs then suffers from a serious flaw -- if your filenames have spaces in the name, you just can't use it. Instead you could write your own xargs program that puts quotes around the parameters before invoking echo. Other approaches are possible.
On the other hand, if your filenames don't have spaces in them, xargs works great just like it is.
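If your find and xargs are the GNU versions, there is also a standard fix for the spaces problem -- the -print0 and -0 options pass the file names separated by NUL characters instead of whitespace, so embedded blanks survive intact:
#
# sh, bash, and ksh example -- requires find -print0 and xargs -0
#
find . -type f -print0 | xargs -0 echo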
- perl
- Yuck. Read a book if you are interested. I won't mention it again (Well, maybe once: perl lets you replace strings which span the boundaries of lines in a text file.) See perl documents page.