Scripting

 
Scripting helper programs
cmd.exe variable substitutions
Selecting text from lines
Eliminating line breaks
Re-ordering text
Filtering duplicates
Matching words from 2 lists
Tab characters in files
Deleting specific lines
Iterating over directories
Parsing text lines

Introduction

This page discusses some techniques that a programmer can use to avoid excessive handwork -- either for one-time tasks or in the automation of repeated tasks.

The best advice anyone can give is this: practice, practice, practice, and more practice! Rest a while, then practice some more!

The following paragraphs are based on the standard Unix text processing tools -- but since these programs are available on almost all platforms these days, that fact shouldn't prevent the suggestions here from being useful in many environments.

This page discusses techniques for writing scripts that work well when developed from the command line. That is, instead of initially bringing up a text editor, the scripts described here are easy to build and experiment with from the command line (sh, bash, ksh, cmd.exe, etc.).

Note that playing around with the command line is essential to good script development. You have to practice, practice, practice! The easiest way to do that is to just type in a command and see what it does. In many cases, command line scripts are both undocumented and undocumentable -- you just let your juices flow and produce a sequence of commands that works. If you find yourself spending a lot of time on design, you should probably be using some other programming language -- one that lets all that work run at the full speed of a compiled program.

Of course, you'll eventually have to copy the command lines into a file and save the script for later use, but experimentation does not require it -- and you can learn a lot from these experiments. (In saying this, of course, I'm assuming that your command line interpreter has a way of recalling previously typed lines and editing before re-running them.)

Filter programming

There's an old joke that goes something like this: "If all you have is a hammer, the whole world looks like a nail." This joke makes fun of people who limit themselves to just one way of thinking.

However, mathematicians do this all the time: they reduce really complex problems into a collection of little problems that they already know how to solve -- then they quit, because those problems are already solved. Script development benefits from the same habit: reduce the big problem into a series of well understood steps, and you'll get it done much faster.

The principal technique that will be discussed on this page is called "filter programming." This basically means that programs are constructed as a series of filters that:

  1. select only interesting parts of their standard input file and
  2. transform those parts in some way before
  3. writing to standard out.
A sequence of different filters is often strung together with command line piping to accomplish the whole task. Alternatively, the output of each step can be written to a temporary file, as appropriate for the command line interpreter.

Filters, when combined with command line "for" loops, can solve many real world problems quickly and easily.

The easiest way to use filter programs is to string program invocations together using the command line interpreter's pipe operator, "|". The pipe symbol universally means to take the standard output of one program and use it as the standard input to another. Consider:

echo >fred.c

ls *.c | grep fred | sed -e "s/red/RED/g"
This particular pair of command lines should print the following:
fRED.c
This of course assumes that the user has write privilege to the current directory and can create or modify the file fred.c.

The first command, above, creates a file named "fred.c" in the current directory. The second command contains 2 pipe symbols. The first pipe forces the output of the ls (list directory) command to be the input of grep (the string-finding program). The second pipe symbol forces the output of grep to be the standard input of sed (the string-replacing program).

Now, if there were more files with the text "fred" in them, the output of the above pipe would also contain those names with the RED substitution.
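For example, create a second matching file (alfredo.c is just a made-up name containing "fred") and re-run the pipe:

echo >alfredo.c
ls *.c | grep fred | sed -e "s/red/RED/g"
This time two names are printed:
alfREDo.c
fRED.c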

A very, very, common piping sequence used in sh, bash, or ksh looks like this:

grep ... | sed ... |
while read word ....
do
... use word in a command ...
done
Here's how the above should be interpreted: each line produced by the grep/sed pipeline is read into the shell variable word, and the loop body runs once for every line read.
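As a concrete instance of the pattern (reusing the fred.c example from above), this fragment echoes each .c file whose name contains "fred":

#
# sh, bash, and ksh example
#
ls *.c | grep fred |
while read name
do
echo "found: $name"
done
Of course, if you are using cmd.exe, you'll need different syntax (note that cmd.exe for-variables are single letters, written %v on the command line and %%v in batch files):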
grep ... | sed ... >tmp
for /F %v in (tmp) do command %v
del tmp

Note that the cmd.exe interpreter does provide a way to run a program or command line pipe and process its output, line by line. Here's how:

for /F "usebackq" %l in (`grep "fred" *.c`) do (
echo %l
)
Note that the above command contains the "back quote" character, "`", not the forward quote, "'" character.

Warning: Sadly, the geniuses at Microsoft who added this useful option to the for command made the | operator painful to use -- cmd.exe tries to interpret the pipe itself before the for command ever sees it. Thus you cannot write the pipe directly; you must escape it with a caret, like this:

for /F "usebackq" %l in (`ls *.c ^| grep "fred"`) do (
echo %l
)

Use "cmd.exe /c help for" for more details.

Note that filters don't necessarily reduce the amount of data that is being processed. For example, you could begin with a list of directory names fed to some filter and have it list all the files in each directory and pass that as the output -- thus greatly increasing the amount of data.

Don't get the idea that filter programming is only about text replacements -- although that is a key component. Many tasks that involve making lists and processing each item -- particularly in multiple steps -- are right for filter programming.

A key component of efficient script writing is understanding which of the programs in the toolset described below can be used to accomplish the needed tasks. Not that efficiency is an especially high priority. Any task that will only be performed once doesn't really need to be efficient. Tasks that can run sometime between 2 and 3 am every night and be done by 8 am are unlikely to need much efficiency either. Of course there are always exceptions -- just as there are other programming languages.

Example Script Fragments

Paragraphs in this section are devoted to many different kinds of script fragments and their explanations.

Processing all files with particular words in them

Let's say that you want to delete all the files in the current directory with the word "trash" in them. The fast way is to use grep to make a list of said files and then process the files in that list -- but let's look at some other ways first:

The slow way -- invoke grep once for each file:

#
# sh, bash, and ksh example
#
for f in *
do
grep -w trash "$f" >/dev/null
if [ $? = 0 ]
then
rm -f "$f"
fi
done
On Windows, using cmd.exe in a batch file (hence the doubled %%), you could write code like this:
for %%f in ( *.* ) do (
grep -w trash "%%f" >nul
if not errorlevel 1 (
rm -f "%%f"
)
)
In this example, a for loop is used to launch grep once for each file; if grep sets its exit code to indicate that the file contained the word trash (and not just the string trash!), the file is deleted.

Here's a faster way that only invokes grep 1 time:

#
# sh, bash, and ksh example
#
rm -f `grep -wl trash *`
In this example, grep is invoked such that it only prints the names of the files which contain the word "trash", and grep's output is presented as a command line argument list to the rm command, which deletes the files in the list.

Warning: This won't work properly if the files in the current directory have blanks in their names!

To solve the "blanks in the name" problem, you can write this script fragment:

#
# sh, bash, and ksh example
#
grep -wl trash * |
while read name
do
rm -f "$name"
done
These, of course, have been unix examples. To do the same thing without the unix command line interpreters -- if you have cmd.exe and have set the right magic variables to enable its extended syntax -- you can use this approach:
rem
rem windows example
rem
for /F "usebackq" %v in (`grep -wl trash *.*`) do del %v

Making lists from unwieldy log files

A common step in filter programming is to take the output of some program that has useful data in it and then delete all the uninteresting stuff using grep and sed. You then either do further filtering and transformation, or apply some sort of for loop to the leftovers.

For example, suppose you had a log file produced by a program designed to describe files in a way that is attractive to human beings, but you want to grab some of the data out of the log and use it as a list of objects to process. Maybe you are using a configuration management system, such as Rational's ClearCase. It has a command, "describe", that produces all kinds of information about a file sitting in your directory. One of the many lines of output it produces is the name of the user who created the file version that you have sitting in your directory. Suppose your goal is to list the names of the files in your directory along with the user that created each one, and you only want to look at that specific bit of information.

Let's further say the output from describe looks something like this (in reality ClearCase produces much different output, but we're just doing a thought experiment here):

File: junk.cpp
Version: /main/branch/LATEST
Created By: userBob
Description:
Blah blah blah
more blah
and still more blah
Derived from: /main/branch/14

The goal, then, is to use filter programming techniques to process each file in the directory with the describe command and pipe the output thereof to a sequence of filters that produces output that looks like this:

userBob junk.cpp
frank main.h
sysadmin size.log

The basic filtering and transformation steps are:

  1. Call the "describe" program on all files in the directory -- either one at a time using a for loop, or in a single invocation if the describe program happens to support that -- let's pretend that it does not.
  2. Extract from the output those lines associated with each file which are of interest. In this case, we are interested in the lines containing
    • File:
    • Created By:
    We need these lines because they give us the filename and the user name for our final output format. We will of course have to delete the words "File:" and "Created By:" in the final output data.
  3. We now have pairs of lines for each file and need to join them together to format the output data -- but the lines are in the wrong order! How?
One way to join the pairs of lines would be to somehow read the lines in pairs and then print them to standard out using the echo command. If you are using cmd.exe, this would require a lot of work -- unless there is some trick of the for command that I am not aware of. If you find out how, let me know. Here's a hint: "help for /F".

Even in sh, bash, or ksh, reading the lines in pairs and echoing them out would be very slow. Another way to accomplish the task is to eliminate all the breaks between lines and then use other clues -- such as the word "File:" at the beginning of each pair -- to help create the proper format. An easy way to get rid of the end-of-line markers is the tr command. It translates characters from its standard input file to its standard output file.

One translation that we could do would be to translate line breaks into spaces (joining all lines in the file into one long line) -- like this:

tr '\n' ' ' <inputFile
Or we could delete them altogether, like this:
tr -d '\n' <inputFile
The tr command is annoying in that you can't specify filenames on its command line -- you have to feed it using either a pipe or the < operator.

Let's say that we have a file named /tmp/one.txt and we want to eliminate the line breaks and store it in /tmp/two.txt. Here is how we'd write the tr invocation:

tr '\n' ' ' </tmp/one.txt >/tmp/two.txt
Let's further suppose that one.txt had the following lines of text in it:
File: f.cpp
Created By: Bob
File: m.h
Created By: hank
The above tr invocation would leave the following contents in /tmp/two.txt
File: f.cpp Created By: Bob File: m.h Created By: hank

Given this one long line, we can now construct the output format we want using sed and tr (again). We need tr again so that we can put line breaks back in where they belong!

The sed program is needed, of course, to eliminate the words "File:" and "Created By:", and it is also needed to re-order the text so that the name of the creating user appears before the created file. We'll also leave a marker in the text -- a single character -- so that we can use tr to translate the marker into a line break.

This is a relatively sophisticated use of sed, but the general idea here is to replace all patterns that look like this:

File: filename Created By: userName
With new text that looks like this:
userName filename|
Where the "|" will be translated into a line break using tr.

Here is an invocation of sed that will accomplish the above task:

sed -e "s/ *File: *\([^ ]\+\) *Created By: *\([^ ]\+\)/\2 \1|/g"
This rather nasty command line employs a regular expression to select the text that needs to be replaced, as part of a command that instructs sed to replace all instances of the matching expression with "\2 \1|" -- which sed interprets as "the second item, followed by a space, followed by the first item, followed by the or-bar (|)". The items are defined by the text inside the \( \) grouping operators in the regular expression.
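Before wiring this sed command into a pipeline, you can test it right from the command line by feeding it the sample line with echo:

echo "File: f.cpp Created By: Bob File: m.h Created By: hank" |
sed -e "s/ *File: *\([^ ]\+\) *Created By: *\([^ ]\+\)/\2 \1|/g"
This prints:
Bob f.cpp|hank m.h|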

See the sed entry in the toolset section below for resources on how to invoke sed. It takes practice.

So, invoking sed using the above options on the file currently stored in /tmp/two.txt and writing the output to /tmp/three.txt will give the following contents (in /tmp/three.txt):

Bob f.cpp|hank m.h|
And, we can then translate the or-bars (|) into line breaks like this:
tr '|' '\n' </tmp/three.txt >/tmp/four.txt
And voila, we have the desired output format in /tmp/four.txt:
Bob f.cpp
hank m.h

Finally, if command line length were no object, we could put all these steps into one single long command line. In sh, bash, and ksh, the pipe operator (|) lets us combine many commands into one big pipeline. In cmd.exe you are still limited by total line length, so you might not be able to turn all these steps into one giant pipe. In bash, you'd probably end up with a command script that looks like this:

for f in * ; do describe "$f" ; done |
grep -E '^ *File:|^ *Created By:' |
tr '\n' ' ' |
sed ... |
tr '|' '\n'
See the sed command line, above.
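Assembled with that sed command filled in, the whole pipeline reads (describe is still our imaginary program):

for f in * ; do describe "$f" ; done |
grep -E '^ *File:|^ *Created By:' |
tr '\n' ' ' |
sed -e "s/ *File: *\([^ ]\+\) *Created By: *\([^ ]\+\)/\2 \1|/g" |
tr '|' '\n'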

Filtering duplicates

Often when using grep to extract text from files -- so as to construct lists -- many instances of a given word or string will appear. Sometimes this is good or harmless, but other times, this becomes a problem.

The sort program provides a remedy -- assuming that you use the -u option to eliminate duplicates. For example, suppose the file /tmp/s1.txt has the following contents:

line zero
one little endian
two little endians
three little endians
four little endians
last line
Further suppose that you were interested only in the words that follow the word "little". If you use grep to extract the lines containing "little", followed by sed to eliminate everything up to the word "little", you'd end up with this:
little endian
little endians
little endians
little endians
If you then piped this output through sort -u you'd get
little endian
little endians
Besides using complex sed commands to eliminate the text before "little" in the data above, the grep command might provide an easier way. The GNU version of grep has an option, -o, that limits its output to ONLY the parts of each line that actually matched the regular expression. Other greps may not have this feature; if yours doesn't, it would behoove you to go to sourceforge.net and get the GNU grep -- even if you have to build it yourself.
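For example, with GNU grep the whole "little" extraction above collapses into one short pipeline:

grep -o "little [^ ]*" /tmp/s1.txt | sort -u
This prints:
little endian
little endians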

Matching words from 2 lists

Suppose you have two lists of words -- either stored in files or in command line interpreter environment variables -- and you would like to know which members are common to both lists. Or, alternatively, you might want to find words NOT duplicated.

This is easily accomplished with the uniq command. This command has an option that suppresses the output of lines that are NOT duplicated (and another that suppresses the ones that are). To see which words appear in both lists, merely combine the lists, sort the result, and feed the output to uniq -d. To see only the words that are not duplicated, use the -u option to uniq.

For example, suppose you have two files, named one and two, each containing a list of words, and that the words fred, hank, and roopa appear in both lists.

To see the common members, first use sort -u on each list individually to remove any duplicates internal to that list. Then sort both lists together -- without -u this time -- and feed the result to uniq -d to get only the duplicated items. Here's a script fragment doing the same:
sort -u one >one.sorted
sort -u two >two.sorted
sort one.sorted two.sorted | uniq -d
This script produces the following output
fred
hank
roopa
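Conversely, to see the words that appear in only one of the two lists, feed the same merged stream to uniq -u:

sort one.sorted two.sorted | uniq -u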

Deleting specific lines from a file

Sometimes reports or lists have lines in them which are known to be unneeded. There are three basic approaches to discarding them:
  1. Use a while loop to iterate over the list and discard the unneeded items. This is likely to be very slow.
  2. Use grep -v to filter based on patterns on the lines.
  3. Use sed line number ranges to discard them.
Sed can also be used to filter based on a regular expression.

Since grep is covered elsewhere, this section will discuss using sed to delete the lines. Given that a single sed command can intermingle string replacements with deletions, learning to delete with sed can greatly speed up scripts.

Normally, the sed "string replace" command is the most commonly used, but sed has several other commands of interest -- most notably d (delete) and p (print), both of which appear below.

All sed commands, even string replace, can be restricted to certain lines in the file -- that is, certain ranges of lines.

For example, if you need to delete the first 10 lines in a file, you can write a sed command like this:

sed -e '1,10d' <file1 >file2
Alternatively, if you know that only the first 20 lines of a file are interesting and the rest are trash, you could write:
sed -e '21,$d' <file1 >file2

In both of these cases, the delete command is restricted to a specific range of lines in the input file. In the first case, the lines deleted are in the range "from one to ten", and in the second case the range is "from 21 to the end of the file". The last line of the file is represented in a sed range by the character "$". This is not a regular expression here -- it is just a symbol for the end of the file.

But, sed also lets you process lines in a range defined by a beginning regular expression and an ending regular expression. For example:

sed -e '/fred/,/bill/d' <file1 >file2
That is, from the first line containing "fred" to the first line thereafter containing "bill". The regular expressions defining a range are surrounded by /'s.

String substitutions can also be restricted to ranges. For example, suppose you want to replace numbers with #'s -- but only on lines beginning with frank. You could do this:

sed -e '/^frank/s/[0-9]\+/####/g' <file1 >file2

It is also possible to delete all the lines in the sed input file. For example:

sed -e 'd'
The print command and the delete command can be combined on the sed command line to simulate grep:
sed -e '/fred/p' -e 'd'
Here, sed will print any line containing the word "fred" and will delete all the others -- doing the same job as grep fred.

Sed is so important in script writing that its man page deserves repeated viewing.

Iterating over Directories

Normally, when processing files, you want to skip over the directories -- the Windows "for" command in cmd.exe does this for you automatically, but in sh, bash, and ksh, you must do this yourself. Here's how:
for f in *
do
if [ ! -d "$f" ]
then
echo "$f" is not a directory
fi
done
Of course, if you leave off the not operator, !, then the loop would process only directories.

On Windows, here's how to operate on directories in cmd.exe:

FOR /D %variable IN (set) DO command [command-parameters]

Parsing lines of text in the interpreter

The bourne shell family of interpreters -- bash, sh, and ksh -- performs text parsing in several situations. These shells maintain an environment variable, IFS (the Internal Field Separator), which controls how lines are split up into words. Parsing occurs whenever you invoke a command or shell function, whenever you execute a "for" statement, and whenever you execute the "set" statement. The set statement's primary purpose is to let you override the script's command line options as the script executes, but it also allows you to parse strings as if they were command lines. It stores the parsed tokens in the standard variables: $1, $2, etc.
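Here's a minimal sketch of parsing with IFS and set; the colon-delimited record is made up for illustration:

#
# sh, bash, and ksh example
#
record="bob:x:1001"
oldIFS=$IFS
IFS=:
set -- $record
IFS=$oldIFS
echo "user $1 has id $3"
This prints: user bob has id 1001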

On Windows, cmd.exe doesn't actually parse in this same manner -- but the for command does have an option, /F, that lets you parse and split the lines in a file according to user-specified delimiters.

Parsing with these builtin features is clunky and requires a lot of practice, but knowing that such parsing is possible can eliminate the need to run separate programs. This can greatly speed up a script that does a lot of tinkering with text -- and it can make the script run fast enough that you don't feel the need to rewrite it as a program.

Toolset

The standard scripting helper programs are described below. Executables and source code for these programs can be found at sourceforge.net -- search for the programs individually or the package named gnuwin32 to get the whole bundle.

Microsoft provides a free package called Services For Unix (SFU) which contains these same commands (and does include ksh). The cygwin distribution contains bash as well as all the other programs mentioned here.

sh, bash, ksh, cmd.exe
( The bourne shell, bash (the Bourne again shell), the Korn shell, and the MS Windows command line interpreter)

These "shells" or command line interpreters exist primarily to allow programs to be executed with user defined arguments, but each in its own way has text processing features that eliminate the need to run extra programs to perform string manipulations.

The "sh" variants, sh, bash, and ksh are actually very powerful programming environments in their own right -- although they are a bit sluggish compared to a regular program -- or even Perl.

Note that while cmd.exe is not as powerful a command interpreter as the above, it does have some builtin string substitutions, described here, that can replace the uses of basename and dirname. See sh man page, ksh man page, bash man page, or cmd.exe man page.

ls, dir
Lists the names of files in a directory in various formats. If you are on Windows, the "dir" command has an option, /B, that lists just the names of the files in the directory with no other information; use that in place of ls in the examples above. See ls man page.
grep (Global Regular Expression Print)
This program lets you select lines from files that match complex patterns. After the shell, this is the single most useful program for scripting purposes. See gnu grep man page.
sed (Stream EDitor)
This program lets you perform very complex string replacements on lines in a file -- or on ranges of lines in a file. Sed documentation can be found at The sed Home Page. See gnu sed man page.

Note that the GNU version of sed can handle far longer lines in files than the standard unix version. You should get the GNU version from sourceforge.net.

Be advised that text files often have tabs in them, and it is rarely easy to write a sed expression that involves tabs. The best thing to do is to use the expand program, described below, to convert the tabs into spaces first (see the expand entry for a caveat about alignment).

tr
The tr command's primary function is to perform character-by-character substitutions on its input data and write the modified data to its standard output file. The command lets you specify character sets in the substitution logic. See gnu tr man page.
echo
Prints its command line arguments to stdout -- and, more importantly, because the shell expands wildcards on its command line, it can be used to list the names of files matching a pattern. This can be faster than ls but has simpler output capabilities. See echo man page.
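For example, this prints the names of all the .c files in the current directory on one line:

echo *.c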
find (Find files)
Prints the names of files in a directory tree that match a pattern. See find man page.

Note that when you use cmd.exe, the "for" command has the ability to do the same things:

for /R %v in ( patterns ) do echo %v
cut (Splits lines in a file into fields)
The cut command lets you select fields from the lines in a file and print them to stdout. You can split based on character positions or based on delimiter characters. See cut man page.
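For example, assuming a standard /etc/passwd file, this prints the login name (field 1) and shell (field 7) of each account:

cut -d: -f1,7 /etc/passwd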
csplit (Context Split)
Splits up a single file into many files based on sections defined by regular expressions. See csplit man page.
sort
Sorts, merges, and filters duplicate lines in a file. This command has many non-obvious uses. Can sort based on fields in the input data. Can treat fields as numbers and sort on them in numerical rather than character order. See sort man page.
uniq
Assumes its input file has been sorted, then filters either duplicates or non-duplicates depending on command line options. See uniq man page.
basename (remove the directory part of a pathname)
Prints the file basename part of its command line argument. Basename can also be used to strip the file name extension off of a filename -- leaving only the root part of the filename. When using cmd.exe, use its builtin string substitutions instead. See basename man page.
dirname (remove filename part of a pathname)
Strips the filename part off of its command line parameter -- leaving only the directory name. When using cmd.exe, use its builtin string substitutions instead. See dirname man page.
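For example, using the /tmp/one.txt path from earlier:

basename /tmp/one.txt
# prints: one.txt
basename /tmp/one.txt .txt
# prints: one
dirname /tmp/one.txt
# prints: /tmp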
expand (expand tabs into spaces)
The expand program reads its standard input file, expands the tabs in each line, then writes the expanded text to its standard output. It is often necessary to use expand as the first stage of a long pipeline in order for subsequent stages to work correctly -- tab characters are notoriously hard to pass on command lines. See expand man page.

Unfortunately, most text editors don't use tabs as a data compression technique -- instead they use them as a formatting technique. Thus, expanding tabs may mis-align the text compared to that which is seen in a text editor.

Also note the unexpand program, described below, which is used to put tabs into the file instead of taking them out.

unexpand (compress with tabs)
The unexpand program replaces leading blanks with tabs in groups of eight. There's rarely a reason to do this, but should you want to, this is how. See unexpand man page.
fold
The fold program splits the lines in its standard input file into multiple smaller lines as needed to fit them into a fixed width format. For example, you might use this to limit line length to 60 characters, or 160. The -s option allows you to cause the line splitting to occur on word boundaries. See fold man page.
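For example, to re-wrap a file so that no line exceeds 60 characters, breaking at word boundaries (notes.txt is a made-up name):

fold -s -w 60 <notes.txt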
xargs
This program reads words (not lines!) from stdin and assumes that each is meant to be used as a parameter to the program specified on the xargs command line. It then formats invocations of that program, using one or more words per call, so that eventually every word in xargs' standard input gets passed to the program. See xargs man page.

Here's an example:

#
# sh, bash, and ksh example
#
find . -name '*' -print | xargs echo
In this example, the find program walks the current directory and its subtrees, printing the names of all files to its standard output, which is redirected to the standard input of xargs. xargs then makes repeated calls to the echo program until every file name has been printed. Note that multiple files may be sent to echo in a single invocation: even though find produces one line per file, xargs packs as many file names into each echo command line as the operating system allows.

xargs does suffer from a serious flaw, though -- if your filenames have spaces in them, plain xargs splits them apart and the target program receives garbage arguments. One workaround is to write your own xargs-like program that puts quotes around the parameters before invoking the command. Other approaches are possible.
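In particular, if you have the GNU versions of find and xargs, the pair solves the problem together: find's -print0 option separates names with NUL characters instead of line breaks, and xargs' -0 option splits its input on NULs rather than whitespace:

#
# GNU find + xargs example
#
find . -print0 | xargs -0 echo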

On the other hand, if your filenames don't have spaces in them, xargs works great just like it is.

perl
Yuck. Read a book if you are interested. I won't mention it again (Well, maybe once: perl lets you replace strings which span the boundaries of lines in a text file.) See perl documents page.
Practice makes perfect!