Quick Jumps
defining sections
text transformations
conditionals
variables
syntax
commands
example applications

lbsplit


lbsplit is a program used in command line text processing scripts to extract text sections from log files and to perform transformations on that text using sed-like commands. lbsplit is thus similar to csplit, but should run faster since the sections are not necessarily written out to intermediate files.

The script language provided by lbsplit is deliberately terse so as to allow the development of many scripts using just the command line and the shell's command line recall and editing features. The language itself does support comments, but these are mainly useful for scripts that are stored in files.

lbsplit's command language is more declarative than procedural. The commands are generally of the form:

When you see this, do that.
However, the order of execution matters and there is a while loop, so the language is not strictly declarative.

Why use lbsplit?

Other languages, such as perl, sed, and python, allow you to write the same kinds of scripts as lbsplit, so why bother with a new language?

While full scale programming languages let you do the same things that lbsplit lets you do (and of course many more), it is possible to write many useful scripts more quickly and more surely using lbsplit's domain-specific language. lbsplit provides a limited command language specific to finding text sections and performing regular expression substitutions on them.

It has been said of regular expressions that if you need regular expressions to solve your problem, you actually have two problems.

Regular expressions are horrible to look at, but they really are not horrible to learn to create. There are many tutorials for regular expressions available on the internet -- a google search for "regular expression tutorial" should provide a long list of introductions to the subject.

lbsplit uses grep style "basic regular expressions". There are also internet articles discussing the difference between basic and extended regular expressions.

Performance-wise, lbsplit is approximately equivalent to sed.

In addition to regular expression transformations on the text in sections, lbsplit provides a number of other commands and features, described in the sections below.

lbsplit has pre-coded algorithms for detecting text sections so the user's job is to:

  1. decide which kinds of sections occur in the data stream
  2. create section definitions
  3. within each section, include text transformation commands

One could of course do all of this and more using csplit combined with sed, bash, or perl. lbsplit, however, can perform all its operations in memory if desired -- so it is faster than combining csplit with sed or bash. The algorithms for selecting sections are pre-coded in lbsplit, so many simple transformations are easier to specify with lbsplit than with perl or sed.


How does lbsplit work?

By default, lbsplit will copy its input files to stdout. However, if you define textual sections, either on the command line or in a script file, those sections can be modified or deleted before printing.

In its most common usage, lbsplit will be invoked with the -n option which will instruct it not to print any text which is not part of a user defined section.

For example:


    lbsplit file.txt                     # prints file.txt to stdout

    lbsplit -n file.txt                  # prints nothing to stdout

    lbsplit -n *.txt -S '{ /./,/./ }'    # prints ONLY the first non-blank line 
					 # found in any .txt files in the current 
					 # directory.  Note: the files are all
					 # concatenated into a single stream,
					 # so at most one line will be printed
					 # (total) by this command

    lbsplit file -F script.lbs           # processes file using the script 
					 # found in the file "script.lbs"

    lbsplit file -F - <<EOF              # uppercase the parts of file that
    { /begin/,/end/                      # lie between the first line that 
      y/a-z/A-Z/;                        # contains "begin" and the first line
    }                                    # thereafter that contains "end",
    EOF

    lbsplit - -FH 3 3<<EOF               # read the input from stdin 
    { /^.*$/w                            # and read the script from file handle 3
      commands applied to the whole      # as defined by the shell invoking lbsplit
      file.
    }
    EOF



Note that if a section definition uses the F; command action, lbsplit will write the contents of that section to a file whose name is computed at run time. This feature allows lbsplit to operate similarly to the csplit command. For example:


    lbsplit -prefix /fred/yy -n *.txt -S '{ /begin/,/end/ F; }+'

In this case, begin/end sections found in *.txt will be written to files named like the following:
    ~: ls -l fred
    total 47
    -rw-r--r-- 1 user grp 1 2004-03-13 18:02 yy00000000
    -rw-r--r-- 1 user grp 1 2004-03-13 18:02 yy00000001
    -rw-r--r-- 1 user grp 1 2004-03-13 18:02 yy00000002
    ...

The number of digits in the generated filename is controlled with the -N command line option.

When sections are written to files, they still get processed using the section's command actions -- including discarding them with the d; and q; commands.
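
As a hedged sketch combining these pieces (the file names, prefix, and trigger text are hypothetical, and the ./parts directory is assumed to exist):

    # write each begin/end section to ./parts/sec0000, ./parts/sec0001, ...
    # using 4-digit numbering; any section containing SKIP is discarded
    # via q; as described in the note above
    lbsplit -prefix ./parts/sec -N 4 -n input.log \
        -S '{ /begin/,/end/  F;  /SKIP/ q; }+'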


Section Types


Defining individual sections

lbsplit has an internal state machine that can detect and process the following kinds of sections:
  1. Sections beginning with a line that matches one regular expression and ending with a line that matches a second regular expression. For example, the command
    
    	   lbsplit -n file.txt -S '{ /begin/,/end/ ... }'
    
           
    Will recognize this kind of section:
    
    	       text begin section 1
    	       .....
    	       other text end section 1
    
           
  2. Sections which are defined by a beginning regular expression, but which have no particular ending regular expression. In this case, a second instance of the beginning regular expression (or the end of the input stream) is deemed to end the section. For example, this section definition:
    
    	   lbsplit -n file.txt -S '{ /Page/ } {/Page/ ...} '
    
           
    When applied to this input data:
    
    	   Page 1
    	     lines
    	     on  page
    	     one
    	   Page 2
    	     lines on page 2
    
           
    Will select the following for the first section:
    
               Page 1
                 lines
                 on  page
                 one
    
           
    If there had been NO next section, then the first /Page/ section would have included everything in the file starting with the first Page line!

  3. Sections consisting of a contiguous block of lines that all match the same regular expression (tabular data). For example:
    
    	   lbsplit -n file.txt -S '{ /a/w  }'
    
           
    When applied to this data stream:
    
    	   stuff
    	   a line
    	   one at a time
    	   final
    	   but not
    	   these
           
    Will select:
    
               a line
               one at a time
               final
           
    Because all these lines contain the letter a.

    Note that this is the kind of section that can be used to process the entire file in a single section. For example:

    
    	    {  /^.*$/w
    
    	       # print every line in the file
    
    	    }
    
           
  4. A selector section lets you define multiple sections, and lbsplit will dynamically select and activate the first one that it encounters in the input stream. For example (a filled-in sketch follows this skeleton):
    
    	   lbsplit -n file.txt -S '?{ 
    					 { /a/w           ... }
    					 { /begin/,/end/  ... }
    					 { /Page/         ... }
    					 ...
    					}'
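
    A filled-in version of that skeleton, with P; as the action in each member
    and a hypothetical file name, might look like the following. Whichever
    member section's starting condition appears first in the input is the one
    that gets activated.

    	   lbsplit -n report.txt -S '?{
    					 { /a/w           P; }
    					 { /begin/,/end/  P; }
    					 { /Page/         P; }
    					}'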
    
           

Defining multiple sections

When lbsplit is looking for a section, it processes the lines of text which are not part of the section using the default logic. The default behavior is to print the lines to stdout (or not depending on the -n option).

A given input stream might have multiple sections of interest. To deal with this, simply add multiple section definitions. Lines occurring before, between, and after the sections will be processed using the default logic. Here are several ways of saying the same thing:


	lbsplit -n someFile.txt  -S '{ /first/,/last/ ... } { /secondStart/ ...}'
    OR
	lbsplit -n someFile.txt  -S '
	{ /first/,/last/ ... } 
	{ /secondStart/ ...}
        '
    OR
	lbsplit -n someFile.txt  -S '{ /first/,/last/ ... }' \
				    '{ /secondStart/ ... }'
    OR
	lbsplit -n someFile.txt  -F - <<EOF
	{ /first/,/last/ 
	  ...
        } 
	{ /secondStart/ 
	  ...
        }
	EOF
    OR
	lbsplit -n someFile.txt  -FH 3 3<<EOF
	{ /first/,/last/ 
	  ...
        } 
	{ /secondStart/ 
	  ...
        }
	EOF

All of the above invocations are equivalent.

Note that sections which have only 1 regular expression defining their bounds, and which don't employ the 'w' operator, and which are followed by another section, will end when the next section starts. So, if the input text looks like this:


    line1
    line2
    line3
    Weird stuff
      w1
      w2
    Rest
      r1
      ...

If you wish to uppercase only the Weird stuff, write the definitions like this:

    lbsplit file.txt -S '{ /Weird/  y/a-z/A-Z/; }'  '{/Rest/}'

Here, lbsplit will copy file.txt to stdout in its entirety. However, it will detect the section beginning with Weird and ending at, but not including, the Rest line; those lines will be uppercased.

Lines before Weird, and the lines from Rest onward, will simply be copied to stdout unchanged.
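
With the input above, the output would look something like the following sketch (only the Weird block is changed):

    line1
    line2
    line3
    WEIRD STUFF
      W1
      W2
    Rest
      r1
      ...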

Defining repeated sections

In addition to defining multiple sections of different types, as above, it is also possible to define repeated sections. A repeated section is defined by appending a trailing repeat count after the last } in the section. Repeat counts can be either explicit numbers, like 12, or the + character, which means "infinite repetitions". For example:

    lbsplit -n somefile -S '{ /begin/,/end/ ... }12  {/other/ ... }+'

Certain combinations of repeated sections won't play together nicely. These are detected by the lbsplit parser and an explanatory error message is produced.

Prefix and Suffix sections

lbsplit has command line options that let you define a prefix and a suffix section. These sections can be used to print extraneous text, but are mainly meant to let you initialize variables which are shared across all sections. For example:

    lbsplit -n -px '{...}' -sx '{...}' somefile -S ...

Since all variables are initialized to the empty string, there is no reason to have a prefix section just for that. But if other initial values are needed, use a section defined something like this:

    ... -px '{/./  |varname|l/VALUE/; }' ...

The regular expression is not important in this case, so /./ will do fine.

The variable, varname, is being initialized with VALUE. The 'l' or "load" command is the easiest way to populate variables from a script.
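
As a hedged sketch of this technique (the file name, variable name, and value are hypothetical):

    # initialize |tag| once in the prefix section, then prepend its value
    # to every line of every Page section
    lbsplit -n report.txt \
        -px '{ /./  |tag|l/RUN-42/; }' \
        -S  '{ /^Page/  s/.*/\{tag}: \0/1; }+'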


Text Transformation Command Overview

The unix csplit command has the ability to extract text sections. lbsplit adds the ability to perform a variety of transformations on the text in a section so as to avoid having to launch sed or a bash script on each section.

The following basic types of transformations are available; each is discussed in the paragraphs below.

Textual Substitutions on section lines

A sed-like regular expression substitution command is the most powerful tool in the lbsplit arsenal. It can be used in a section definition like this:

    lbsplit file -S  '{ /^Page/   s/fred/bill/g; }+'

This particular invocation would copy 'file' to stdout and would replace the string 'fred' with 'bill' on every line after (and including) the first line it finds that has 'Page' starting in column 1.

If the + operator were left off the end of the section definition, the substitution would only occur between the first line beginning with "Page" and the next line beginning with "Page".

The regular expression substitution language works like sed's regular expression behavior. See the man page for ex, ed, or sed, for details.

Note that escape sequences such as \t (the tab character) can be used on both the left and right hand sides of the substitute command; the match references \0 through \9 are available on the right hand side.

One difference between lbsplit and sed is that in the right hand side, the & character is not recognized as referring to the entire matched pattern. For that, you must use "\0".

Another important difference is that lbsplit has named variables. The syntax,


    \{varname}

can be used in the left and right hand sides of the substitution to include variables. Variables are described later but can either be set by the section commands themselves or can be inherited from the environment.

Note: The substitute command in lbsplit only allows '1' or 'g' to follow the trailing / in the right hand side of the command. This means that you are restricted to substituting either 1 or all of the left hand sides with the right hand side.
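
For example, here is a hedged sketch that uses both of these features -- a named variable in the pattern and the 'g' option (the variable name is hypothetical and is assumed to have been loaded earlier in the script):

    # replace every occurrence of whatever text is stored in |old|
    # with the literal word REDACTED, on each line of the section
    ... -S '{ /begin/,/end/  s/\{old}/REDACTED/g; }'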

Translating Character Sets

Uppercasing and other character set translations can be accomplished using the y command:

    ... -S '{ /^Page/    y/a-z/A-Z/; }'

Here, everything between and including the first instance of Page and either the end of the file or the start of the next section will be uppercased.

Tabs

lbsplit provides two commands that manipulate tabs in the lines being processed:

    t;                     # expands tabs into spaces
    T;                     # compresses leading spaces into tabs

Columns

lbsplit provides a "cut" command which allows columns from the lines being processed to be selected. The cut command is used something like this:

    c 1,3,9-12,4;  

Here, everything in the current line or variable will be deleted except for the contents of columns 1, 3, 9-12, and 4. Note that column 4 will now appear at the end instead of between 3 and 9.
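
As a fuller hedged sketch (the file name and column positions are hypothetical):

    # keep only a date field in columns 1-10 and a message field in
    # columns 25-80 of every line in the file
    lbsplit -n app.log -S '{ /^.*$/w  c 1-10,25-80; }'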

Automatic Numbering

lbsplit provides four automatic numbering commands: f; (input file name and line number), I; (line number within the entire input stream), n; (line number within the section), and N; (section number). In all cases, the information is prepended to the current line or variable and is followed by a tab before the original line's content. With the f; command, a tab also separates the filename from the line number.

The substitute command can be used to move the number to some other part of the line (or variable). Note that the string "\t" can be used to refer to the tab.
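
As a hedged sketch of that technique (the section bounds are placeholders):

    # n; prefixes each line with its number within the section and a tab;
    # the substitute then moves that number to the end, in parentheses
    ... -S '{ /begin/,/end/  n;  s/^\([0-9]*\)\t\(.*\)$/\2 (\1)/1; }'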

Text Justification

lbsplit provides two text justification commands, j and J. These commands treat the current line (or variable) as a single string and pad it to bring its width up to a minimum of 'number'. The lowercase 'j' command adds spaces to the right, and the uppercase 'J' command adds them to the left. For example:

    lbsplit -n file -S '{ /fred/,/bill/   J40; }'

This script adds leading blanks to every line in the section bounded by fred and bill so that each line is at least 40 characters wide.

Loading a line or variable with text

Obviously, the substitute command can be used to replace the current line or variable with text. For example:

    ... -S '{ /fred/,/bill/  s/.*/SOME TEXT/1; }'

However, this approach has some annoyances. A simpler (though not simple) command is as follows:

    ... -S '{ /fred/,/bill/  l/SOME TEXT/; }'

Here, the current line is replaced with SOME TEXT, regardless of what is in it.

End of Line Handling

lbsplit allows the end of line to be handled differently in each section. Normally, output lines are followed by '\n' to separate the lines. However, there is a command that lets the end of line string be user defined on a per-section basis:

    lbsplit -n file -S '{ /fred/,/bill/  E/-glarf\n/; }'

Here, all the text between fred and bill (inclusive) will be printed to stdout and -glarf will be added to the end of each line.

Note that you must include the \n if you use the E command unless you want all the text in the section to join together into a single long line -- and there is occasionally a use for this behavior.
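
For example, a hedged sketch that deliberately joins a section into one line (the file name and markers are hypothetical):

    # join every line between BEGIN and END into one long comma-separated
    # line by using "," as the end of line string
    lbsplit -n data.txt -S '{ /BEGIN/,/END/  E/,/; }'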


Conditional Execution

lbsplit does not provide an if or for statement, but it does allow the execution of command actions to be conditioned on the contents of the section -- and on the contents of variables. There are several mechanisms for doing this:

Here are some examples:

   ... -S '{ /x/,/y/

	     # note you can put end of line comments here

	     1,13 { d; } # discard lines 1 through 13

	     95 q;  # quit processing this whole section
		    # if we get to line 95

	     18,$ { P; d; }  # print and delete everything
			     # on and after line 18

	     /fred/,/tom/ d; # delete the lines between fred and tom

	     /./ P;  # print the current line or variable if it is not empty

	     |var|/./{ actions; } # execute actions if the variable is not
				  # empty

	     w/regex/action # while regex is true of the current line,
			    # execute the action

	     w!/regex/action # while not true...

	   }+'
    

Range conditionals

The range conditions, shown in the above example, span multiple lines if they have two conditions: a begin line condition and an end line condition.

Once activated, the actions in the conditional are applied to all lines until the de-activation condition occurs. The actions are also applied to the line that triggers the de-activation, so you might need to use "not" clauses, discussed below, to prevent unwanted actions on that last line.

And clauses in conditions

Conditionals can be combined to achieve an "and" clause:

   ... -S '{ /x/,/y/

	     1 /fred/ d;  # if fred appears on line 1, delete it

	     2,14 |var|/tom/ q;   # if on lines 2 - 14, the variable,
				  # "var" contains "tom", quit 
				  # processing this entire section
           }+'
    
The "not"operator is supported on range conditionals:

   ... -S '{ /x/,/y/

	     !1 { d; }            # delete all lines but 1

	     !2,14 { d; }         # delete all lines but 2-14

	     !/fred/ s/a/A/;      # on all lines but those containing
				  # fred, convert a to A.
           }+'
    

If statement (approximation)

The easiest way to simulate an if-statement is to use a one line regex range and perform actions if the regex matches. For example:

   ... -S '{ /x/,/y/

	     /regex/ { "if-statement is true about current line , do stuff" }

	     |var|/./ { "var is not empty here, you must have put stuff in it" 
			" thus simulating a boolean variable"
		      }

           }+'
    


Variables


Variable Syntax

Variables are strings with names, and they are shared across all sections in a script. Variables are used in three principal ways:
  1. They are substituted into strings whenever \{varname} is found (most places anyway). For example:
    
          s/from this/to \{varname}/g;
    
          /\{varname}/  d;   # if this line contains the specified variable's
    			 # contents, then delete it.
    
    
  2. Variables can be populated by extracting (parsing) them from a line of text:
    
          g/someRegex/<varlist>
    
    
    Where the varlist is a list of variable names separated by | characters (for example, wholeMatch|middle). This basically lets you use regular expressions to parse the current line or variable into parts (stored in named variables). The variable assignments work like this:
    1. The first variable gets the contents of the entire text from the current line or variable that matches the regular expressions.
    2. Each remaining variable is either cleared or set to the part of the line that matched a parenthetical sub-expression. For example, suppose the regex looked like this:
      leading part \( \(part one\) part 2 \) more \( part three \)
      In this case, the first variable would get the whole match:
      leading part part one part 2 more part three
      And the next variable would get the entirety of the parenthetical group containing a sub-group:
      part one part 2
      And the third variable would get only the contents that matched first nested sub-expression of that group:
      part one
      Nothing would get "more"

      And the fourth variable would get
      part three
  3. They serve as the "current line" in command action contexts. Some command actions only apply to variables:
    
          |varname|=;      # set the variable named "varname"
    		       # to the current line's contents.
    
          |bill| |sue|=;   # make sue equal bill
    
          |hank|+;         # append the current line to hank
    		       # with a leading \n.
    
          |fred|P;         # print the contents of fred with
    		       # current end of line sequence
    
      |var| { ...; }   # apply a sequence of commands using
    		       # var as the current line instead
    		       # of using the current input line.
    
          |z|l//;          # clear variable z;
    
          |secno|{l//;N;}  # store the current section number
    		       # followed by a tab in variable
    		       # secno.
    
          m;               # replace the current line or variable
    		       # with the contents of the variable
    		       # found in the current line or variable.
    
    

Variable Uses

Variables are primarily useful for storing text for later use. For example, you can grab parts of a line and repeat them again later in the same section or another:

   ... -S '{ /x/,/y/

	     /^Title/ |savedTitle|=;  # record whole title line

	     |savedTitle|s/^Title//1; # remove the word title from
				      # the variable.

	     s/^.*$/\{savedTitle}\t\0/1; # prepend the title to each
					 # subsequent line
	  }'

	 '{ /other/,/stuff/
	    s/.*/\{savedTitle}: \0/1; # title saved in earlier section
	  }'

This is particularly helpful in suppressing the printing of sections whose contents have offending data. For example:

   ... -S '{ /^Page/

	     B { |var|l//;}    # before the section starts, clear the variable
	     A { |var|/./ P;}  # if the variable is not empty at end of section
			       # then print it.

	     |var|+; # append each line of the section to the variable
		     # with appropriate line separators

	     /trigger/{ |var|l//; q;}  # if the line contains 'trigger'
				       # then do NOT print this section.
				       # set var to empty and exit the section
				       # which leave nothing for the after
				       # clause, defined with A above, to do.
		     

	     d;      # suppress printing of this section
	  }+'




Syntax


Command line syntax

Command line formats fall into one of the following forms:

    lbsplit [options] [files] -S sectionDef ...
    
    lbsplit [options] [files] -F scriptFile

    lbsplit [options] [files] -FH <number> <number><<EOF
    ...
    EOF

Note that the file named '-' refers to stdin. The scriptFile can also be specified as '-' meaning stdin. Only one use of '-' on the command line is allowed.

Note also that the -FH option requires that the shell invoking lbsplit open a file (or here-document) containing the script, and that the descriptor number of that open file be passed as the argument to -FH. Only the Bourne family of shells supports this capability.
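
For example, here is a sketch of supplying the script on file descriptor 3 from a Bourne-style shell (the file names and script body are placeholders):

    # here-document form: the shell attaches the script to descriptor 3
    lbsplit -n input.txt -FH 3 3<<EOF
    { /begin/,/end/  y/a-z/A-Z/; }
    EOF

    # the same idea, reading the script from a file opened on descriptor 3
    lbsplit -n input.txt -FH 3 3< myscript.lbs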

Options are as follows:

-n
Suppress the default output of lines not part of sections.
-N d
When automatically generating numbers use 'd' as the format width.
-sx s
Use s as the suffix section for the entire run.
-px p
Use p as the prefix section for the entire run.
-prefix p
Use p as the filename prefix when outputting file using the F; command action.
-v
Print the program version information.
-d
Print debugging information.
-D
Print more debugging information than -d

Language Grammar

Approximate BNF grammar for the lbsplit script language:

    script          :=  {  sectionDef  }

    sectionDef      := '{' boundingRegexes {action} '}' [repetitions]

    boundingRegexes :=  regex [reOpts] [ ',' regex ] [reOpts]

    regex           :=  { '/', ':', '%' } RE { '/', ':', '%' } 

    RE              :=  "a sed compatible regular expression" 

    reOpts          :=  { 'i' | 'w' | '!' }

    action          :=  [ varRef ] actionCommand ';'

    varRef          := '|' varName '|'

    varName         :=  "the name of a variable"

    actionCommand   :=  "see table below"
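
To relate the grammar to a concrete script, here is a hedged breakdown of one section definition (the regexes, variable name, and actions are placeholders):

    { /begin/i,/end/  |var|s/a/b/g;  P;  }12

    '{'               -- start of the sectionDef
    /begin/i,/end/    -- boundingRegexes ('i' is an reOpt on the first regex)
    |var|s/a/b/g;     -- an action: varRef |var| plus an actionCommand and ';'
    P;                -- an action with no varRef
    '}' 12            -- end of the sectionDef plus its repetitions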




Commands

{ commands }
A group of command actions, all of which execute when the group is executed.
|var|+;
Append the current line to the variable named "var" with a line separator in between.
|var|=;
Replace the variable named "var" with the contents of the current line or variable.
A action;
Define the 'after' behavior for this section. When the section terminates, for whatever reason, the action gets executed. See the '{' for action groups. The most common use for this feature is the printing of variables initialized during the section.

For before and after sections for the entire file, see the -px and -sx options

B action;
Define the 'before' behavior for this section. Before the first line of the section is processed using the normal command actions in this section, the before action gets executed (with "" as the current line text). See the '{' action for groups of actions to be performed. The primary reason for using this command is to initialize variables.

For before and after sections for the entire file, see the -px and -sx options

c cutset;
Cut out all parts of the current line or variable which are not specified in the cutset. The cutset is a string of the form:
1-3,4,9-20,5,...
That is, it is a list of column references. All parts of the current line or variable not listed will be removed. The remainders will be concatenated together into a single string and left as the current line or variable.
d;
Stop processing this line -- and don't print it using the default end of line string for this section. This command is often used with a conditional, or it is used unconditionally at the end of the action list to merely suppress output of this entire section.
E/str/;
Specify the end of line string for this section. Normally the end of line string is just \n. However, if you want to join all the lines in the section together, you can set the end of line string to be empty, or to some other characters (for example "," or "|"). Note that instead of using / as the delimiter, you could also use '%' or ':'.
F;
Force the output of this section to go to a new file name which is based on the current section number. The location of the output file is affected by the -prefix command line option.
f;
Prefix the current line or variable with the input file name and the line number within that file, separated by a tab, followed by a tab before the original content.
g/regex/<varlist>;
Get variables from the current line or variable context using regular expressions to parse the text. Here's how to populate a variable with the text in a line that matches a given regular expression:

    g/x.*y/wholeMatch;

	 
If either the line or variable context contains the substring xsomethingy, the variable, wholeMatch, will contain "xsomethingy". Otherwise, it will contain "". If the regex has \( ... \) in it, then other variables may be populated. For example:

    g/x\(.*\)y/wholeMatch|middle;

         
In this case, wholeMatch, will get "xsomethingy", and variable middle, will get "something". Assuming the same data as before, of course.

The regular expression language allows multiple and nested \( ... \) groups. The assignment of text from the match to the variables is done in a strictly left-to-right fashion, with the first variable getting the whole match (as if an outer \(...\) group enclosed the whole expression). After that, the first \( found in the expression goes with the second variable name, and so on.

Variables which get no data are filled with "".

I;
Prefix the current line or variable with the current line number within the entire input stream.
j number;
J number;
Left or right justify the current line or variable. 'j' left justifies within a field of spaces specified by the number. 'J' right justifies within the field.
l/txt/;
Load the current line or variable with the specified text. The text is variable expanded before use. The '%' and the ':' characters can be used instead of the '/' character if so desired.
m;
Map the current line or variable's contents to the contents of the variable whose name is stored in the line.
n;
Prefix the current line or variable's contents with the line number within the section and a tab.
N;
Prefix the current line or variable's contents with the current section number and a tab.
p/txt/;
Print the specified text after first variable expanding it. Use the current end of line string for this section.
P;
Print the current line or variable's contents followed by the current end of line sequence. This behavior naturally occurs anyway at the end of the actions for the section. The reason the command exists is so that you can print the text only under certain circumstances and, by default, print nothing (via the d; command).
r;
r file;
Replace the current line with the contents of a file. If the file is specified in the command, use the variable expanded form of the file name as the file to read. If not, read the file whose name is the entire current line.

As usual, the single character file name, -, refers to stdin. (A sketch combining r with p appears after this command list.)

s/lhs/rhs/o;
Substitute 1 or more instances of the regular expression, lhs, with the variable expanded form of rhs obeying options o. The regular expression is compatible with sed, not egrep or perl. You use \|, \(, \), to get access to those regular expression features.

The rhs string cannot use the & operator from sed, but you can use \0 which does the same thing, and \1 - \9 to handle matching sub-string replacements. Please consult the unix man page for regexp, for details.

The options are '1' (replace only the first match on the line) and 'g' (replace all matches), as noted earlier.

As usual, the delimiter, /, can be replaced with either %, or :.

t;
Expand tabs in the current line or variable.
T;
Compress leading blanks into tabs in the current line or variable.
y/set1/set2/;
Translate characters from set1 to characters in set2. Note that the delimiter can also be % or :, not just /.

Here is an example translation that makes all the characters uppercase:

y/a-z/A-Z/;
w /regex/ action;
w!/regex/ action;
While the regex is true of the current line, execute the action(s). If the ! operator is specified, then execute the action(s) while the regex is not true of the current line.
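
Before moving on to the full examples below, here is a hedged sketch that ties several of the commands above together (the file name and regex are hypothetical, and the exact interaction of r; with multi-line file contents is assumed):

    # for each line of filelist.txt that names a .txt file, print a banner
    # built from the |fname| variable, then replace the line with that
    # file's contents before it is printed
    lbsplit -n filelist.txt -S '{ /\.txt$/w
                                   |fname|=;               # remember the file name
                                   p/==== \{fname} ====/;  # print a banner line
                                   r;                      # read that file in place
                                 }+'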


Example Application Walkthroughs

The following examples show how to use lbsplit in realistic programming examples.

Extracting Valgrind Suppressions

lbsplit can be used to extract valgrind's automatically generated suppression statements from the torrent of messages valgrind produces as it runs.

Valgrind is a diagnostic tool that can detect memory mis-uses in a program under development. With a listing of said mis-uses, the program's quality can be improved by fixing the code to eliminate the mistakes.

However, in large programs, some mis-uses are harmless, even if repeated frequently. Valgrind's utility diminishes if its output is filled with ignorable items. Valgrind provides a command line option, --gen-suppressions, that produces "suppressions" which can then be fed back into valgrind during subsequent debugging sessions to eliminate each individual memory misuse report.

However, these suppressions are entangled with lots of other output from valgrind and have to be hand edited to create a proper suppressions file for use by subsequent valgrind runs. lbsplit can extract the generated suppressions automatically.

Here is a snippet from a valgrind output:

==18301== Memcheck, a memory error detector.
...
==18301== Using valgrind-3.3.0-Debian, a dynamic binary instrumentation framework.
==18301== 
==18301== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 8 from 1)
...
==18301== checked 183,984 bytes.
==18301== 
==18301== 64 bytes in 2 blocks are definitely lost in loss record 4 of 6
==18301==    at 0x4C22FAB: malloc (vg_replace_malloc.c:207)
==18301==    by 0x40CAE8: regex_compile (in /home/lboggs/projects/lbsplit/lbsplit)
==18301==    by 0x417416: regcomp (in /home/lboggs/projects/lbsplit/lbsplit)
==18301==    by 0x40A645: compile_helper(char const*, re_pattern_buffer*) (simple_regex.cpp:73)
==18301==    by 0x40AB91: SimpleRegex::SimpleRegex(std::string const&) (simple_regex.cpp:101)
==18301==    by 0x404301: Section::Section(std::string const&) (section.h:81)
==18301==    by 0x40311B: recordSection(std::string const&) (lbsplit.cpp:239)
==18301==    by 0x4034E6: main (lbsplit.cpp:78)
{
   <insert a suppression name here>
   Memcheck:Leak
   fun:malloc
   fun:regex_compile
   fun:regcomp
   fun:_Z14compile_helperPKcP17re_pattern_buffer
   fun:_ZN11SimpleRegexC1ERKSs
   fun:_ZN7SectionC1ERKSs
   fun:_Z13recordSectionRKSs
   fun:main
}
==18301== 
==18301== 772 bytes in 1 blocks are definitely lost in loss record 6 of 6
==18301==    at 0x4C23809: operator new(unsigned long) (vg_replace_malloc.c:230)
==18301==    by 0x40310A: recordSection(std::string const&) (lbsplit.cpp:239)
==18301==    by 0x4034E6: main (lbsplit.cpp:78)
{
   <insert a suppression name here>
   Memcheck:Leak
   fun:_Znwm
   fun:_Z13recordSectionRKSs
   fun:_static_initialization_
}
==18301== 
...

The valgrind suppressions are the blocks of text beginning with a { line and ending with a } line. An lbsplit script to extract them and automatically insert a unique suppression name is shown here:

lbsplit -n vg.log  -F - <<EOF

{ /^{/,/^}/

  #
  #  Process sections from valgrind log files
  #  that contain automatically generated suppressions
  #

  /<insert/{
	     #
	     # The second line of the suppression section
	     # is a note telling you to insert a unique
	     # name for this suppression
	     #
	     s/.*//1;       # delete the note
	     N;             # insert the section number
	     s/.*/   L\0/1; # prefix it with L to make it a name
	     s/\t *$//g;    # remove trailing tab
	   }

  # let the lines in the suppression print as normal
}+

EOF
Ignoring comments and blank lines, a total of 6 statements are required.

And here is an example output from the above script:

{
   L1	
   Memcheck:Leak
   fun:malloc
   fun:regex_compile
   fun:regcomp
   fun:_Z14compile_helperPKcP17re_pattern_buffer
   fun:_ZN11SimpleRegexC1ERKSs
   fun:_ZN7SectionC1ERKSs
   fun:_Z13recordSectionRKSs
   fun:main
}
{
   L2	
   Memcheck:Leak
   fun:_Znwm
   fun:_Z13recordSectionRKSs
   fun:_static_initialization_
}

Discarding Uninteresting Sections

Continuing the theme of detecting valgrind suppressions, let us detect interesting suppressions and ignore those that are not interesting. Practically speaking, the interesting sections might be discarded, or they might be kept -- depending on the situation.

In the previous example, only Memcheck:Leak suppressions are shown, but in practice, many different kinds of valgrind messages occur. While it is desirable to correct all program mistakes, sometimes it isn't practical to prevent them all. We might choose to ignore certain memory leaks and focus on others.

The Valgrind log prints a stack trace of the function calls that led to the program bug. One-time leaks are probably not interesting; they often occur during static initialization, either of the program as a whole or when shared libraries (DLLs) are loaded. Valgrind stack traces can usually indicate the presence of static or DLL initialization by the inclusion of the string _static_init somewhere in the trace. For example:

    
    {
      SomeError:type
      func1
      caller1
      caller2
      _static_initialization_0
    }

We definitely want to suppress this kind of memory leak when running Valgrind, but otherwise we almost surely want to fix the leaks.

So, we basically want to modify the above example so that it ONLY prints suppressions for leaks containing a line with the string _static_init on it. Here's how:

lbsplit -n vg.log  -F - <<EOF

{ /^{/,/^}/

  #
  # process valgrind suppressions
  #

  /^{/ |save|+;    # save the first line for output

  /<insert/,$       
  {
      #
      # only if there is an <insert do we save the rest of the section
      #
      |save|+;  
  }

  /_static_/{ |doit|=; }  # only if we have a static init do we trigger output

  d; # turn off normal printing of this section

  A{ |doit|/./|save|P;   # after the section, print it if we are supposed to.
     |save|l//;          # clear the variables for the next section.
     |doit|l//; 
   }
}+
EOF

Cleaning up g++ error messages

The problem

The g++ compiler produces error messages of the following form when describing errors using templates:

file.cpp:412 bits/basic_string.h:504: note:             \
   std::basic_string<_CharT, _Traits, _Alloc>&          \
   std::basic_string<_CharT, _Traits, _Alloc>::         \
   operator=(const _CharT*)                             \
   [with _CharT = char,                                 \
   _Traits = std::char_traits<char>,                    \
   _Alloc = std::allocator<char>] <near match>

The backslashes indicate line continuation. In practice, g++ produces all of this output as one long line.

In addition to the line length annoyances, which can't be fixed, the error message has two basic problems:

  1. std::basic_string was most likely not the string used by the developer when writing the program. std::string most likely was.
  2. The above error message is expressed in terms of the template's symbolic names for its parameters -- not the actual types filled in for them. This can be useful if you are debugging the template itself, but if you are a normal user of an existing template, it only confuses the issue -- especially since the _Traits and _Alloc parameters are generally defaulted, or hidden in typedefs. The normal developer wants to see std::string, not the fully exploded text of this completely defaulted type.

What most people, then, would like to see instead of the above error message would be something more like this:


file.cpp:412 bits/basic_string.h:504: note: std::string& std::string:: operator=(const char*)   

Of course, there's no guarantee that message will fit on one line in a text edit session.

Steps toward a solution

Luckily, the compiler does provide all the needed information on every line. That information is found in the text that looks like this:

   [with SymName = typeExpression, SymName2 = typeExpression2, ... ]

Cleaning up the error message lines then consists of two parts:
  1. replace all references on the line to the various SymNames with the corresponding typeExpression.
  2. replace common patterns in the result with their standard forms. For example:

    Standard Form       Actual template signature
    std::string         std::basic_string<char, std::char_traits<char>, std::allocator<char> >
    std::wstring        std::basic_string<wchar_t, std::char_traits<wchar_t>, std::allocator<wchar_t> >
    std::vector<T>      std::vector<T, std::allocator<T> >
Here is an lbsplit code fragment that repeatedly substitutes the [with clause fragments back into the body of the line:

    /\[with /
    {
      # get rid of some unhelpful explanatory text that occasionally
      # complicates the error messages

      s/  *<near match> *$//g;   
      s/<](.*/>]/g;              

      # now repeatedly substitute the typeExpressions for their
      # symbolic names

      w /\[with [^\]\+]$/
      {
	  # parse the end of the [with statement into variables
	  # ...  ,  = ]

	  g/ \([a-zA-Z0-9_]\+\) *= \([^=\]*\)] *$/match|name|value;
      
	  #p/MATCH=\{match}\nNAME=\{name}\nVALUE=\{value}/;
      
	  s/[, ]*\{match}$/]/g;
      
	  s/\<\{name}\>/\{value}/g;
      }
      
      s/\[with *] *$//g;  # get rid of the final [with ]

    }

The above code fragment will transform the original problematic line of text, shown above, into this:

file.cpp:412 bits/basic_string.h:504: note:                               \
   std::basic_string<char, std::char_traits<char>, std::allocator<char>>& \
   std::basic_string<_CharT, _Traits, _Alloc>::                           \
   operator=(const char*)

Again, the backslashes imply line continuation, and, in fact, all this text comes out on a single long line.

In this particular example, and indeed with most references to the basic_string template, we will want to see std::string.

You can't, however, go around converting all basic_string template references into std::string, because it is possible, theoretically at least, that the signature won't exactly match the standard pattern. A programmer could define a proprietary string class that uses the basic_string template but specifies a non-standard allocator or traits class. In that case, the error message cleanup algorithm we are creating should not blindly convert these special cases into std::string!

You can get away with a substitution like this, most of the time:

    s/std::basic_string<char, std::char_traits<char>, std::allocator<char> *>/std::string/g;

This fully converts the basic_string reference to its expected form:

file.cpp:412 bits/basic_string.h:504: note: std::string& std::string::operator=(const char*)

But this only works because std::string isn't flexible -- it must be basic_string<char>. You can't do a similar substitution to solve the problem for most templates -- which really are used as templates, with varying parameters.

And the rest

Is left as an exercise for the user. Sorry, this is an example, not a product.