Quick Jumps defining sections text transformations conditionals variables syntax commands example applications |
The script language provided by lbsplit is deliberately terse so as to allow the development of many scripts using just the command line and the shell's command line re-call and editing features. The language itself does support comments, but this is mainly useful for scripts that are stored in files.
lbsplit's command language is more declarative than procedural. The commands are generally of the form:
When you see this, do that.However, the order of execution matters and there is a while loop, so the language is not strictly declarative.
While full scale programming languages let you do the same things that lbsplit lets you do (and of course many more), it is possible to write many useful scripts more quickly and more surely using lbsplit's doman specific language. lbsplit provides a limited command language specific to finding text sections and performing regular expression substitutions on them.
It has been said about regular expressions, that if you need regular expressions to solve your problem, you actually have two problems:
lbsplit uses grep style "basic regular expressions". There are also internet articles discussing the difference between basic and extended regular expressions.
Performance wise, lbsplit is approximately equivalent to sed.
In addition to regular expression transformations on the text in sections, lbsplit provides a few other commands and features:
lbsplit has pre-coded algorithms for detecting text sections so the user's job is to:
One could of course do all of this and more using csplit combined with sed, bash, or perl. lbsplit, however, can perform all its operations in memory if desired -- so it is faster than combining csplit with sed or bash. The algorithms for selecting sections is pre-coded in lbsplit, so it will be easier to specify many simple transformations using lbsplit than with perl or sed.
How does lbsplit work? |
top |
In its most common usage, lbsplit will be invoked with the -n option which will instruct it not to print any text which is not part of a user defined section.
For example:
lbsplit file.txt # prints file.txt to stdout
lbsplit -n file.txt # prints nothing to stdout
lbsplit -n *.txt -S '{ /./,/./ }' # prints ONLY the first non-blank line
# found in any .txt files in the current
# directory. Note: they are all
# concatenated together into a single
# stream by this command, at most one
# line will be printed (total) by this
# command
lbsplit file -F script.lbs # processes file using the script
# found in file "lbscript.lbs"
lbsplit file -F - <<EOF # uppercase the parts of file that
{ /begin/,/end/ # lie between the first line that
y/a-z/A-Z/; # contains "begin" and the first line
} # thereafter that contains "end",
EOF
lbsplit - -FH 3 3<<EOF # read the input from stdin
{ /^.*$/w # and read the script from file handle 3
commands applied to the whole # as defined by the shell invoking lbsplit
file.
}
EOF
Note that if a section definition uses the F; command action, lbsplit will
write the contents of that section to a file whose name is computed at run
time. This option allows lbsplit to operate similarly to the csplit command.
For example:
lbsplit -prefix /fred/yy -n *.txt -S '{ /begin/,/end/ F; }+'
In this case, begin/end sections found in *.txt will be written to files
named like the following:
~: ls -l fred total 47 -rw-r--r-- 1 user grp 1 2004-03-13 18:02 yy00000000 -rw-r--r-- 1 user grp 1 2004-03-13 18:02 yy00000001 -rw-r--r-- 1 user grp 1 2004-03-13 18:02 yy00000002 ...The number of digits in the generated filename is controlled with the -N command line option.
When sections are written to files, they still get processed using the section's command actions -- including discarding them with the d; and q; commands.
Section Types |
top |
lbsplit -n file.txt -S '{ /begin/,/end/ ... }'Whill recognize this kind of section:
text begin section 1 ..... other text end section 1
lbsplit -n file.txt -S '{ /Page/ } {/Page/ ...} 'When applied to this input data:
Page 1 lines on page one Page 2 lines on page 2Will select the following for the first section:
Page 1 lines on page oneIf there had been NO next section, then the first /Page/ section would have included everying in the file starting with the first Page line!
lbsplit -n file.txt -S '{ /a/w }'When applied to this data stream:
stuff a line one at a time final but not theseWill select:
a line one at a time finalBecause all these lines contain the letter a.
Note that this is the kind of section that can be used to process the entire file in a single section. For example:
{ /^.*$/w # print every line in the file }
lbsplit -n file.txt -S '?{ { /a/w ... } { /begin/,/end/ ... } { /Page/ ... } ... }'
A given input stream might have multiple sections of interest. To deal with, simply
add multiple section definitions. Lines occuring before, between, and after the
sections will be processed using the default logic. Here are several ways of
saying the same thing:
lbsplit -n someFile.txt -S '{ /first/,/last/ ... } { /secondStart/ ...}'
OR
lbsplit -n someFile.txt -S '
{ /first/,/last/ ... }
{ /secondStart/ ...}
'
OR
lbsplit -n someFile.txt -S '{ /first/,/last/ ... }' \
'{ /secondStart/ ... }'
OR
lbsplit -n someFile.txt -F - <<EOF
{ /first/,/last/
...
}
{ /secondStart/
...
}
EOF
OR
lbsplit -n someFile.txt -FH 3<<EOF
{ /first/,/last/
...
}
{ /secondStart/
...
}
EOF
All of the above invocations are equivalent.
Note that sections which have only 1 regular expression defining their bounds, and which don't employ the 'w' operator, and which are followed by another section, will end when the next section starts. So, if the input text looks like this:
line1 line2 line3 Weird stuff w1 w2 Rest r1 ...And you wish to only upper case the Weird stuff, write the definitions like this:
lbsplit file.txt -S '{ /Weird/ y/a-z/A-Z/; }' '{/Rest/}'Here, lbsplit will copy the file.txt to stdout in its entireity. However, will detect the section begining with Weird and ending with Rest, but not including Rest. Those lines will be uppercased.
Lines before Weird, and the lines starting with Rest will simply be copied to stdout.
lbsplit -n somefile -S '{ /begin/,/end/ ... }12 {/other/ ... }+'Certain combinations of repeated sections won't play together nicely. These are detected by the lbsplit parser and an explanatory error message is produced.
lbsplit -n -px '{...}' -sx '{...}' somefile -S ...Since all variables are initially initialized to empty string, there is no reason to have a prefix section just for that. But if other variables are needed, use a section defined something like this:
... -px '{/./ |varname|l/VALUE/; }' ...The regular expression is not important in this case, so /./ will do fine.
The variable, varname, is being initialized with VALUE. The 'l' or "load" command is the easiest way to populate variables from a script.
Text Transformation Command Overview |
top |
The following basic types of transformations are available and are discussed in the following paragraphs:
lbsplit file -S '{ /^Page/ s/fred/bill/g; }+'This particular invocation would copy 'file' to stdout and would replace the string 'fred' with 'bill' on every line after (and including) the first line it finds that has 'Page' starting in column 1.
If the + operator were left off the end of the section definition, the subsitution would only occur beteen the first line beginning with "Page" and the next line beginning with it.
The regular expression substitution language works like sed's regular expression behavior. See the man page for ex, ed, or sed, for details.
Note that the following escape characters can be used on both the left and right hand sides of the substitute command:
Here are some common octal character representations:
\040 space \011 tab \012 new line \015 carriage return \0200 First non-ascii character \0377 last 1 byte non-ascii character
One difference between lbsplit and sed is that in the right hand side, the & character is not recognized as referring to the entire matched pattern. For that, you must use "\0".
Another important difference is that lbsplit has named variables. The syntax,
\{varname}
can be used in the left and right hand sides of the substitution to include variables.
Variables are described later but can either be set by the section commands themselves
or can be inherited from the environment.
Note: The substitute command in lbsplit only allows '1' or 'g' to follow the trailing / in the right hand side of the command. This means that you are restricted to substituting either 1 or all of the left hand sides with the right hand side.
... -S '{ /^Page/ y/a-z/A-Z/;' }'Here, everything between and including the first instance of Page and either the end of the file, the next section, will be uppercased.
t; # expands tabs into spaces T; # compresses leading spaces with tabs
c 1,3,9-12,4;Here, everything in the current line or variable will be deleted except for the contents of columns 1, 3, 9-12, and 4. Note that column 4 will now appear at the end instead of between 3 and 9.
The substitute command can be used to move the number to some other part of the line (or variable). Note that the string "\t" can be used to refer to the tab.
lbsplit -n file -S '{ /fred/,/bill J40; }'This script ensures that leading blanks are added to every line in the section bounded by fred and bill so that they are at least 40 characters wide.
... -S '{ /fred/,/bill s/.*/SOME TEXT/1; }'However, this approach has some annoyances:
... -S '{ /fred/,/bill l/SOME TEXT/; }'Here, the current line replaced with SOME TEXT, regardless of what is in it.
lbsplit -n file -S '{ /fred/,/bill E/-glarf\n/; }'Here, all the text between fred and bill (inclusive) will be printed to stdout and -glarf will be added to the end of each line.
Note that the you must include the \n if you use the E command unless you want to all the text in the section to join together in a single long line -- and there is occassionally a use for this behavior.
Conditional Execution |
top |
... -S '{ /x/,/y/ # note you can put end of line comments here 1,13 { d; } # discard lines 1 through 13 95 q; # quit processing this whole section # if we get to line 95 18,$ { P; d; } # print and delete everything # on and after line 18 /fred/,/tom/ d; # delete the lines between fred and tom /./ P; # print the current line or variable if is not empty |var|/./{ actions; } # execute actions if the variable is not # empty w/regex/action # while regex is true of the current line, # execute the action w!/regex/action # while not true... }+'
Once activated, the actions in the conditional are applied to all lines until the de-activation condition occurs. The actions are applied to that line, though, so you might need to use "not" clauses, discussed below, in your actions to preven unwanted actions on the last line...
... -S '{ /x/,/y/ 1 /fred/ d; # if fred appears on line 1, delete it 2,14 |var|/tom/ q; # if on lines 2 - 14, the variable, # "var" contains "tom", quit # processing this entire section }+'The "not"operator is supported on range conditionals:
... -S '{ /x/,/y/ !1 { d; } # delete all lines but 1 !2,14 { d; } # delete all lines but 2-14 !/fred/ s/a/A/; # on all lines but those containing # fred, convert a to A. }+'
... -S '{ /x/,/y/ /regex/ { "if-statement is true about current line , do stuff" } |var|/./ { "var is not empty here, you must have put stuff in it" " thus simulating a boolean variable" } }+'
Variables |
top |
s/from this/to \{varname}/g; /\{varname}/ d; # if this line contains the specified variable's # contents, then delete it.
g/someRegex/<varlist>Where the varlist is defined like one of the following:
leading part \( \(part one\) part 2 \) more \( part three \)In this case, the first variable would get the whole match:
leading part part one part 2 more part threeAnd the next variable would get the entireity of the parenthetical group containing a sub-group:
part one part 2And the third variable would get only the contents that matched first nested sub-expression of that group:
part oneNothing would get "more"
part three
|varname|=; # set the variable named "varname" # to the current line's contents. |bill| |sue|=; # make sue equal bill |hank|+; # append the current line to hank # with a leading \n. |fred|P; # print the contents of fred with # current end of line sequence |var| { ...; } # apply a sequence commands using # var as the current line instead # of using the current input line. |z|l//; # clear variable z; |secno|{l//;N;} # store the current section number # followed by a tab in variable # secno. m; # replace the current line or variable # with the contents of the variable # found in the current line or variable.
... -S '{ /x/,/y/ /^Title/ |savedTitle|=; # record whole title line |savedTitle|s/^Title//1; # remove the word title from # the variable. s/^.*$/\{savedTitle}\t\0/1; # prepend the title to each # subsequent line }' '{ /other/,/stuff/ s/.*/\{savedTitle}: \0/1; # title saved in earlier section }'This is particularly helpful in suppressing the printing of sections whose contents have offending data. For example:
... -S '{ /^Page/ B { |var|l//;} # before the section starts, clear the variable A { |var|/./ P;} # if the variable is not empty at end of section # then print it. |var|+; # append each line of the section to the variable # with appropriate line separators /trigger/{ |var|l//; q;} # if the line contains 'trigger' # then do NOT print this section. # set var to empty and exit the section # which leave nothing for the after # clause, defined with A above, to do. d; # suppress printing of this section }+'
Syntax |
top |
lbsplit [options] [files] -S sectionDef ... lbsplit [options] [files] -F scriptFile lbsplit [options] [files] -FH <number> <number><<EOF ... EOFNote that the file named '-' refers to stdin. The scriptFile can also be specified as '-' meaning stdin. Only one use of '-' on the command line is allowed.
Note also that the -FH option requires that the shell that invokes lbsplit open a file containing the script and the -FH option requires the file number of that opened file. Only the bourne family of shells support this capability.
Options are as follows:
script := { sectionDef } sectionDef := '{' boundingRegexes {action} '}' [repetitions] boundingRegexes := regex [reOpts] [ ',' regex ] [reOpts] regex := { '/', ':', '%' } RE { '/', ':', '%' } RE := "a sed compatible regular expression" reOpts := { 'i' | 'w' | '!' } action := [ varRef ] actionCommand ';' varRef := '|' varName '|' varName := "the name of a variable" actionCommand := "see table below"
Commands |
top |
For before and after sections for the entire file, see the -px and -sx options
For before and after sections for the entire file, see the -px and -sx options
1-3,4,9-20,5,...That is, it is a list of column references. All parts of the current line or variable not listed will be removed. The remainders will be concatenated together into a single string and left as the current line or variable.
g/x.*y/wholeMatch;If either the line or variable context contains the substring xsomethingy, the variable, wholeMatch, will contain "xsomethingy". Otherwise, it will contain, "". If the regex has \( ... \) in it then other variables may be populated. for example:
g/x\(.*\)y/wholeMatch|middle;In this case, wholeMatch, will get "xsomethingy", and variable middle, will get "something". Assuming the same data as before, of course.
Regular expression language allows multiple and nested \( ... \) groups. The assignment of text from the match to the variables is done in a strictly left-to-right fashion, with the first variable, getting the whole match (pretending that an outer \(...\) group enclosed the whole expression. After that, the first \( found in the expression goes with the second variable names.
Variables which which get no data are filled with "".
As usual, the single character file name, -, refers to the stdin.
The rhs string cannot use the & operator from sed, but you can use \0 which does the same thing, and \1 - \9 to handle matching sub-string replacements. Please consult the unix man page for regexp, for details.
The options are as follow:
As usual, the delimiter, /, can be replaced with either %, or :.
Here is an example translation that makes all the characters uppercase:
y/a-z/A-Z/;
Example Application Walkthroughs |
top |
Valgrind is a diagnostic tool that can detect memory mis-uses in a a program under development. With a listing of said mis-uses, the program's quality can be improved by fixing the code to eliminate the mistakes.
However, in large programs, some mis-uses are harmless, even if repeated a lot. Valgrind's utility diminishes if its output is filled with items that are ignorable. Valgrind provides a command line option, --gen-suppressions that produce "suppressions" which can then be fed back into valgrind during a subsequent debugging sessions to eliminate each individual memory misuse report.
However, these suppressions are entangled with lots of other output from valgrind and have to be hand edited to create a proper suppressions file use by subsequent valgrind runs. lbsplit can automatically extract the generated suppressions.
Here is a snippet from a valgrind output:
==18301== Memcheck, a memory error detector. ... ==18301== Using valgrind-3.3.0-Debian, a dynamic binary instrumentation framework. ==18301== ==18301== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 8 from 1) ... ==18301== checked 183,984 bytes. ==18301== ==18301== 64 bytes in 2 blocks are definitely lost in loss record 4 of 6 ==18301== at 0x4C22FAB: malloc (vg_replace_malloc.c:207) ==18301== by 0x40CAE8: regex_compile (in /home/lboggs/projects/lbsplit/lbsplit) ==18301== by 0x417416: regcomp (in /home/lboggs/projects/lbsplit/lbsplit) ==18301== by 0x40A645: compile_helper(char const*, re_pattern_buffer*) (simple_regex.cpp:73) ==18301== by 0x40AB91: SimpleRegex::SimpleRegex(std::string const&) (simple_regex.cpp:101) ==18301== by 0x404301: Section::Section(std::string const&) (section.h:81) ==18301== by 0x40311B: recordSection(std::string const&) (lbsplit.cpp:239) ==18301== by 0x4034E6: main (lbsplit.cpp:78) { <insert a suppression name here> Memcheck:Leak fun:malloc fun:regex_compile fun:regcomp fun:_Z14compile_helperPKcP17re_pattern_buffer fun:_ZN11SimpleRegexC1ERKSs fun:_ZN7SectionC1ERKSs fun:_Z13recordSectionRKSs fun:main } ==18301== ==18301== 772 bytes in 1 blocks are definitely lost in loss record 6 of 6 ==18301== at 0x4C23809: operator new(unsigned long) (vg_replace_malloc.c:230) ==18301== by 0x40310A: recordSection(std::string const&) (lbsplit.cpp:239) ==18301== by 0x4034E6: main (lbsplit.cpp:78) { <insert a suppression name here> Memcheck:Leak fun:_Znwm fun:_Z13recordSectionRKSs fun:_static_initialization_ } ==18301== ...
The valgrind suppressions are the lines of text beginning with { and ending
with }. An lbsplit script to extract them and automatically insert a unique
suppression name is shown here:
lbsplit -n vg.log -F - <<EOF
{ /^{/,/^}/
#
# Process sections from valgrind log files
# that contain automatically generated suppressions
#
/<insert/{
#
# The second line of the suppression section
# is a note telling you to insert a unique
# name for this suppression
#
s/.*//1; # delete the note
N; # insert the section number
s/.*/ L\0/1; # prefix it with L to make it a name
s/\t *$//g; # remove trailing tab
}
# let the lines in the suppression print as normal
}+
EOF
Ignoring comments and blank lines, a total of 6 statements
are required.
And here is an example output from the above script:
{
L1
Memcheck:Leak
fun:malloc
fun:regex_compile
fun:regcomp
fun:_Z14compile_helperPKcP17re_pattern_buffer
fun:_ZN11SimpleRegexC1ERKSs
fun:_ZN7SectionC1ERKSs
fun:_Z13recordSectionRKSs
fun:main
}
{
L2
Memcheck:Leak
fun:_Znwm
fun:_Z13recordSectionRKSs
fun:_static_initialization_
}
In the previous example, only Memcheck:Leak suppressions are shown, but in practice, many different kinds of valgrind messages occur. While it is desirable to correct all program mistakes, sometimes it isn't practical to prevent them all. We might choose to ignore certain memory leaks and focus on others.
The Valgrind log prints a stack trace of of the function call that led to the program bug. One time leaks are probably not interesting. One time leaks often occur during static intitialization either of the program as a whole, or when shared libraries (DLLs) are loaded. Valgrind stack traces can usually indicate the presense of static or DLL initialization by the inclusion of the string _static_init somewhere in a stack trace. For example:
{ SomeError:type func1 caller1 callter2 _static_initialization_0 }We definitely want to suppress this kind of memory leak when running Valgrind, but otherwise we almost surely want to fix the leaks.
So, we basically want to modify the above example so that it ONLY prints
suppressions for leaks containing a line with the string _static_init on it.
Here's how:
lbsplit -n vg.log -F - <<EOF
{ /^{/,/^}/
#
# process valgrind suppressions
#
/^{/ |save|+; # save the first line for output
/<insert/,$
{
#
# only if there is an <insert do we save the rest of the section
#
|save|+;
}
/_static_/{ |doit|=; } # only if we have a static init do we trigger output
d; # turn off normal printing of this section
A{ |doit|/./|save|P; # after the section, print it if we are supposed to.
|save|l//; # clear the variables for the next section.
|doit|l//;
}
}+
EOF
file.cpp:412 bits/basic_string.h:504: note: \ std::basic_string<_CharT, _Traits, _Alloc>& \ std::basic_string<_CharT, _Traits, _Alloc>:: \ operator=(const _CharT*) \ [with _CharT = char, \ _Traits = std::char_traits<char>, \ _Alloc = std::allocator<char>] <near match>The backslashes indicate line continuation. In practice, g++ produces all this output into one long line.
In addition to the line length annoyances, which can't be fixed, the error message has two basic problems:
What most people, then, would like to see instead of the above error message would be something more like this:
file.cpp:412 bits/basic_string.h:504: note: std::string& std::string:: operator=(const char*)Of course, there's no guarantee that message will fit on one line in a text edit session.
[with SymName = typeExpression, SymName2 = typeExpression2, ... ]Cleaning up the error message lines then consists of two parts:
Standard Form | Actual template signature |
---|---|
std::string | std::basic_string<char, std::char_traits<char>, std::allocator<char> > |
std::wstring | std::basic_string<wchar_t, std::char_traits<wchar_t>, std::allocator<wchar_t> > |
std::vector<T> | std::vector<T, std::allocator<T> > |
/\[with / { # get rid of some unhelp explanatory text that occaisionally # complicates the error messages s/ *<near match> *$//g; s/<](.*/>]/g; # now repleatedly substitute the typeExpressions for their # symbolic names w /\[with [^\]\+]$/ { # parse the end of the [with statement into variables # ... ,The above code fragment will transform the original problematic line of text, see above, into this this:= ] g/ \([a-zA-Z0-9_]\+\) *= \([^=\]*\)] *$/match|name|value; #p/MATCH=\{match}\nNAME=\{name}\nVALUE=\{value}/; s/[, ]*\{match}$/]/g; s/\<\{name}\>/\{value}/g; } s/\[with *] *$//g; # get rid of the final [with ] }
file.cpp:412 bits/basic_string.h:504: note: \ std::basic_string<char, std::char_traits<char>, std::allocator<char>>& \ std::basic_string<_CharT, _Traits, _Alloc>:: \ operator=(const char*)Again, the backslashes imply line continuation, and, in fact, all this text comes out on a single long line.
In this particular example, and indeed with most references to the basic_string template, we will won't to see std::string.
But, you can't, however, go around converting all basic_string template references into standard string because it is possible, theoretically at least, that the signature won't exactly match the standard pattern. A programmer could define a propriatry string class that uses the basic_string template, but specifies a non-standard allocator or traits class. In this case, the error message cleanup algorithm we are creating should not blindly convert these special cases into std::string!
You can get away with a substitution like this, most of the time:
file.cpp:412 bits/basic_string.h:504: note: std::string& std::string::operator=(const char*)But this only works because std::string isn't flexible -- it must be basic_string<char>. You can't do a similar substitution to solve the problem of most templates -- which really are used like templates.