lbsplit FAQ PAGE

Summary of topics discussed:

  What is lbsplit?
  You can do this with csplit, perl, and sed already. Why lbsplit?
  What tasks is lbsplit targeted for?
  Does lbsplit have program variables?
  How do I change the text in the section before it gets printed?
  How do regular expression substitutions work?
  How can I write a case insensitive regular expression?
  My regular expression text has /'s in it, how do I make that work?
  Can I use lbsplit to parse C code blocks? (NO)
  How do I parse a line's contents into variables?
  How do I detect and process multiple sections in a file or stream?
  What kind of sections can lbsplit detect?
  How come I only got the first section to print out?
  How can I insert line numbers, section counts, etc?
  How could I number the lines in a file?
  How could I simulate grep using lbsplit?
  How do I read from stdin?
  What kinds of text processing commands can I use?
  How to collapse entire sections into single lines?
  What kind of regular expression am I using?
  How do I select specific lines for textual transformations?
  How do I constrain substitutions to particular lines?
  How do I write conditional statements?
  How do I invert the logic of conditional statements?
  How do I delete, discard, or suppress a given line?
  How can I suppress the printing of empty or blank lines?
  How do I define repeated sections without cutting and pasting the text?
  How do I execute commands after the last line of the section?
  How do I execute commands before the first line of the section?
  How do I execute commands before the first section and after the last?
  How do I suppress the printing of the entire rest of the section?
  Why doesn't regex /./ match empty lines?
  How can I use variables in my scripts?
  How do I write "and" and "or" clauses using regular expressions?
  Why is my section only 1 line long?
  Why does my range condition action command apply to only 1 line?
  Why does my range condition include too many lines?
  How can I compute a range condition's regex?
  How can I quit processing the entire section?
  How can I filter out a section?
  How can I selectively filter out a section?
  How do I deal with columnar data?
  How do I expand tabs in my input data?
  How do I read a file and print it?
  How do I write while-loop statements?

What is lbsplit?

  - A stream editing program designed to detect blocks of lines and perform sed-like text editing commands on each. The grammar that lbsplit provides allows much simpler expression of commonly used text processing commands, because it lets you define variables and process them as if they were the current line of text.

You can do this with csplit, perl, and sed already. Why lbsplit?

  - csplit writes each block to a temporary file, and you can use simple sed scripts to process them separately, but this is very slow.
  - Perl and sed make you write complex scripting logic to detect the 3 kinds of blocks lbsplit already knows how to recognize. More work can be spent with these tools detecting the blocks than in actually performing the desired textual transformations.
  - lbsplit runs faster than perl in making these detections, and requires far less scripting to implement simple transformations.
  - lbsplit's command language is more declarative than procedural -- although a while command exists and sections are processed in sequence.
  - On RedHat Linux, grep is relatively slow on huge files. lbsplit (and sed) run a lot faster to do the same basic things. For small files, there is not much speed advantage over grep, but on big files the time difference can amount to several minutes (with either lbsplit or sed). There may not be as much speed advantage on other Linuxes.

What tasks is lbsplit targeted for?

  - Extracting text from log files and performing transformations on the individual blocks. For example: the valgrind program can automatically generate suppressions which look like this:

        {
          fun:f1
          fun:f2
          ...
        }

    This text is intermingled with all kinds of other logging information. lbsplit can be used to find all the {} pairs and print only them to stdout. And you can instruct it to replace the suppression-name placeholder with text which is specific to the block number. Here is the lbsplit invocation that does that:

        lbsplit -n file -S '{ /^{/,/^}/ / [...]

    [...]

        f.c:27: Error, std::vector<T>::operator[] (int) [with T = int]

    The "[with T = int]" part explains what "T" means in the rest of the line. In this simple example, the error message isn't awfully difficult to understand, but in practice the lines are very long and may have as many as 20 "TypeName = ActualTypeExpression" fields in them. It is far simpler to understand the real error if all the symbolic type names, the Ts in the above example, are replaced with the actual type expressions.

    Since lbsplit has regular expression parsing and variables as part of the language, the following snippet of lbsplit command language can be used to fix the above problems:

        w /\[with [^\]\+]$/
        {
          g/ \([a-zA-Z0-9_]\+\) *= \([^=\]*\)] *$/match|name|value;
          s/[, ]*\{match}$/]/g;
          s/\<\{name}\>/\{value}/g;
        }
        s/\[with *] *$//g;

    This fragment "greps" out the symbolic name "T" into variable "name", and the value of the type expression which T represents into variable "value". The entire "T = int" pattern is stored in variable "match". Next, it removes the matched pattern from the end of the line and finally replaces all references to the word "T" with the type expression, "int". This leaves, in the above example:

        f.c:27: Error, std::vector<int>::operator[] (int)

  - Generalized text reformatting. For lovers of columnar data, lbsplit provides left and right justification of text in lines as well as substitution of line contents with the contents of variables.

  - Making decisions about multi-line blocks. For example, suppose your output looks like this:

        ==3192== Valgrind output message
        ==3192== line 1 some stuff
        ==3192== line 2 more stuff including this FLAG value
        ...
    In this case, lbsplit lets you buffer up this whole mess into variables. Then, if one of the lines, such as line 2, contains a FLAG value, you can decide to discard the entire section, or print it, or perform a transformation on it, etc.

Does lbsplit have program variables?

  - Yes, but they are clunky. You are limited to the following actions:

      * store the current line in a variable ( |var|=; )
      * append the current line to a variable, with an intervening \n ( |var|+; )
      * execute command actions on variables as if they were the current line
          - regular expression substitutions ( |var|s/fred/bill; )
          - initialization ( |var| l/stuff/; )
          - special commands that only work on variables ( |var| x; )
      * assign variables from expanded constants, which lets you combine multiple variables with intervening constant text ( |var| l/ stuff \{var1} - more stuff \{var2} ... /; )
      * print variables ( |var|P; )
      * use variables in regex substitutions affecting the current line ( s/stuff\{var1}/ crap \{var2}\{var3} ... /g; )
      * compute variable names and assign the data to the computed name ( S/computedName-\{helper}/Some computed value - \{value}/; )
      * parse text using regular expressions to pick out fields ( g/regex/var1|var2|var3; )

    lbsplit is not a full scale programming language and does not try to solve all the textual scripting problems that you face. Instead it tries to do one thing well: find and perform simple transformations on blocks of text from log files.

    See comments about variables in other sections, below.

    Note: you can initialize variables in a prefix section, which is executed before the actual processing of your input stream begins. You can print them after the last line of the input stream is processed using a suffix section. These are defined on the command line like this:

        lbsplit -px "{prefixSection with Actions;}" -sx '{suffix section ...}'

How do I change the text in the section before it gets printed?
How do regular expression substitutions work?
How can I write a case insensitive regular expression?

  - The substitute command has this syntax:

        <condition>s<delim>pattern<delim>substitution<delim><options>

    Where:

    condition
        Optional. Selects which lines the substitution applies to. If no condition is specified, it applies to all lines. Conditions are defined like one of these:

            <delim>regex<delim>
            <lineNumber>
            <lineNumber>,<lineNumber>
            <lineNumber>,$

        Where <delim> is any of %, :, or /, and lineNumber is relative to the current section.

        NOTE: Before any range or condition, you can use the ! operator to invert the logic, so:

            !/fred/P;   means print a duplicate of any line that does not contain fred.
            !2d;        means delete all but line 2.
            !4,99P;     means print duplicates of all lines except lines 4-99.

    delim
        A character that defines a string boundary. You are limited to %, :, and / as delimiters.

    options
        Options can include the following characters:

            i 1 g

        "i" is optional. Either "1" or "g" is required.
        1 means replace only the first instance of the pattern with its substitute,
        g means perform a global replace,
        i means perform a case insensitive substitution.

    pattern
        Any regular expression -- sed style. It can contain escape characters like \r, \n, \t, \s, etc, and can contain the regex special characters \(, \|, and \).

        Regexes are too complicated to describe here, but there are unix man pages, and you can look for these specifics: extended regular expressions, SED regular expressions, ed regular expressions, etc. Perl regexes are more complex and not supported.

    substitution
        Any text. If you want to include parts of the matched pattern in the substitution, use \0 - \9. These strings refer to parts of the matched pattern.

        \0 refers to the entire match -- so if you want to put parentheses around some text, do this:

            s/fred/(\0)/g

        This will replace all instances of "fred" in the current line with "(fred)".

        \1-\9 refer to the sub-parts of the matched pattern which are identified with \(...\) groupings. Nested groups are possible, and the number identifies the \( group in the pattern.
        For example:

            \(fred\(bill\)\(tom\)\)

            \0 matches "fredbilltom" in the line
            \1 matches "fredbilltom" as well
            \2 matches "bill"
            \3 matches "tom"

        Note that you can use \| to mean OR. As in:

            \(a\|b\)

        This would leave \0 containing either a or b when used in the substitutions.

    Note that regexes and substitutions are "variable expanded" before use. Variable expansion just replaces text of the form "\{varname}" with the contents of that variable. Note that there is NO \ before the trailing }.

    Variables are assigned like this:

        |var|=;                              // store the current line in the variable
        |var|+;                              // append the current line to the variable with an intervening \n
        |var|s/regex/substitution/options;   // applies the substitution to the named variable
                                             // instead of the current line

    Note that to replace an empty variable with some string using the regex substitution, you can't just say:

        |var|s//stuff/g;

    You have to use this approach:

        |var|s/^.*$/stuff/g;

    This lets you replace any text in the variable with stuff -- even if there is none. Hopefully, you can use the "=" command instead, but it isn't always possible.

    Note: you can initialize variables in a prefix section, which is executed before the actual processing of your input stream begins. You can print them after the last line of the input stream is processed using a suffix section. These are defined on the command line like this:

        lbsplit -px "{prefixSection with Actions;}" -sx '{suffix section ...}'

My regular expression text has /'s in it, how do I make that work?

  - Two ways:

      * You can escape the slashes in the text, like this: \/
      * Or, you can use different delimiters for the regular expressions: /, :, or % can be used as the delimiter. Whichever one you start with defines the delimiter for the entire regular expression or string.
        For example, you could write the above invocation like this:

            lbsplit -n file -S '{ %^{%,%^}% [...]

    [...]

            [...] : | tr '|' '\n'

    Then you can extract a section, turn it into a single long line, use sed to make substitutions in that line, then use tr to split the lines back out.

What kind of regular expression am I using?

  - sed style. You can use \| to define regexes that match either one pattern or another. You use \( ... \) to encapsulate sub-expressions (rather than ( ) as is done with perl).

    CAVEATS: & is not interpreted on the right hand side of regular expression substitutions. Use \0 instead of &.

    Note: Regular expressions can have options: /fred/i matches "fred", "Fred", and "FRED".

How do I select specific lines for textual transformations?
How do I constrain substitutions to particular lines?
How do I write conditional statements?
How do I invert the logic of conditional statements?

  - Any command in a section can be preceded with one of the following:

        <lineNumber>
        <lineNumber>,<lineNumber>
        <lineNumber>,$
        /regex/
        !<lineNumber>
        !<lineNumber>,<lineNumber>
        !<lineNumber>,$
        !/regex/

    For example:

        2d;
        2,3s/fred/bill/1;
        :Tom:s/om/OM/g;
        !:frank:s/om/OM/g;

    The prefixes select only the specified line number, line range, or lines matching a specified regex. A leading ! inverts the selection.

How do I delete, discard, or suppress a given line?

  - Use the 'd' command -- which can be used conditionally, see above.

How can I suppress the printing of empty or blank lines?

  - Blank lines do not match the regex /./, so you can use this to detect and skip them. Use this command action:

        !/./ d;

How do I define repeated sections without cutting and pasting the text?

  - Sections defined with a trailing + are repeating sections. For example:

        {/begin/,/end/}+

    Note that normally, a repeating section should be the LAST section. However, if you define a section that does not have an end regex and is NOT marked as a while section (using the /w option on the begin regex), then you can have other sections defined after it. Test30 in the source distribution does this.
    Here is its input data set:

        Intro 1
        Intro 2
        Intro 3
        Page 1
        p1a
        p1b
        Page 2
        p2a
        p2b
        Page 3
        p3a
        p3b
        Trailer
        t1
        t2

    Here are the sections as defined by the command invocation below (section number, then line):

        1 Intro 1
        1 Intro 2
        1 Intro 3
        2 Page 1
        2 p1a
        2 p1b
        3 Page 2
        3 p2a
        3 p2b
        4 Page 3
        4 p3a
        4 p3b
        5 Trailer
        5 t1
        5 t2

    One of the command invocations for test 30 is:

        lbsplit -n tests/pageTest.txt -S \
            '{/Page/w! N;}' \
            '{/^Page/ N;}+' \
            '{/Trailer/ N; }'

How do I execute commands after the last line of the section?

  - The 'A' action expects another action as its argument and inserts that action into the suffix list of the current section. Suffix actions are executed after the section is finished.

How do I execute commands before the first line of the section?

  - The 'B' action expects another action as its argument and inserts that action into the prefix list of the current section. Prefix actions are executed before the first line of the section is processed. Note that you could implement this using the conditional action:

        1{prefix action list}

How do I suppress the printing of the entire rest of the section?

  - The 'q' command turns off the current line and all others in this instance of the current section. If the + operator is used on the section definition, the q command has no effect on future instances of this section.

Why doesn't regex /./ match empty lines?

  - It isn't supposed to. If you want to substitute something else for a truly blank line, you can use this:

        s/^$/OtherStuff/1;

    Alternatively, you could just use the l/OtherStuff/; command to force the line, whatever it contains, to be equal to OtherStuff. This is particularly helpful for variables whose values usually default to an empty string.

    If you want to perform a regex substitution on any line, even if it is empty, do this:

        s/^.*$/desired/1;

How can I use variables in my scripts?
  - The substitute command lets you use \{varname} syntax to specify a variable whose value will be inserted into either the regex or the substitution. Do not use a \ on the trailing }. This syntax can also be used in the regex (the left hand side of the substitution). The varname refers either to a program variable, or to an environment variable if no program variable by that name exists.

    Program variables are initialized like this:

        |varname|s/^.*$/SOME DEFAULT VALUE/g;   replace extant contents with new
        |var|l/SOME DEFAULT VALUE/;             load var with text
        |var|=;                                 load var with current line
        |var|+;                                 append current line to var with an intervening \n
        S/\{bill}/stuff-\{george}/;             store stuff-(the contents of george) into the
                                                variable whose name is found in bill

    Note that the entire syntax is required if you are defining a previously undefined variable.

    This syntax can be used in B actions so that you don't have to waste compute cycles on every line of the input file. Alternatively, you can detect text in the body of a section that you want to store in the variable, varname, and/or update it as the scripts run.

    Any action that modifies the current line can be applied to a variable, and some special actions can only be used with a variable.

    The command 'm' exists to let you expand variable names into their values without having to go through the substitution process. Essentially, the 'm' command maps the current line to the value of a variable whose name is specified on that line -- or an environment variable if none is found. This command is not overly useful, but you might need it.

    Note: you can initialize variables in a prefix section, which is executed before the actual processing of your input stream begins. You can print them after the last line of the input stream is processed using a suffix section.
    These are defined on the command line like this:

        lbsplit -px "{prefixSection with Actions;}" -sx '{suffix section ...}'

How do I write "and" and "or" clauses using regular expressions?

  - Boolean logic equates the expression

        A && B    with    !( !A || !B )

    It also equates

        A || B    with    !( !A && !B )

    When defining sub-ranges within a section over which to apply commands, you can use the following inside a section definition:

        !<begin>,<end>{ commands }

    This says to execute the commands if you are not in the range defined by <begin> to <end>.

    For an "AND" clause, you can do the following:

        /regex1/ /regex2/ commands

    which means to execute commands only if both regexes match the current line.

    "OR" clauses are implemented like this:

        /regex1\|regex2/ { commands }

    Given these linguistic features, and the boolean logic underlying your needs, it may or may not be possible to accomplish the filtering you wish to do.

    Note that the above applies to all regular expressions, but the following does not:

    When defining sections, you supply either one or two regular expressions which define the bounds of the section. To implement a selection between one section or another, use the '?' operator:

        ?{ { /s1/ ... } { /s2/ ... } .... }

    That is, you can specify more than one section in a group. lbsplit will choose and activate whichever section it encounters first. Note that this is strictly an "or" situation. You can't generally use nested sub-sections in lbsplit.

Why is my section only 1 line long?

  - The section selection regular expressions control this. If you use the same regular expression for the beginning and the ending of the section, then you will get a 1 line section:

        { /fred/,/fred/ ... }
        { /./,/./ ... }

  - Note that you can avoid this nuisance by adding the '>' option to the end regex for your section. That option requires that sections end on a different line than the one on which they begin (when a two regex section is defined).

        { /fred/,/fred/> ... }

Why does my range condition action command apply to only 1 line?
  - For the same reason as the above -- by default, both the begin and the end conditions are applied to the current line. If you want to guarantee that your condition range spans more than one line, use the '>' character at the end of the second regex of the range:

        {
          /fred/,/fred/>            # the section is more than 1 line long
          /tom/,/tom/> action;      # the range of tom actions is more than
                                    # one line long
        }

Why does my range condition include too many lines?

  - A range condition in a section selects a subset of the lines in a section for special processing. Consider:

        {
          /./,/end of file/
          /beginline/,/endline/ { cmds; }
        }

    Here, the entire file is defined as a single big section. Each subset of the lines in the file beginning with /beginline/ and ending with /endline/ will have the cmds applied to them. Note that this repeats for every such subset, indefinitely.

    Since your range conditional regexes can contain \{var} references, you could have the cmds in the range change the var to an un-matchable string, and thus you would only truly match on the first instance. For example:

        {
          /./,//
          B{ |var|l/beginline/; }
          /\{var}/,/endline/ { cmds ; |var|l/invalidstuff/; }
        }

    Here's how this section definition works:

      * The section begins at any non-blank line and ends when a line matching the end regex is found. Presumably no such line exists, so the section runs from the beginning of the file to the end thereof.
      * Before the section is executed, the variable, var, is initialized to "beginline".
      * As the section is processed, each line (of the section and thus the whole file) is compared against the range:

            /\{var}/     # will contain beginline the first time through
            /endline/

      * The first time that a line containing "beginline" is found, the sub-section defined by the range condition will become active and the range commands will be applied. In this case, the variable, var, will be changed from "beginline" to "invalidstuff".
        This will effectively eliminate the possibility that any other lines in the file will be affected by the range, and when this range ends (when endline is found), that will be the end of the range's use.

How can I compute a range condition's regex?

  - As just shown, a range condition within a section can be defined by a regular expression that contains a variable reference:

        {
          /./,//
          B{ |v| l/P/; }        << variable v gets a capital P;
          /\{v}/,/p/ { P; }     << only print lines in the range
          d;                    << delete all the rest
        }

How can I quit processing the entire section?

  - The q; command terminates the entire section -- immediately -- it does not wait until the proper end of the section is found.

How can I filter out a section?
How can I selectively filter out a section?

  - The easiest way is to use lbsplit without the -n option, then define the section you want to filter out and have it not print anything. For example:

        lbsplit somefile.txt -S '{ /begin/,/end/ d; }'

    This prints the entire file, somefile.txt, to stdout, except for the text between "begin" and "end", because all the lines in that section are deleted. Note that you can't use 'q' here, because only the 'begin' line would be deleted. The q command would terminate the section as soon as it was executed.

  - If you want to examine the section before deciding to filter it out entirely, you can use this trick:

      a. Don't just delete the lines in the section, as above, but also append them to a variable.
      b. If you decide to keep the section, print the variable.

    For example:

        lbsplit somefile.txt -S \
            '{ /begin/,/end/
               B{ |lines|l//; |keeplines|l//; }
               A{ |keeplines|/./ |lines|P; }
               |lines|+;
               /fred/|keeplines|l/keepit/;
               d;
             }'

    Here's why this works:

      1. Since the -n option is not used, all lines which are not part of a section are printed automatically.
      2. The section of interest is defined by a begin/end pair of lines.
      3. When the section is first entered, before any lines are processed, two variables are initialized to empty: "lines" and "keeplines".
         The "lines" variable will hold all the lines in the section. The "keeplines" variable will serve as a boolean flag meaning that we have decided to keep the section in the output. The "B" command defines the list of commands to be executed before the first line is processed.
      4. Each line is ultimately deleted from the output by the "d;" command that appears at the end of the section. Such commands must go at the end if you wish to do any other processing in the section.
      5. Each line of input text is appended to the "lines" variable with an intervening newline.
      6. As each line of the section is processed, it is compared against the regular expression, /fred/. If a match is found, the variable keeplines is modified to contain the text constant "keepit".
      7. The "A" command defines the behavior that occurs after the last line of the section is processed. Here, the "keeplines" variable is compared against the regular expression, /./, which just checks whether the variable is empty or not. If it is not empty, then the statement prints the contents of the lines variable.

How do I execute commands before the first section and after the last?

  - Prefix and suffix sections can be defined on the command line like this:

        -px '{ prefix section }' -sx '{ suffix section }'

    You can use '/./' as the section defining regexes. This is only really useful for printing things and for initializing variables.

How do I deal with columnar data?

  - The c command (cut) lets you select columns out of the current line and replace the line with the selection. For example, suppose the current line were this:

        0123456

    And a cut command action like this were used:

        c 1-3,7

    The line would be replaced with

        0126

    Note that you probably want to use the 't' command to make sure tabs get expanded before using cut.

How do I expand tabs in my input data?

  - The t; command expands tabs into spaces. The T; command compresses spaces into tabs.

How do I read a file and print it?
  - The r; command lets you read a file and print it instead of the current line. The r command has two forms:

        r;          uses the current line as the file name
        r file;     uses 'file' as the file name -- the text is expanded before use.

How do I parse a line's contents into variables?

  - The g/regex/varlist; command lets you parse the current line or variable into pieces, using a regular expression and a list of variables into which to place the regular expression match information. The variable list is a list of variables separated by '|'s. Here is an example invocation:

        g/.*/match;

    In this case, the variable match will be populated with the entire contents of the variable or line that is the context of the command. Here's another:

        g/[a-zA-Z_]\+/firstWord;

    In this case, the variable firstWord will be populated with the first word on the line. Here's another:

        g/Section: *\([^ ,]\+\),/wholeMatch|sectionName;

    In this case, variable wholeMatch will contain something like this:

        Section: 100,

    And variable sectionName will be populated with "100". Here's a more complex example:

        g/Section: *\([0-9]\+\)\.\([0-9]\+\)/whole|firstDigit|secondDigit;

    In this case, if the input data contains:

        Section: 1.2

    the variable, whole, will be populated with "Section: 1.2", the variable, firstDigit, will be populated with "1", and the variable, secondDigit, will be populated with "2".

How do I write while-loop statements?

  - While loops are limited to repeated processing of the same line. You can't use a while loop to process multiple lines. Here's the syntax for processing a line multiple times:

        w/regex/action

    Or

        w!/regex/action

    This command action is only useful if the action modifies the current line so that the while loop eventually terminates! Otherwise, the script will hang.
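
    The while-loop behaviour described above can be sketched outside lbsplit. The following Python is only an illustration under stated assumptions, not lbsplit code: it mimics w/regex/action by re-applying an action to a single line until the regex stops matching, using a compiler-message clean-up (expanding a trailing "[with T = int]" clause) as the action. The function name expand_with_clause and both patterns are made up for this sketch.

```python
import re

# Hypothetical helper, not part of lbsplit. It mimics `w/regex/action`:
# keep re-applying an action to ONE line while the regex still matches.

# Loop condition: the line ends with a non-empty "[with ...]" clause.
WITH_CLAUSE = re.compile(r"\[with [^\]]+\]$")

# Action, step 1: pick out the last "name = value" binding before the "]".
LAST_BINDING = re.compile(r"[, ]*\b([A-Za-z0-9_]+) *= *([^=\]]*)\] *$")


def expand_with_clause(line: str) -> str:
    while WITH_CLAUSE.search(line):
        m = LAST_BINDING.search(line)
        if m is None:
            # The action can no longer change the line; stop rather than hang.
            break
        name, value = m.group(1), m.group(2).strip()
        # Step 2: drop the binding from the end of the clause.
        line = line[: m.start()] + "]"
        # Step 3: replace every whole-word use of the name with its value.
        line = re.sub(rf"\b{re.escape(name)}\b", lambda _m: value, line)
    # Finally remove the now-empty "[with]" clause, if one is left.
    return re.sub(r" *\[with *\] *$", "", line)


if __name__ == "__main__":
    print(expand_with_clause(
        "f.c:27: Error, std::vector<T>::operator[] (int) [with T = int]"
    ))
```

    Each pass removes one binding, so the loop condition eventually fails; the explicit break covers the case the FAQ warns about, where the action stops modifying the line and the loop would otherwise never end.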