lbsplit v1.1.2                                                          lbsplit


NAME
    lbsplit - Lowell Boggs' file splitter

ABSTRACT

    Split files into sections and perform translations on each based on user
    defined scripts which are very similar to sed scripts -- except that
    they support named variables, text justification, etc.

    See FAQ.txt and manual.html for more examples, longer discussions, etc.

SYNOPSIS

    lbsplit [processingOptions] filename ... [sectionOptions]

    "processingOptions" modify the behavior of the program with respect
    to all sections.

	-n                            suppresses the printing of text
				      which is not part of a section

	-f prefix                     specifies the prefix part of the
				      filenames that MIGHT be used when
				      processing sections.

	-N digits                     specifies the number of digits to
				      generate when numbering output files.

	-d                            turn on debug outputs -- bulky ugly
				      stuff, useful for diagnosing your
				      section definitions.

	-D                            print extra debugging information, like
				      variable contents at the end of execution.
				      Changes some output formats.

	-v                            print the lbsplit version info.

	-px prefixSection             define a prefix section's actions (mainly
				      useful for initializing variables and
				      printing things).  For example:

					-px '{/./  |var|l/stuff/;}'

				      This section loads variable, var, with 
				      "stuff" before any sections are processed.  
				      Variables default to empty strings, so 
				      only use this for non-null 
                                      initializations.

	-sx suffixSection             define a suffix section's actions (mainly useful
				      for final printouts).  Note that to check for 
				      a variable being populated, use this:

					-sx '{/./   |var|/./ { P; } }'

				      This example prints the variable only if
				      it is not empty at the end of a run.

    "sectionOptions" end the list of file names and start the list
    of section descriptors.

	-S [sectionDescription ...]   indicates the descriptions are
				      specified on the command line

	-F sectionFileName            indicates the descriptions are
				      found in specified file

	-FH [0-9]                     specifies that the section descriptions
				      should be read from a file handle which
				      is initialized by the shell that starts
				      lbsplit.  Here's a simple script example:

					 exec 3<<EOS
					 {
					    ...
					 }
					 EOS

					 lbsplit -n - -FH 3

				       Note that this is particularly helpful
				       if the program needs to read stdin to
				       get the data while the script itself is
				       passed in without any disk files being
				       involved.
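
    The descriptor plumbing can be tried without lbsplit.  What follows is a
    minimal sketch, assuming a POSIX shell:  cat stands in for lbsplit, and
    the script text is a placeholder rather than a real section definition.
    It shows stdin and descriptor 3 being read independently:

```shell
# Bind a here-document to descriptor 3.  Quoting the delimiter ('EOS')
# stops the shell from expanding $ or backslashes inside the script text.
exec 3<<'EOS'
{
   placeholder script text
}
EOS

# Stand-in for lbsplit: stdin supplies the data, descriptor 3 the script.
result=$(printf 'data line\n' | {
    data=$(cat)          # consumes stdin only
    script=$(cat <&3)    # consumes descriptor 3 only
    printf 'data: %s\nscript: %s\n' "$data" "$script"
})
exec 3<&-                # close the descriptor when done
printf '%s\n' "$result"
```

    Quoting the here-document delimiter is a choice made for this sketch;
    it keeps backslashes in section definitions from being altered by the
    shell.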


DESCRIPTION

    lbsplit is similar to csplit.  The general idea is that lbsplit will
    be used to parse lines of text out of a set of files and then group
    them into "sections" upon which commands, specific to that section,
    can be performed.

    Each section is defined by regular expressions that define
    the bounds of the section as well as an algorithm for processing the
    section.

    There is a default algorithm for processing text which does not comprise
    any of the sections:  print it to stdout.  Thus, if lbsplit is used without
    any sectionDescriptors, then it behaves like the cat command.  For
    example, these commands just print the files named on the command line
    to stdout:

       lbsplit file1.txt -S
       lbsplit file1.txt
       lbsplit file1.txt file2.txt file3.txt ...

    If no file names are specified, then no processing occurs.  If a file named
    "-" is specified, then it is assumed to refer to stdin.  But only 1
    file named "-" can occur on the command line without causing an error.

    Note that the -n option eliminates the default action -- or rather,
    converts it into "ignore all lines".  So this:

       lbsplit -n file

    produces no output.

    More than one section descriptor can be specified, and each has its
    own definition of what the section means and its own algorithm for 
    processing the section:

	lbsplit file -S sd1 sd2 sd3 ...

    Where sd1 - sd3 are section descriptors.

    When multiple section descriptors are specified, they are processed in
    sequence.  Text that occurs between the sections will be processed using
    the default algorithm.  Again, the totality of the input file text is 
    processed, as if it were a single giant file, using the section descriptors
    in sequence.


    DEFINING SECTIONS

    Sections are defined as parameters on the command line or in a file.  When 
    specified on the command line, each parameter beginning with '{' is 
    assumed to be a section description.  When sections are specified in a
    file, use -F to name the file.

    It is possible to define unary sections and selector sections.  A unary
    section is just a single section definition and looks basically like this:

      { /regexes/ cmds }

    A selector section is defined like this:

      ? {  {s1} {s2} ... }

    Here, lbsplit will dynamically select which section, s1 .. sN, to apply
    based on the data in the input stream.  Use selector sections to deal with
    data whose exact format varies between some small number of forms.
    
    Each normal section (unary section) definition has five parts:

      1.  the leading {
      2.  the regular expression that marks the beginning of the section
      3.  optionally, the regular expression that marks the end.
      4.  the actions to perform on that section.
      5.  the trailing }

    Zero or more actions can be specified for a section.  If no actions are
    specified, the section is printed to stdout using the default end of line
    sequence.  Actions are a sequence of commands that will be applied to
    each line in the section.

    Here is an example section description:

	lbsplit someFile -S '{/^BEGIN/,/^END/ s/fred/bill/g; }'

    In this invocation, lbsplit will do the following:

      1.  any text which is not within the _first_ BEGIN...END block will be 
	  processed according to the default action (here, it will be printed).

      2.  all text within the first BEGIN...END block, including the lines 
	  containing BEGIN and END, will be processed like this:

	    first, substitute "bill" for all instances of fred

	    second, print the line (since we are not told not to).

    Note that only the first BEGIN...END block in the file will be affected
    by this invocation -- if you want to affect all BEGIN..END blocks
    in the file, terminate the section description like this:

      { ... }+

    or like this:

      { .... } 100   

    In this second form of repetitive section definition, at most 100 BEGIN/END
    blocks will be processed.

    The trailing plus means "one or more matches of this section".  Again,
    even in this case, text between the sections is processed using the
    default algorithm.  Note that repeated sections should normally be used
    only at the end of the list of sections, since they normally consume all
    remaining text in the input stream.  However sections which have only a
    single beginning regex defined, and which are not also marked as 'w'
    sections, can be marked as repeating and still have additional sections
    defined.  See Test30 in the Makefile for an example.
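
    For comparison only:  sed's range address re-arms after each END, so it
    behaves like the repeating '{ ... }+' form, while limiting the edit to
    the first block takes an explicit flag.  The following is a minimal
    sketch using standard sed and awk (not lbsplit), with invented sample
    input and an invented flag name (seen):

```shell
# Sample input with two BEGIN...END blocks (invented for illustration).
sample() {
    printf '%s\n' 'fred outside' 'BEGIN' 'fred one' 'END' \
                  'fred between' 'BEGIN' 'fred two' 'END'
}

# Every block is edited, as with the repeating '{ ... }+' form.
all_blocks=$(sample | sed '/^BEGIN/,/^END/ s/fred/bill/g')

# Only the first block is edited, as with a non-repeating section.
first_block=$(sample | awk '
    !seen && /^BEGIN/ { inblk = 1 }              # enter the first block only
    inblk             { gsub(/fred/, "bill") }   # edit lines inside it
                      { print }                  # default action: print
    inblk && /^END/   { inblk = 0; seen = 1 }    # leave and never re-enter
')

printf '%s\n' "$all_blocks"
printf -- '--\n'
printf '%s\n' "$first_block"
```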

    An action is a sequence of characters ending in a semicolon.  Actions 
    cannot contain a semicolon unless it is escaped using a leading backslash.
    Curly braces do not need to be escaped.

    Actions define the command syntax to be applied to each line in the section.
    At least one action must be supplied or the section will be ignored.  If
    actions are specified, they are applied in sequence to each line in the 
    section --  although some actions affect the section as a whole, not 
    specific lines.  Actions are as follows:

      action;              -- perform some action on the current line of 
			      input text.

      |var|action;         -- perform an action but act on a variable's
			      contents instead of the current line.  Note that
			      some commands affect the section processing
			      as a whole, not just the current line or 
                              variable.

      F;                   -- output this section to a file whose name is
			      dependent on the current section number rather
			      than to stdout.  Only one F statement per section
			      definition is allowed.
			   
      s/this/that/[i1g];   -- regular expression substitution of "that" for
			      "this", either exactly 1 time on the line or
			      globally across the line (based on the last
			      character, either '1' or 'g').
			   
			      "that" can contain references to the matched
			      text and matched sub-fields thereof.  See
			      the man page for sed to understand the syntax.

			      Note that instead of using / as the separator,
			      you can also use:  %, or :.

			      Note also that if a variable context is specified
			      using |var| as a prefix before the 's', then that
			      variable is modified, not the current line.

			      The option, "i", means to perform a case
			      insensitive comparison.
			   
      y/[a-z]/[A-Z]/;      -- In the current line or variable, translate the 
			      characters in the input set to the corresponding 
			      output set.  Note that this does not apply to the 
			      end of line character. See the E command, below.
			   
                              Note that instead of using / as the separator,
                              you can also use:  %, or :.

      E/eolText/;          -- instead of printing \n at the end of each line
			      in the section, print the specified string --
			      which must include \n if you want the lines
			      to remain separated.
			   
                              Note that instead of using / as the separator,
                              you can also use:  %, or :.
			   
      d;                   -- suppress the printing of this line. If this 
			      command is used unconditionally, it will 
			      suppress the printing of the entire section.
			      You probably want to combine it with a 
			      conditional action of some form:

				 /regex/d;
				 2,9d;

			      Note that once used, the line processing stops,
			      so the d action, if used, should occur as near 
			      the end of the line as possible to avoid GREAT
			      difficulty in figuring out why your script 
			      doesn't work! 

			      See also, the 'q' action, below.
			   
      P;                   -- print the line as it stands now.  This allows
			      you to print the line multiple times.  The
			      "P" command prints the line as it currently is,
			      before any subsequent processing.  

			      This command is often combined with variables
			      or with the d or q commands.

      /regex/[o] action;   -- execute action only if the current line matches
      /rg1/,/rg2/[o] action;  the specified regex.
			   
                              Note that instead of using / as the separator,
                              you can also use:  %, or :.

			      Also note that you can use the \{var} references
			      in the regex for a range condition.

			      rg1 and rg2 define a range of lines that the
			      action will be performed on.

			      Note that this range of lines is in no way
			      exclusive to other actions for lines in the 
			      section. That means that you can have multiple 
			      ranges simultaneously active and also have 
			      non-ranged actions performed on the line.  This 
			      kind of action does NOT create a sub-section!

			      The regexes, regex, rg1, and rg2, are optionally
			      followed by option characters:

				i means the regex is case insensitive
				> means the range spans at least 2 lines.

      !/regex/action;      -- execute action only if the current line does NOT
      !/reg1/,/reg2/action;   match the specified regex.

			   
                              Note that instead of using / as the separator,
                              you can also use:  %, or :.

			      See the previous entry for notes on options to
			      regexes.

      2,4action;           -- execute the specified action only if the current
			      line number within the current section is in
			      the range 2-4.  Note that any pair of numbers
			      can be used but the second number must be >=
			      the first.  You can also use $ for the second
			      number, which means "end of the section".

      3action;             -- execute this action only if the current line
			      number within the current section is 3.  Any
			      number may be used (but not $)

      !<line>action;       -- execute action on any but the specified line.

      !<line1>,<line2>a;   -- execute a on any line NOT in the range of 
			      line1 to line2.
			      
			   
      {action; ... }       -- define an action list -- only useful when you are
			      using a conditional action of some kind. 
			   
     p/text/;              -- Print the text using the current end of line
			      string.  You can embed control characters in
			      it using \r, \n, etc.
			   
                              Note that instead of using / as the separator,
                              you can also use:  %, or :.

			      Note also that you can include variable 
			      references in the printed text:   \{varname} is 
			      replaced with variable varname's contents.

     t;                    -- Replace the tabs in the current line with 
			      enough spaces to align to the proper tab
			      position (8 chars per tab per unix std)

     T;                    -- Replace leading space in the line with tabs
			      in 8 char chunks.

     n;                    -- prepend $lineNum\t to the beginning of the line or
			      variable.

     N;                    -- prepend the current section number and \t to the
			      beginning of the line or variable.

     I;                    -- prepend the current line number and \t to the
			      beginning of the line or variable.

     f;                    -- prepend (1) the current input file name, (2) a
			      tab, (3) the line number relative to that file
			      (one based), and (4) a second tab to the
			      beginning of the current line or variable.

     A action;             -- execute the specified action after the last line
			      of the section. Variables modified by the section 
			      are still available.

     B action;             -- execute the specified action before the first line
			      of the section.  Similar to the "1 action;" 
			      definition, see above.

     l/text/;              -- replace the current line or variable with text -- can 
			      include escape sequences etc.  In addition to /, 
			      the % and colon characters can be used as string 
                              delimiters.

			      You can use \{var} to include the contents of 
			      variables in the text.  You can use l//;  to 
			      initialize the line or variable to an empty state.

			      Here's how you initialize a variable to an empty 
                              string:

				|varname|l//;  

			      Ugly but works.

     m;                    -- treat the current line as a variable name,
			      find that variable's value, and replace the
			      current line with that.

     |var|x;               -- replace the variable with current line.

     |var|=                -- replace the variable with current line.  Not 
			      usable without a variable reference.

     |var|+                -- append the current line or variable to the variable
			      named in the variable context -- with an
			      intervening \n.  A substitute command to get rid
			      of it can be written like this:

				 |var|s/\n//g;

			      Here:

				 |var1||var2|+;

			      Will append \n followed by var2 to var1.

     S/varExp/valueExp/;   -- Compute a variable's name from varExp, and store 
			      the value defined by the valueExp in that variable.  
			      The computations involve expanding any \{var} 
			      references in the text (and escape character 
			      interpretations).

     j cnt;                -- left justify the current line or variable in a
			      field of spaces "cnt" wide.

     J cnt;                -- right justify the current line or variable in a
                              field of spaces "cnt" wide.

     q;                    -- stops processing of this section immediately.

     c cutset;             -- selects ranges of columns, like the cut -c
			      unix command.  For example:

				c  1-10,40-99

			      will select the first 10 characters and the 40th
			      through 99th characters from the string,
			      concatenate them into a single string, and
			      replace the current line with that string.

    r [file];              -- print a file instead of the current line.
			      If the file's name is specified in the command,
			      variable expand the name then print it.
			      If the file's name is not specified (r;), then
			      use the current line as the file name and
			      print that.

    g/regex/varlist;       -- get variables from the current line using a
			      regular expression to parse them out.  For
			      example:

				g/stuff\(x1\)crap\(x2\)/match|p1|p2;

			      Here, if the line matches the pattern, variable
			      match will contain a non-empty value, and so
			      will p1 and p2.  Variable p1 will contain
			      "x1", and variable p2 will contain "x2".

			      If the current line does not match the pattern,
			      match, p1, and p2 will be emptied.

    w/regex/action         -- While the regex is true of the current line,
			      execute the action.

    w!/regex/action        -- While the regex is NOT true of the current line,
			      execute the action.
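
    Several of the actions above have close analogues among standard
    utilities, which can help when reasoning about what an action should
    produce.  The following sketch uses only expand, tr, and cut (not
    lbsplit); the pairing with the t, y, and c actions is an approximation
    based on the descriptions above, and the sample strings and the cutset
    1-3,10-12 are invented for illustration:

```shell
# t;  (expand tabs to 8-column stops)  ~  expand
expanded=$(printf 'a\tb\n' | expand)

# y/[a-z]/[A-Z]/;  (character translation)  ~  tr
upper=$(printf 'hello\n' | tr 'a-z' 'A-Z')

# c 1-3,10-12;  (column selection)  ~  cut -c
columns=$(printf 'abcdefghijklmnop\n' | cut -c 1-3,10-12)

printf '%s\n' "$expanded" "$upper" "$columns"
```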


    DEFINING SECTION BOUNDARIES

    Each of the regular expressions that define the start and end of the 
    section are specified like this:

      /pattern/[options]

    The options indicate how to deal with the line containing pattern.
    The trailing options are as follows:

      no options -- means take the default options for this line and 
		    pattern with respect to the match.  Basically this
		    means that if the line contains the pattern then
		    it matches the begin or end of the section.

      !          -- invert the pattern.  If the pattern does not match,
		    then it defines the begin or end of the block, rather
		    the reverse.

      w          -- used only on the begin section regex, it means that the
		    section is defined by all lines that match the single
		    regex, and does not require or allow an end regex.

      i          -- make the comparison case insensitive.

      >          -- ensure that the section ends on a different line than
		    the one on which it starts (use > only in the end
		    regex)


    These options are necessary to deal with different kinds of text blocks.
    For example, consider this example input file:

      blah blah blah
      BEGIN something or other
	stuff in the section
      END
      blah blah blah

    This section could be matched like this:
    
      { /^BEGIN/,/^END/ ...

    In this case, all lines beginning with the first BEGIN and ending with the
    first END will be processed.  If you don't want to include the begin and 
    end lines in the output, say for example, you only want "stuff in the 
    section" to appear in the output, do this:

       { /^BEGIN/,/^END/  /^BEGIN/d; /^END/d; }

    This will select all lines from BEGIN to END, but will suppress
    the printing of the BEGIN and END lines.
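
    The same selection can be cross-checked with plain sed.  This is a
    sketch of the equivalent sed script, not of lbsplit itself; sed -n
    plays the role that the -n option would play in dropping text outside
    the section:

```shell
# Keep only the interior of the BEGIN...END block from the sample input.
inner=$(printf '%s\n' 'blah blah blah' 'BEGIN something or other' \
                      '  stuff in the section' 'END' 'blah blah blah' |
    sed -n '/^BEGIN/,/^END/{
/^BEGIN/d
/^END/d
p
}')
printf '%s\n' "$inner"
```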

    Alternatively, suppose your input text looks like this:

      a1
      a2
      b1
      b2
      c1
      c2
      c3

    In this case, sections can be identified by the first character in
    the line, but there is no clear end of section line to match on.  To
    match all the lines beginning with a, but not include the first line
    containing the b as part of the first section, you use the 'w' option
    to the begin regex in the section, and do not supply an end regex:

      { /^a/w ...}

    Here, the w option means:  process consecutive lines matching /^a/ as
    part of the section.  The first line that does not match ends the
    section and is not part of it.
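
    The "while it matches" rule can be modeled in a few lines of awk.
    This is a sketch of the semantics described above, not lbsplit's
    implementation; the flag name started is invented, and text outside
    the section is simply dropped:

```shell
w_section=$(printf '%s\n' a1 a2 b1 b2 c1 c2 c3 | awk '
    /^a/    { print; started = 1; next }   # matching line: part of the section
    started { exit }                       # first non-match ends the section
')
printf '%s\n' "$w_section"
```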

    And suppose your input file looks like this:

       <H1>Section1</H1>
	 paragraph
       <H1>Section2</H1>
	 par2

    In this case, <H1> and paragraph should go together.  To accomplish this,
    use the following section definition:

      { /<H1>/ action1; ... }

    Here, the lack of the ending regex means to process the <H1> line and all
    lines until, but not including, the next <H1> as part of the section.

    Presumably, EVERYTHING after the final match of the regex is meant to go
    into the final section.
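
    To see how a begin-only pattern carves the input into sections, the
    grouping can be modeled with awk.  This is an illustrative sketch, not
    lbsplit itself:  it tags every line with the number of the section it
    would fall into, assuming the section is repeated for each header:

```shell
numbered=$(printf '%s\n' '<H1>Section1</H1>' '  paragraph' \
                         '<H1>Section2</H1>' '  par2' | awk '
    /<H1>/ { n++ }                # a header line opens the next section
    n      { print n "\t" $0 }    # tag each line with its section number
')
printf '%s\n' "$numbered"
```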


CONDITIONAL EXECUTION OF ACTIONS

    To reiterate from the command actions described earlier, there are
    several forms of conditional execution:

      1.  you can restrict the execution of actions only to lines which 
	  contain a regular expression:

	    /regex/action;    note that the action can be a block:

	    /regex/{action1;a2;a3}

      2.  You can restrict execution to only those lines with a particular
	  line number within a block.

	    2,8d;

	    2,8/stuff/action;  combines 1 and 2

      3.  You can invert regex line selections, and execute the action only if 
	  the line does not contain a specified regular expression:

	    !/regex/action;

      4.  You can create "and" clauses like this:

	     /regex1/  /regex2/ actions;

      5.  You can create "or" clauses like this:

	     /regex1\|regex2/

	     !/r1\|r2/
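
    The "and", "or", and inverted forms follow ordinary regex logic, which
    can be checked with awk.  This is a sketch with invented sample lines,
    not lbsplit syntax; note that awk uses extended regexes, so alternation
    is written | rather than \|:

```shell
lines() { printf '%s\n' 'red apple' 'green apple' 'red brick' 'blue sky'; }

both=$(lines    | awk '/red/ && /apple/')   # "and": both regexes must match
either=$(lines  | awk '/red|sky/')          # "or": alternation in one regex
neither=$(lines | awk '!/red|sky/')         # inverted "or"

printf '%s\n' "$both" -- "$either" -- "$neither"
```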


AUTOMATIC NUMBERING

    To reiterate information from above:

	1.  sections have numbers

	2.  lines within sections have numbers

	3.  lines within the input stream as a whole have numbers.

	4.  lines within input files have numbers.

    You can prepend one of these numbers plus a tab to the beginning of any
    line using one of the following commands:  n, N, f, or I.

    To move the number somewhere else in the line, you can use the
    substitute command:

	N;  s/\(^[^\t]*\)\t\(.*\)/\2:\1/1;

    This command prefixes the current line with the section number and a tab.
    Then, the substitute command discards the tab and puts the section
    number at the end of the line -- preceded by a colon.
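
    The rearrangement itself can be tried with plain sed on a line that
    already carries a number and a tab.  This is a sketch, not lbsplit;
    the input line is invented, and a literal tab is used because \t
    inside a sed regex is not portable:

```shell
tab=$(printf '\t')      # literal tab character
moved=$(printf '12\tsome text\n' |
    sed "s/^\([^$tab]*\)$tab\(.*\)/\2:\1/")
printf '%s\n' "$moved"
```

    The first group must be allowed to match more than one character,
    otherwise numbers of two or more digits are left where they were.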


EXAMPLES

    1:  suppose you want to extract only the generated suppressions from a
	valgrind output, which looks like this:

	  == 459 == Valgrind ...
	  == 459 == Some Error
	  ...
	  {
	    generated suppression text
	  }
	  == 459 == Other Error
	  {
	    generated suppression text
	  }
	  ...

	That is, there is a lot of text in which a small number of {}
	blocks are interspersed.  These blocks begin and end in column 1 of
	the line on which they occur.

	  lbsplit -n valgrind.txt -S '{/^{/,/^}/.;}+'
          
	This command discards all text which is not part of a {} block (-n).
	It simply prints the block, including the {}'s.  The trailing + means
	to repeat the section ad infinitum.
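
	As a cross-check, the same extraction can be written with sed.  This
	is a sketch of an equivalent filter, not of lbsplit; the sample lines
	are abbreviated from the valgrind output shown above:

```shell
blocks=$(printf '%s\n' '== 459 == Some Error' '{' \
                       '  generated suppression text' '}' \
                       '== 459 == Other Error' '{' \
                       '  generated suppression text' '}' |
    sed -n '/^{/,/^}/p')     # print only lines inside {} blocks
printf '%s\n' "$blocks"
```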

    2:  Suppose you want to filter duplicate blocks of text which occur in a
	file formatted like this:

	blockstart
	  middle
	blockend
	...

	The following command can be used:

	 lbsplit -n file \
	       -S '{/^blockstart/,/^blockend/ $/|/; A{$//; p/\n/;};}+' |
	    sort -u | tr '|' '\n'

	This command does the following:

	  1.  it suppresses all text that is not part of a blockstart/blockend
	      pair.

	  2.  it converts the end of line string from \n to |.

	  3.  After the block is finished, it converts the end of line string
	      to nothing, then prints \n.  This leaves the entire text of a 
	      block as a single giant line of text.

	  4.  piping this output to the sort command, and using the -u option,
	      eliminates duplicate blocks (which are now just single lines
	      fed to the sort command).

	  5.  the final 'tr' command converts '|' back to newline so that the
	      blocks have their proper line splits.
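
	The join, sort, split idea can be tried without lbsplit.  In this
	sketch awk stands in for the section processing (steps 1 to 3), and
	the sample input and the names inblk and buf are invented:

```shell
unique=$(printf '%s\n' blockstart '  middle' blockend noise \
                       blockstart '  middle' blockend \
                       blockstart '  other' blockend | awk '
    /^blockstart/       { inblk = 1; buf = "" }    # start collecting a block
    inblk               { buf = buf $0 "|" }       # join lines with |
    inblk && /^blockend/ { print buf; inblk = 0 }  # one block per output line
' | sort -u | tr '|' '\n')
printf '%s\n' "$unique"
```

	Because each joined block ends in '|', the final tr leaves an empty
	line after every block.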

    See FAQ.txt in the source distribution for more examples and ideas.


SEE ALSO

    csplit, split, sed, grep, perl, cut