9  Text Processing

In this chapter we will tackle simple text processing.

9.1 Learning Objectives

  • Command Line processing

  • Regular Expressions and Pattern Matching

  • Exceptions and error handling

  • Pointers

  • File system

9.2 Projectlet - vrex

Quite a few resources, e.g. https://regex101.com, help in understanding regular expressions. For anyone tasked with validating, say, a phone number or postal code for international applications, such tools make it easy to experiment. This project builds such a tool, but without scaffolding such as a web front end: a command line utility that validates a string against a regular expression. As an extension, it validates all the lines in a file against the pattern.

~/bin/vrex -h
vrex 0.1 Nov 25 2024 05:48:52
Usage: vrex regex [<string>] or -f  file1 file2 ,,.

 -v, --verbosity ARG    Verbosity Level
 -f, --files            Validate lines in the input file(s)
 -g, --glob             Glob Option
 -c, --case-insensitive Case insensitive. (Default sensitive)

Simple usage examples:

~/bin/vrex procedure Procedure
Does not match

The above does not match since case sensitivity is the default option. When we ignore the case, the match is successful as shown:

~/bin/vrex -c procedure Procedure
Matches

On the other hand we could accept either case for just the first character:

~/bin/vrex "[pP]rocedure" Procedure
Matches

And match each line of a file:

~/bin/vrex "[pP]rocedure" -f ../../toolkit/examples/vrex/src/*.ad*
Pattern [pP]rocedure ../../toolkit/examples/vrex/src/cli.adb
 0 lines matched
Pattern [pP]rocedure ../../toolkit/examples/vrex/src/cli.ads
 0 lines matched
Pattern [pP]rocedure ../../toolkit/examples/vrex/src/impl.adb
 0 lines matched
Pattern [pP]rocedure ../../toolkit/examples/vrex/src/impl.ads
 0 lines matched
Pattern [pP]rocedure ../../toolkit/examples/vrex/src/vrex.adb
 0 lines matched

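No lines matched above because vrex matches the pattern against the entire line, not against a part of it. Wrapping the pattern in .* on both sides reports every line that contains it: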

~/bin/vrex ".*[pP]rocedure.*" -f ../../toolkit/examples/vrex/src/*.ad*
Pattern .*[pP]rocedure.* ../../toolkit/examples/vrex/src/cli.adb
 16    :    procedure StringArg (name : String; ptr : GNAT.Strings.String_Access) is
 27    :    procedure Show_Arguments is
 33    :    procedure SwitchHandler
 47    :    procedure ProcessCommandLine is
 4 lines matched
Pattern .*[pP]rocedure.* ../../toolkit/examples/vrex/src/cli.ads
 15    :    procedure ProcessCommandLine;
 1 lines matched
Pattern .*[pP]rocedure.* ../../toolkit/examples/vrex/src/impl.adb
 20    :     procedure Matches( pattern : String ; filename : String ;
 1 lines matched
Pattern .*[pP]rocedure.* ../../toolkit/examples/vrex/src/impl.ads
 5     :     procedure Matches( pattern : String ; filename : String ;
 1 lines matched
Pattern .*[pP]rocedure.* ../../toolkit/examples/vrex/src/vrex.adb
 4     : procedure Vrex is
 1 lines matched
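
The difference between the two runs can be reproduced in a few lines. The following is a minimal sketch, not part of vrex itself; the procedure name Whole_Line_Demo and the sample line are chosen only for illustration. It relies on GNAT.RegExp.Match succeeding only when the whole string matches the expression.

```ada
with Ada.Text_IO; use Ada.Text_IO;
with GNAT.RegExp;

procedure Whole_Line_Demo is
   --  Illustrative sample line, similar to one reported above.
   Line : constant String := "procedure Vrex is";

   --  The same pattern, bare and wrapped in ".*" on both sides.
   Bare    : constant GNAT.RegExp.Regexp :=
     GNAT.RegExp.Compile ("[pP]rocedure");
   Wrapped : constant GNAT.RegExp.Regexp :=
     GNAT.RegExp.Compile (".*[pP]rocedure.*");
begin
   --  FALSE: the line contains more than just the word.
   Put_Line (Boolean'Image (GNAT.RegExp.Match (Line, Bare)));

   --  TRUE: ".*" absorbs the text before and after the word.
   Put_Line (Boolean'Image (GNAT.RegExp.Match (Line, Wrapped)));

   --  TRUE: here the string is exactly the word.
   Put_Line (Boolean'Image (GNAT.RegExp.Match ("procedure", Bare)));
end Whole_Line_Demo;
```

Saved as whole_line_demo.adb, this should build with gnatmake whole_line_demo.adb.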

9.2.1 Implementation

The requirements are simple enough, and the GNAT run-time library provides the package GNAT.RegExp with the necessary support. So the environment, i.e. the list of packages withed by the implementation, is rather sparse:

~/bin/codemd ../../toolkit/examples/vrex/src/impl.adb -x Environment -l
0002 | with Ada.Text_Io; use Ada.Text_Io ;
0003 | with GNAT.RegExp ;

To match a string argument:

~/bin/codemd ../../toolkit/examples/vrex/src/impl.adb -x RegExp -l
0010 |     function Matches( pattern : String ; line : String ;
0011 |                       glob : boolean := false ; caseinsensitive : boolean := false ) return boolean is
0012 |         exp : GNAT.RegExp.RegExp := GNAT.RegExp.Compile( pattern , glob , not caseinsensitive );
0013 |     begin
0014 |         return GNAT.RegExp.Match( line , exp  );
0015 |     end Matches ;
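
To see how the two flags affect the result, the function can be exercised in a small standalone program. This is a sketch under assumptions: the procedure name Flags_Demo and the sample arguments are illustrative, and the function body is reproduced locally so that the example is self-contained. The glob calls use shell-style wildcards, which GNAT.RegExp.Compile accepts when its Glob parameter is True.

```ada
with Ada.Text_IO; use Ada.Text_IO;
with GNAT.RegExp;

procedure Flags_Demo is

   --  Local copy of the function shown above, for a standalone example.
   function Matches
     (Pattern          : String;
      Line             : String;
      Glob             : Boolean := False;
      Case_Insensitive : Boolean := False) return Boolean
   is
      Exp : constant GNAT.RegExp.Regexp :=
        GNAT.RegExp.Compile (Pattern, Glob, not Case_Insensitive);
   begin
      return GNAT.RegExp.Match (Line, Exp);
   end Matches;

begin
   --  FALSE: matching is case sensitive by default.
   Put_Line (Boolean'Image (Matches ("procedure", "Procedure")));

   --  TRUE: case is ignored on request.
   Put_Line (Boolean'Image
     (Matches ("procedure", "Procedure", Case_Insensitive => True)));

   --  TRUE, then FALSE: "*" is a wildcard when Glob is True.
   Put_Line (Boolean'Image (Matches ("*.adb", "impl.adb", Glob => True)));
   Put_Line (Boolean'Image (Matches ("*.adb", "impl.ads", Glob => True)));
end Flags_Demo;
```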

And match each line of a file:

~/bin/codemd ../../toolkit/examples/vrex/src/impl.adb -x ExpCompile -l
0019 |     pcompiled : access GNAT.RegExp.RegExp ;
0020 |     procedure Matches( pattern : String ; filename : String ;
0021 |                        glob : boolean := false ; 
0022 |                        caseinsensitive : boolean := false) is
0023 |         file : File_Type ;
0024 |         line : String(1..MAX_LENGTH);
0025 |         linelength : Natural ;
0026 |         lineno : Natural := 0 ;
0027 |         count : Natural := 0 ;
0028 |     begin
0029 |         if pcompiled = null
0030 |         then
0031 |             pcompiled := new Gnat.RegExp.RegExp ;
0032 |             pcompiled.all := Gnat.RegExp.Compile( pattern , glob , not caseinsensitive );
0033 |         end if ;

...

0041 |             Get_Line(file,line,linelength);
0042 |             lineno := lineno + 1 ;
0043 |             if GNAT.RegExp.Match( line(1..linelength) , pcompiled.all  )
0044 |             then
0045 |                 count := count + 1 ;
0046 |                 Put(lineno'Image);
0047 |                 Set_Col(8);
0048 |                 Put(": ");
0049 |                 Put_Line(line(1..linelength));
0050 |             end if ;

...

0055 |         Put(count'Image);
0056 |         Put_Line(" lines matched");
0057 |     end Matches ;

Note that the RegExp is compiled just once, on the first call. The variable pcompiled is an access variable: a pointer, but one that can only designate objects of a specific data type, GNAT.RegExp.RegExp in this case. Only access values designating other GNAT.RegExp.RegExp objects can be assigned to it. The default value of this and all access variables is the special value null. Internally null may be represented by any value, but it does not designate any object.

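This projectlet assumes that the same expression is searched for in a series of files. Because the compiled expression is kept after the first call, a different pattern, glob or case setting passed in a later call would be ignored.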
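
The behaviour of the access variable and the consequence of compiling only once can be sketched in isolation. The names Compile_Once_Demo, Regexp_Access and Line_Matches are hypothetical, and the sketch uses a named access type rather than the anonymous one in the listing above; it is an illustration, not the vrex implementation.

```ada
with Ada.Text_IO; use Ada.Text_IO;
with GNAT.RegExp;

procedure Compile_Once_Demo is

   type Regexp_Access is access GNAT.RegExp.Regexp;

   --  No initial value is given, so this starts out as null.
   Compiled : Regexp_Access;

   function Line_Matches
     (Pattern : String; Line : String) return Boolean is
   begin
      --  Allocate and compile on the first call only.
      if Compiled = null then
         Compiled :=
           new GNAT.RegExp.Regexp'(GNAT.RegExp.Compile (Pattern));
      end if;
      return GNAT.RegExp.Match (Line, Compiled.all);
   end Line_Matches;

begin
   --  TRUE: nothing has been compiled yet.
   Put_Line (Boolean'Image (Compiled = null));

   --  TRUE: the first call compiles the pattern and matches.
   Put_Line (Boolean'Image
     (Line_Matches (".*procedure.*", "procedure Vrex is")));

   --  FALSE: the access value now designates an object.
   Put_Line (Boolean'Image (Compiled = null));

   --  FALSE: the new pattern is ignored, the first one is still used.
   Put_Line (Boolean'Image
     (Line_Matches (".*function.*", "function F return Boolean")));
end Compile_Once_Demo;
```

The last call shows why the single compilation is only safe when every call uses the same pattern and options.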

9.4 Stretch

  • Instead of introducing special markers in the output when candidates are found in the input, highlight the text in the output for the strings found. Optionally find all instances in each line instead of stopping at the first successful find. Optionally handle binary files instead of just text files.

  • Code examples in the book have all been produced by the tool codemd, which processes text files looking for markup commands. Markup commands are specified and compiled as regular expressions, somewhat as in the pattern above. The language-agnostic implementation at https://gitlab.com/ada23/codemd.git can be improved, e.g. to provide a caption and numbering for each segment of code.

  • The GNAT programming system has a third tool: the GNAT.Spitbol family of packages, which provides arguably the most powerful pattern matching support. Based on the decades-old SNOBOL pattern matching system, this set of packages brings the same power through an API. Implementing a preprocessor for C that makes enum definitions more like Ada's, e.g. by adding support for attributes like Image, Succ, Pred, is a worthwhile exercise.
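
For the first stretch item, finding every occurrence within a line requires the position of each match, which GNAT.RegExp.Match does not report. One possible starting point, offered here as an assumption rather than as the approach taken in the sample implementation, is the package GNAT.Regpat, which searches within a string and returns match locations. The procedure name Find_All_Demo and the sample line are illustrative.

```ada
with Ada.Text_IO; use Ada.Text_IO;
with GNAT.Regpat;

procedure Find_All_Demo is
   use GNAT.Regpat;

   Matcher : constant Pattern_Matcher := Compile ("[pP]rocedure");

   --  Illustrative line containing the pattern more than once.
   Line  : constant String := "procedure A; Procedure B;";

   Found : Match_Array (0 .. 0);   --  slot 0 holds the whole match
   Start : Positive := Line'First;
   Count : Natural := 0;
begin
   while Start <= Line'Last loop
      --  Search from Start onwards instead of from the beginning.
      Match (Matcher, Line, Found, Data_First => Start);
      exit when Found (0) = No_Match;

      Count := Count + 1;
      Put_Line (Found (0).First'Image & " .." & Found (0).Last'Image
                & " : " & Line (Found (0).First .. Found (0).Last));

      --  Resume just after the end of this occurrence.
      Start := Found (0).Last + 1;
   end loop;

   Put_Line (Count'Image & " occurrences");
end Find_All_Demo;
```

The First and Last positions of each occurrence are what a highlighting routine would need in order to mark up the matched text.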

9.5 Sample Implementation

Repository: toolkit
Directory: examples/vrex
Directory: examples/frep