In this chapter we will tackle simple text processing.
9.1 Learning Objectives
Command Line processing
Regular Expressions and Pattern Matching
Exceptions and error handling
Pointers
File system
9.2 Projectlet - vrex
Quite a few resources, e.g. https://regex101.com, help in understanding regular expressions. For anyone tasked with validating, say, a phone number or a postal code for international applications, such tools make it easy to experiment. This projectlet builds such a tool without any scaffolding such as a web front end: a command line utility that validates a string against a regular expression. As an extension, it validates all the lines in a file against the pattern.
~/bin/vrex -h
vrex 0.1 Nov 25 2024 05:48:52
Usage: vrex regex [<string>] or -f file1 file2 ,,.
-v, --verbosity ARG Verbosity Level
-f, --files Validate lines in the input file(s)
-g, --glob Glob Option
-c, --case-insensitive Case insensitive. (Default sensitive)
Simple usage examples:
~/bin/vrex procedure Procedure
Does not match
The above does not match since case sensitivity is the default option. When we ignore the case, the match is successful as shown:
~/bin/vrex -c procedure Procedure
Matches
On the other hand we could accept either case for just the first character:
The requirements are simple enough, and the GNAT library provides the package GNAT.RegExp with the necessary support. So the environment is rather sparse:
Note that the RegExp is compiled just once, the first time it is needed. The variable pcompiled is of an access type: a pointer, but one that can designate only objects of a specific data type, in this case the compiled expression type declared in GNAT.RegExp. Only access values designating objects of that type can be assigned to it. The default value of this and all access variables is the special value null, which designates no object; its internal representation is left to the implementation.
An assumption is made in this projectlet that the search is for the same expression in a series of files.
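The compile-once idea and the access variable described above can be sketched as follows. This is a minimal illustration rather than the vrex source: the names Vrex_Sketch, Regexp_Access, Validate and Report are hypothetical, and the pattern [Pp]rocedure is chosen to accept either case for just the first character.

```ada
with Ada.Text_IO; use Ada.Text_IO;
with GNAT.Regexp;

procedure Vrex_Sketch is
   --  An access type that can designate compiled expressions only.
   type Regexp_Access is access GNAT.Regexp.Regexp;

   --  Access variables default to null until something is assigned.
   pcompiled : Regexp_Access;

   --  Hypothetical helper: compiles Pattern on the first call only and
   --  reuses the compiled form afterwards (same pattern assumed throughout).
   function Validate (Pattern : String; Candidate : String) return Boolean is
   begin
      if pcompiled = null then
         pcompiled :=
           new GNAT.Regexp.Regexp'
             (GNAT.Regexp.Compile
                (Pattern, Glob => False, Case_Sensitive => True));
      end if;
      --  GNAT.Regexp.Match succeeds only when the whole string matches.
      return GNAT.Regexp.Match (Candidate, pcompiled.all);
   end Validate;

   procedure Report (Candidate : String) is
   begin
      if Validate ("[Pp]rocedure", Candidate) then
         Put_Line (Candidate & ": Matches");
      else
         Put_Line (Candidate & ": Does not match");
      end if;
   end Report;
begin
   Report ("Procedure");
   Report ("procedure");
   Report ("PROcedure");
end Vrex_Sketch;
```

With this pattern the first two candidates match and the third does not, since only the first character is allowed to vary in case.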
9.3 Projectlet - search
The goal of this projectlet is a utility that searches text files for either a candidate string or a regular expression; optionally the candidate can be replaced by a replacement string. This can be used, for example, to convert all occurrences of PROcedure to procedure. The output of such a replacement is created as a new file (not overwriting the input) in a specified directory. The entire operation is driven by the command line:
Processing the command line involves handling a mutually exclusive pair of switches, -s and -r. In addition, the switch -R is optional but must always be accompanied by -o. This is fairly simple but typical of most command line driven utilities, e.g. the C compiler.
Searching for a candidate string in a text file may be considered simple. Searching for a regular expression will need a library.
The input file is not overwritten; the substituted output is written to a file with the same name as the input file, placed in an output directory. This exposes us to the file system domain.
In this projectlet exceptions are used to announce unexpected situations - particularly deep in the implementation.
The implementation then depends on the predefined language environment:
    with Ada.Text_IO;       use Ada.Text_IO;
    with Ada.Strings.Fixed; use Ada.Strings.Fixed;
    with Ada.Directories;
    with GNAT.Regpat;
    with GNAT.Strings;

    with Ada.Text_IO; use Ada.Text_IO;

    with Ada.Directories;
    with Ada.Exceptions;

    with GNAT.Command_Line;
    with GNAT.Source_Info;

Packages referenced in the above, particularly Ada.Exceptions, GNAT.Source_Info, GNAT.Regpat and GNAT.Strings, will be encountered repeatedly.
9.3.2 Command Line Processing
The workhorse for command line processing is the GNAT.Command_Line package. The switches that need to be handled are declared using this package, and the parsed values are then checked for consistency:
    --  A replacement was requested, so an output directory is mandatory.
    if replacement.all'Length > 0 then
       Put ("Output Dir ");
       Put (outputdir.all);
       New_Line;
       if outputdir.all'Length < 1 then
          raise CLI_ERROR
            with "Please provide an output dir for edited files";
       end if;
       --  The directory must already exist ...
       if not Ada.Directories.Exists (outputdir.all) then
          raise CLI_ERROR with "Non existent output dir";
       end if;
       --  ... and must be a directory, not an ordinary file.
       if Ada.Directories.Kind (outputdir.all) /= Ada.Directories.Directory
       then
          raise CLI_ERROR with "Provide a directory for output";
       end if;
    end if;

Inconsistent inputs are reported through the exception mechanism: raising an exception immediately transfers control to the nearest enclosing exception handler or, if none is provided, propagates out to the enclosing environment.
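The switch declarations themselves are not reproduced in this section. A minimal sketch of how the four switches might be declared with GNAT.Command_Line is shown here; the procedure name Cli_Sketch, the help texts, and the way the mutual exclusion of -s and -r is checked are assumptions made for illustration, not the projectlet's actual code.

```ada
with Ada.Exceptions;
with Ada.Text_IO;
with GNAT.Command_Line; use GNAT.Command_Line;
with GNAT.Strings;

procedure Cli_Sketch is
   Config : Command_Line_Configuration;

   --  Filled in by Getopt; a switch that is not given yields an empty string.
   candidate    : aliased GNAT.Strings.String_Access;
   candidateexp : aliased GNAT.Strings.String_Access;
   replacement  : aliased GNAT.Strings.String_Access;
   outputdir    : aliased GNAT.Strings.String_Access;

   CLI_ERROR : exception;
begin
   Set_Usage (Config, Usage => "[switches] file1 file2 ...");

   --  The trailing ':' means the switch requires an argument.
   Define_Switch
     (Config, candidate'Access, "-s:", Help => "Candidate string");
   Define_Switch
     (Config, candidateexp'Access, "-r:",
      Help => "Candidate regular expression");
   Define_Switch
     (Config, replacement'Access, "-R:", Help => "Replacement string");
   Define_Switch
     (Config, outputdir'Access, "-o:", Help => "Output directory");

   Getopt (Config);

   --  Assumed check: -s and -r may not be given together.
   if candidate.all'Length > 0 and then candidateexp.all'Length > 0 then
      raise CLI_ERROR with "Provide either -s or -r, not both";
   end if;

   Ada.Text_IO.Put_Line ("Switches are consistent");
exception
   when E : CLI_ERROR =>
      Ada.Text_IO.Put_Line (Ada.Exceptions.Exception_Message (E));
   when Invalid_Switch | Invalid_Parameter | Exit_From_Command_Line =>
      null;  --  Getopt normally reports these (or shows the help) itself
end Cli_Sketch;
```

Declaring the switches this way also gives a -h/--help listing for free, in the same layout as the vrex help output shown earlier.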
9.3.3 Regular Expression handling
If the command line specifies a regular expression, either as a search candidate or as a replacement candidate, the same regular expression is applied to each file on the command line.
Optimal use of regular expressions involves compiling the expression once and using the compiled form repeatedly, e.g. for each line in the input, as illustrated:
    Open (file, In_File, filename);
    while not End_Of_File (file) loop
       Get_Line (file, line, linelength);
       linenumber := linenumber + 1;
       --  matched (0) receives the location of the whole match, if any
       GNAT.RegPat.Match (pcompiled.all, line (1 .. linelength), matched);
       if matched (0) /= GNAT.RegPat.No_Match then
          Put (linenumber'Image);
          Set_Col (6);
          Put (" : ");
          --  Text before the match, the match in brackets, then the rest
          Put (line (1 .. matched (0).First - 1));
          Put ("[");
          Put (line (matched (0).First .. matched (0).Last));
          Put ("]");
          Put (line (matched (0).Last + 1 .. linelength));
          New_Line;
          Count := Count + 1;
       end if;
    end loop;
    Close (file);
Of course if the candidate is a simple string, the compilation and the compiled expression are irrelevant but the general approach is identical.
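For the simple string case, a minimal sketch using Ada.Strings.Fixed.Index is given here. The names Search_Sketch and Mark are hypothetical, and the helper assumes a non-empty candidate (Index raises Pattern_Error for an empty pattern). It brackets the first occurrence in the same way as the regular expression version.

```ada
with Ada.Text_IO;       use Ada.Text_IO;
with Ada.Strings.Fixed; use Ada.Strings.Fixed;

procedure Search_Sketch is
   --  Returns Line with the first occurrence of Candidate in brackets,
   --  or an empty string when Candidate does not occur in Line.
   function Mark (Line : String; Candidate : String) return String is
      First : constant Natural := Index (Line, Candidate);
   begin
      if First = 0 then
         return "";
      end if;
      declare
         Last : constant Positive := First + Candidate'Length - 1;
      begin
         return Line (Line'First .. First - 1)
           & "[" & Line (First .. Last) & "]"
           & Line (Last + 1 .. Line'Last);
      end;
   end Mark;
begin
   Put_Line (Mark ("procedure Main is", "Main"));
   Put_Line (Mark ("begin", "Main"));
end Search_Sketch;
```

The first call prints the line with Main bracketed; the second prints an empty line because the candidate is absent.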
If a replacement is requested, then instead of printing the results, the candidate is replaced by the replacement and the result is written to a copy of the input file. Any lines which do not contain the candidate are copied verbatim to the output file:
    Open (file, In_File, filename);
    Create (outfile, Out_File, outfilename);
    while not End_Of_File (file) loop
       Get_Line (file, line, linelength);
       GNAT.RegPat.Match (pcompiled.all, line (1 .. linelength), matched);
       if matched (0) = GNAT.RegPat.No_Match then
          --  No candidate on this line: copy it unchanged
          Put_Line (outfile, line (1 .. linelength));
       else
          --  Text before the match, the replacement, then the rest
          Put (outfile, line (1 .. matched (0).First - 1));
          Put (outfile, replacement);
          Put (outfile, line (matched (0).Last + 1 .. linelength));
          New_Line (outfile);
          Count := Count + 1;
       end if;
    end loop;
    Close (outfile);
    Close (file);
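How outfilename is derived is not shown in the listing. One plausible way, assuming the output keeps the simple name of the input file and is placed in the output directory, uses Ada.Directories; the names Outname_Sketch and Output_Name are hypothetical.

```ada
with Ada.Text_IO; use Ada.Text_IO;
with Ada.Directories;

procedure Outname_Sketch is
   --  Builds the output path: the simple name of Input_File (directory
   --  part removed) placed inside Output_Dir.
   function Output_Name
     (Input_File : String; Output_Dir : String) return String is
   begin
      return Ada.Directories.Compose
        (Containing_Directory => Output_Dir,
         Name                 => Ada.Directories.Simple_Name (Input_File));
   end Output_Name;
begin
   Put_Line (Output_Name ("src/main.adb", "edited"));
end Outname_Sketch;
```

Compose inserts the directory separator of the host system, so the same code serves on systems with different path conventions.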
9.3.4 Exceptions
Raising and handling exceptions is a way to deal with anomalies, particularly in deeply nested code. Command line processing utilizes this mechanism:
    cli.ProcessCommandLine;
    loop
       declare
          arg : constant String := cli.GetNextArgument;
       begin
          --  An empty argument signals the end of the file list
          if arg'Length < 1 then
             exit;
          end if;
          ...
          if cli.Replacement.all'Length >= 1 then
             if cli.Candidate.all'Length >= 1 then
                impl.Replace
                  (arg, cli.Candidate.all, cli.Replacement.all,
                   cli.outputdir.all);
             elsif cli.CandidateExp.all'Length >= 1 then
                impl.ReplaceRegEx
                  (arg, cli.CandidateExp.all, cli.Replacement.all,
                   cli.outputdir.all);
             end if;
          else
             if cli.Candidate.all'Length >= 1 then
                impl.Search (arg, cli.Candidate.all);
             elsif cli.CandidateExp.all'Length >= 1 then
                impl.SearchRegEx (arg, cli.CandidateExp.all);
             end if;
          end if;
       end;
    end loop;
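The handler side is not shown in the listing. The following minimal sketch shows how an exception raised deep in the code might be caught at the top level and reported using Ada.Exceptions, with GNAT.Source_Info supplying the location of the raise. Handler_Sketch and Check_Output_Dir are hypothetical names; CLI_ERROR mirrors the exception used in the command line checks.

```ada
with Ada.Text_IO; use Ada.Text_IO;
with Ada.Exceptions;
with GNAT.Source_Info;

procedure Handler_Sketch is
   CLI_ERROR : exception;

   --  Stands in for a check buried deep in the implementation.
   procedure Check_Output_Dir (Dir : String) is
   begin
      if Dir'Length < 1 then
         --  Source_Location yields "file:line" of this raise statement.
         raise CLI_ERROR
           with GNAT.Source_Info.Source_Location
                & ": please provide an output dir";
      end if;
   end Check_Output_Dir;
begin
   Check_Output_Dir ("");
   Put_Line ("This line is skipped when the exception is raised");
exception
   when E : CLI_ERROR =>
      --  Control arrives here directly from the raise statement.
      Put_Line
        (Standard_Error,
         "Error: " & Ada.Exceptions.Exception_Message (E));
end Handler_Sketch;
```

Without the handler, the exception would propagate out of the main procedure and the run-time would terminate the program with its own diagnostic.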
9.4 Stretch
Instead of introducing special markers in the output when candidates are found in the input, highlight the text in the output for the strings found. Optionally find all instances in each line instead of stopping at the first successful find. Optionally handle binary files instead of just text files.
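As a starting point for finding all instances in a line, GNAT.Regpat allows a match to begin at a given position through the Data_First parameter. The sketch below is one possible approach, not the projectlet's code: Find_All_Sketch and Mark_All are hypothetical names, and the loop simply stops if the expression matches an empty string.

```ada
with Ada.Text_IO;
with Ada.Strings.Unbounded; use Ada.Strings.Unbounded;
with GNAT.Regpat;           use GNAT.Regpat;

procedure Find_All_Sketch is
   --  Returns Line with every non-overlapping match in brackets.
   function Mark_All (Expression : String; Line : String) return String is
      --  Compiled once, then reused for every search in the line.
      Compiled : constant Pattern_Matcher := Compile (Expression);
      Matched  : Match_Array (0 .. 0);
      Start    : Natural := Line'First;
      Result   : Unbounded_String;
   begin
      while Start <= Line'Last loop
         Match (Compiled, Line, Matched, Data_First => Start);
         exit when Matched (0) = No_Match;
         --  Guard: an empty match would never advance Start.
         exit when Matched (0).Last < Matched (0).First;
         Append (Result, Line (Start .. Matched (0).First - 1));
         Append
           (Result,
            "[" & Line (Matched (0).First .. Matched (0).Last) & "]");
         Start := Matched (0).Last + 1;
      end loop;
      --  Whatever follows the last match is copied unchanged.
      Append (Result, Line (Start .. Line'Last));
      return To_String (Result);
   end Mark_All;
begin
   Ada.Text_IO.Put_Line
     (Mark_All ("[Pp][Rr][Oo]cedure", "PROcedure A; procedure B;"));
end Find_All_Sketch;
```

The same loop structure would serve for replacement: append the replacement string instead of the bracketed match.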
Code examples in the book have all been produced by the tool codemd which processes text files, looking for markup commands. Markup commands are specified and compiled as regular expressions somewhat patterned as above. The language agnostic implementation at https://gitlab.com/ada23/codemd.git can be improved e.g. to provide a caption and numbering for each segment of code.
In the GNAT programming system, there is a third tool: the family of packages GNAT.Spitbol, which provides arguably the most powerful pattern matching support. Based on the decades old SNOBOL pattern matching system, this set of packages brings the same power through an API. Implementing a preprocessor for C that converts enum definitions to be more like Ada, e.g. by adding support for attributes like Image, Succ and Pred, is a worthwhile exercise.