[m-rev.] diff: enhancements of lex

Holger Krug hkrug at rationalizer.com
Tue Aug 14 20:37:33 AEST 2001


For review by Ralph Becket.

Branches: main

Hi Ralph,

If any cosmetic changes remain to be made, it would be nice if you
could make them on your copy of the code, commit to the repository,
and inform me by sending a relative diff.

Thanks, Holger

Summary of changes:

* the value contained in a token may now belong to an arbitrary
  Mercury type, not only string as before
* lexemes may now contain a function that computes an arbitrary token
  value from the string matched
* defining lexemes is now a little more involved, because the user
  has to supply higher-order functions
* the advantage is that no postprocessing is needed to evaluate
  the string, hence the interface to the consumer of the tokens is
  cleaner
* predicate `lex__init/3' added, whose third argument is a predicate
  that succeeds on tokens which are to be ignored
* utility predicate `ignore/2' added, which simplifies the definition
  of the ignore predicates needed in calls to `lex__init/3'
* typeclass regexp(T) added; the definition of type regexp is now
  hidden from the user
* function `->' added, which is a useful wrapper when defining lexemes
* for more information about these changes see the mercury-reviews
  thread starting at mercury-reviews/mercury-reviews.0107/0378.html and 
  continued at mercury-reviews/mercury-reviews.0108/0028.html
* common code factored out into the predicates process_any_winner and
  process_eof
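
Taken together, the changes amount to the usage pattern below. This is
a minimal sketch assembled from the README example in this diff; it has
not been compiled, and helper names such as `make_lexer' and `read_all'
are illustrative only:

```mercury
% Sketch only, based on the README example in this diff; not compiled.
:- type token ---> comment ; id(string) ; num(int) ; space.

    % Build a lexer whose lexemes compute token values directly
    % (no postprocessing of matched strings) and which skips `space'.
:- func make_lexer = lexer(token, io__state).

make_lexer = lex__init(Lexemes, lex__read_from_stdin, ignore(space)) :-
    Lexemes = [
        ('%' ++ *(dot))  -  return(comment),
        identifier       -  (func(Match) = id(Match)),
        signed_int       -  (func(Match) = num(string__det_to_int(Match))),
        whitespace       -  return(space)
    ].

    % Drain the token stream; `read_all' is an illustrative helper.
:- pred read_all(list(token), lexer_state(token, io__state),
        lexer_state(token, io__state)).
:- mode read_all(out, di, uo) is det.

read_all(Tokens, State0, State) :-
    lex__read(Result, State0, State1),
    (
        Result = ok(Token),
        read_all(Tokens0, State1, State),
        Tokens = [Token | Tokens0]
    ;
        Result = eof,
        Tokens = [],
        State = State1
    ;
        Result = error(_Msg, _Offset),  % NB: character offset, not a line
        Tokens = [],                    % number (see the README note)
        State = State1
    ).
```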

Index: README
===================================================================
RCS file: /home/mercury1/repository/mercury/extras/lex/README,v
retrieving revision 1.1
diff -u -r1.1 README
--- README	2001/02/21 16:29:36	1.1
+++ README	2001/08/14 10:25:53
@@ -1,12 +1,14 @@
 lex 1.0 (very alpha)
 Fri Aug 25 17:54:28  2000
 Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
-
     THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
     BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
     BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
-
-
+Sun Aug  5 16:15:27 UTC 2001
+Copyright (C) 2001 The Rationalizer Intelligent Software AG
+    The changes made by Rationalizer are contributed under the terms 
+    of the GNU Free Documentation License - see the file COPYING in the
+    Mercury Distribution.
 
 This package defines a lexer for Mercury.  There is plenty of scope
 for optimization, however it is reasonably efficient and does provide
@@ -23,25 +25,32 @@
 
     :- type token
         --->    comment
-        ;       id
-        ;       num.
+        ;       id(string)
+        ;       num(int)
+        ;       space.
 
 3. Set up a list of annotated_lexemes.
 
     Lexemes = [
-        lexeme(noval(comment),      (atom('%'), star(dot))),
-        lexeme(value(id),           identifier),
-        lexeme(ignore,              whitespace)
+	('%' ++ *(dot))         -  return(comment),
+	identifier              -  (func(Match) = id(Match)),
+	signed_int              -  (func(Match) = num(string__det_to_int(Match))),
+	whitespace              -  return(space)
     ]
 
-noval tokens are simply identified;
-value tokens are identified and returned with the string matched;
-ignore regexps are simply passed over.
 
+A lexeme is a pair (RegExp - TokFn) where RegExp is a
+regular expression and TokFn is a token_creator function mapping 
+the string matched by RegExp to a token value.
+
 4. Set up a lexer with an appropriate read predicate (see the buf module).
 
     Lexer = lex__init(Lexemes, lex__read_from_stdin)
 
+    or:
+
+    Lexer = lex__init(Lexemes, lex__read_from_stdin, ignore(space))
+     
 5. Obtain a live lexer state.
 
     State0 = lex__start(Lexer, IO0)
@@ -49,12 +58,21 @@
 6. Use it to lex the input stream.
 
     lex__read(Result, State0, State1),
-    ( Result = ok(NoValToken), ...
-    ; Result = ok(ValueToken, String), ...
-    ; Result = error(OffsetInInputStream), ...
+    ( Result = ok(Token), ...
+    ; Result = error(Message, OffsetInInputStream), ...
     ; Result = eof, ...
     )
 
+    NOTE: The result type of lex__read is io__read_result(token).
+    io__read_result is documented in the library file io.m as:
+    :- type io__read_result(T)      --->    ok(T)
+                                    ;       eof
+                                    ;       error(string, int).
+                                            % error message, line number
+    In contrast, the `int' that lex returns in the case of an error
+    is not the line number but the character offset.  Hence be
+    careful when processing lex errors.
+
 7. If you need to manipulate the source object, you can.
 
     lex__manipulate_source(io__print("Not finished yet?"), State1, State2)
@@ -69,7 +87,12 @@
 and the option to write out a compilable source file for the lexer.
 
 
+OPPORTUNITIES FOR MODULARIZATION
 
+1. Remove regexp functionality from lex.m and lex.regexp.m and put it into 
+   a distinct regexp library.
+
+
 OPPORTUNITIES FOR OPTIMIZATION
 
 1. Move from chars to bytes.
@@ -78,4 +101,16 @@
 3. Implement the first-byte optimization whereby the set of `live lexemes'
 is decided by the first byte read in on a lexing pass.
 4. Implement state machine minimization (may or may not be worthwhile.)
+
+
+FEATURES TO ADD:
+
+1. Symbol table management (additional parameters for the user-defined
+   predicates containing the symbol table before and after processing
+   a lexeme)
+2. func (string) = regexp, where the function parameter contains a
+   regexp definition in a form like that used in Perl, awk, etc.
+3. line# as part of the offset
+4. extend the lexer interface somehow to get more detailed information
+   about the token or error position
 
Index: lex.buf.m
===================================================================
RCS file: /home/mercury1/repository/mercury/extras/lex/lex.buf.m,v
retrieving revision 1.1
diff -u -r1.1 lex.buf.m
--- lex.buf.m	2001/02/21 16:29:36	1.1
+++ lex.buf.m	2001/08/14 10:25:53
@@ -1,8 +1,9 @@
 % ---------------------------------------------------------------------------- %
+% vim: ts=4 sw=4 et tw=0 wm=0 ff=unix
+%
 % lex.buf.m
 % Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
 % Sat Aug 19 16:56:30 BST 2000
-% vim: ts=4 sw=4 et tw=0 wm=0 ff=unix
 %
 %   THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
 %   BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
@@ -94,7 +95,7 @@
 
 :- interface.
 
-:- import_module int, array, char, bool, string, io.
+:- import_module array, char, bool, string.
 
 
 
@@ -171,7 +172,6 @@
 :- implementation.
 
 :- import_module exception.
-
 
 
     % The amount the buffer is grown by if (a) more space is
Index: lex.convert_NFA_to_DFA.m
===================================================================
RCS file: /home/mercury1/repository/mercury/extras/lex/lex.convert_NFA_to_DFA.m,v
retrieving revision 1.1
diff -u -r1.1 lex.convert_NFA_to_DFA.m
--- lex.convert_NFA_to_DFA.m	2001/02/21 16:29:36	1.1
+++ lex.convert_NFA_to_DFA.m	2001/08/14 10:25:53
@@ -1,8 +1,10 @@
 %-----------------------------------------------------------------------------
+%
+% vim: ts=4 sw=4 et tw=0 wm=0 ff=unix
+%
 % lex.convert_NFA_to_DFA.m
 % Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
 % Fri Aug 18 12:30:25 BST 2000
-% vim: ts=4 sw=4 et tw=0 wm=0 ff=unix
 %
 % Powerset construction used to transform NFAs into DFAs.
 %
Index: lex.lexeme.m
===================================================================
RCS file: /home/mercury1/repository/mercury/extras/lex/lex.lexeme.m,v
retrieving revision 1.1
diff -u -r1.1 lex.lexeme.m
--- lex.lexeme.m	2001/02/21 16:29:36	1.1
+++ lex.lexeme.m	2001/08/14 10:25:53
@@ -1,16 +1,22 @@
 %-----------------------------------------------------------------------------
+%
+% vim: ts=4 sw=4 et tw=0 wm=0 ff=unix
+%
 % lex.lexeme.m
-% Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
 % Sat Aug 19 08:22:32 BST 2000
-% vim: ts=4 sw=4 et tw=0 wm=0 ff=unix
+% Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
+%   THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
+%   BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
+%   BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
+% Thu Jul 26 07:45:47 UTC 2001
+% Copyright (C) 2001 The Rationalizer Intelligent Software AG
+%   The changes made by Rationalizer are contributed under the terms 
+%   of the GNU Lesser General Public License.
 %
 % A lexeme combines a token with a regexp.  The lexer compiles
 % lexemes and returns the longest successul parse in the input
 % stream or an error if no match occurs.
 %
-%   THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
-%   BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
-%   BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
 %
 %----------------------------------------------------------------------------- %
 
@@ -22,15 +28,17 @@
 
 :- type compiled_lexeme(T)
     --->    compiled_lexeme(
-                clxm_token              :: T,
-                clxm_state              :: state_no,
-                clxm_transition_map     :: transition_map
+                token              :: token_creator(T),
+                state              :: state_no,
+                transition_map     :: transition_map
             ).
+:- inst compiled_lexeme(Inst)
+    --->    compiled_lexeme(Inst, ground, ground).
 
 :- type transition_map
     --->    transition_map(
-                trm_accepting_states    :: bitmap,
-                trm_rows                :: array(row)
+                accepting_states    :: bitmap,
+                rows                :: array(row)
             ).
 
     % A transition row is an array of byte_transitions.
@@ -59,12 +67,12 @@
     % an accepting state_no.
     %
 :- pred next_state(compiled_lexeme(T), state_no, char, state_no, bool).
-:- mode next_state(in, in, in, out, out) is semidet.
+:- mode next_state(in(live_lexeme), in, in, out, out) is semidet.
 
     % Succeeds iff a compiled_lexeme is in an accepting state_no.
     %
-:- pred in_accepting_state(compiled_lexeme(T)).
-:- mode in_accepting_state(in) is semidet.
+:- pred in_accepting_state(live_lexeme(T)).
+:- mode in_accepting_state(in(live_lexeme)) is semidet.
 
 %----------------------------------------------------------------------------- %
 %----------------------------------------------------------------------------- %
@@ -77,7 +85,7 @@
 %------------------------------------------------------------------------------%
 
 compile_lexeme(Lexeme) = CompiledLexeme :-
-    Lexeme         = lexeme(Token, RegExp),
+    Lexeme         = (RegExp - TokenCreator),
     NFA            = remove_null_transitions(regexp_to_NFA(RegExp)),
     DFA            = convert_NFA_to_DFA(NFA),
     StartState     = DFA ^ smc_start_state,
@@ -87,7 +95,7 @@
     Accepting      = set_accepting_states(StopStates, bitmap__new(N, no)),
     Rows           = array(set_up_rows(0, N, Transitions)),
     TransitionMap  = transition_map(Accepting, Rows),
-    CompiledLexeme = compiled_lexeme(Token, StartState, TransitionMap).
+    CompiledLexeme = compiled_lexeme(TokenCreator, StartState, TransitionMap).
 
 %------------------------------------------------------------------------------%
 
@@ -157,9 +165,11 @@
 %------------------------------------------------------------------------------%
 
 next_state(CLXM, CurrentState, Char, NextState, IsAccepting) :-
-    Rows            = CLXM ^ clxm_transition_map ^ trm_rows,
-    AcceptingStates = CLXM ^ clxm_transition_map ^ trm_accepting_states,
-    find_next_state(char__to_int(Char), Rows ^ elem(CurrentState), NextState),
+    Rows            = CLXM ^ transition_map ^ rows,
+    AcceptingStates = CLXM ^ transition_map ^ accepting_states,
+    find_next_state(char__to_int(Char),
+				 Rows ^ elem(CurrentState),
+				 NextState),
     IsAccepting     = bitmap__get(AcceptingStates, NextState).
 
 %------------------------------------------------------------------------------%
@@ -189,7 +199,7 @@
 
 in_accepting_state(CLXM) :-
     bitmap__is_set(
-        CLXM ^ clxm_transition_map ^ trm_accepting_states, CLXM ^ clxm_state
+        CLXM ^ transition_map ^ accepting_states, CLXM ^ state
     ).
 
 %------------------------------------------------------------------------------%
Index: lex.m
===================================================================
RCS file: /home/mercury1/repository/mercury/extras/lex/lex.m,v
retrieving revision 1.1
diff -u -r1.1 lex.m
--- lex.m	2001/02/21 16:29:36	1.1
+++ lex.m	2001/08/14 10:25:54
@@ -1,15 +1,19 @@
 %----------------------------------------------------------------------------- %
+% vim: ts=4 sw=4 et tw=0 wm=0 ff=unix
+%
 % lex.m
 % Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
 % Sun Aug 20 09:08:46 BST 2000
-% vim: ts=4 sw=4 et tw=0 wm=0 ff=unix
-%
-% This module puts everything together, compiling a list of lexemes
-% into state machines and turning the input stream into a token stream.
-%
 %   THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
 %   BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
 %   BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
+% Thu Jul 26 07:45:47 UTC 2001
+% Copyright (C) 2001 The Rationalizer Intelligent Software AG
+%   The changes made by Rationalizer are contributed under the terms 
+%   of the GNU Lesser General Public License.
+%
+% This module puts everything together, compiling a list of lexemes
+% into state machines and turning the input stream into a token stream.
 %
 %----------------------------------------------------------------------------- %
 
@@ -18,37 +22,28 @@
 :- interface.
 
 :- import_module std_util, string, char, list, io.
-
-
 
-:- type annotated_lexeme(Token)
-    ==      lexeme(annotated_token(Token)).
+:- type token_creator(Token)
+    ==                        (func(string)=Token).
+:- inst token_creator
+    ==                        (func(in)=out is det).
 
 :- type lexeme(Token)
-    --->    lexeme(
-                lxm_token               :: Token,
-                lxm_regexp              :: regexp
-            ).
-
-:- type annotated_token(Token)
-    --->    noval(Token)                % Just return ok(Token) on success.
-    ;       value(Token)                % Return ok(Token, String) on success.
-    ;       ignore.                     % Just skip over these tokens.
+    ==                        pair(regexp, token_creator(Token)).
 
+:- inst lexeme(Inst)
+    ---> (ground - Inst).
+
 :- type lexer(Token, Source).
 :- inst lexer
-    ==      bound(lexer(ground, read_pred)).
+    ---> lexer(
+	       ground,
+	       ignore_pred,
+	       read_pred
+	      ).
 
 :- type lexer_state(Token, Source).
 
-:- type lexer_result(Token)
-    --->    ok(Token)                   % Noval token matched.
-    ;       ok(Token, string)           % Value token matched.
-    ;       eof                         % End of input.
-    ;       error(int).                 % No matches for string at this offset.
-
-
-
 :- type offset
     ==      int.                        % Byte offset into the source data.
 
@@ -66,27 +61,45 @@
 :- inst read_pred
     ==      ( pred(in, out, di, uo) is det ).
 
-
+    % ignore_pred(Token): succeeds if Token is to be ignored
+    %
+:- type ignore_pred(Tok)
+    ==      pred(Tok).
+:- inst ignore_pred
+    ==      ( pred(in) is semidet ).
 
     % The type of regular expressions.
     %
-:- type regexp
-    --->    null                % The empty regexp
-    ;       atom(char)          % Match a single char
-    ;       (regexp >> regexp)  % Concatenation
-    ;       (regexp \/ regexp)  % Alternation
-    ;       star(regexp)        % Kleene closure
-    .
+:- type regexp.
 
+    % The typeclass for types having a natural converter to regexp's
+    %
+:- typeclass regexp(T) where [
+           func re(T) = regexp
+].
 
+    % Handling regexp's based on the typeclass regexp(T)
+    %
+:- func null       = regexp.
+:- func T1 ++ T2   = regexp  <= (regexp(T1), regexp(T2)).
+    % one of the following two functions will be declared deprecated
+    % later on; we just have to decide which one
+:- func T1 \/ T2   = regexp  <= (regexp(T1), regexp(T2)).
+:- func (T1 or T2) = regexp  <= (regexp(T1), regexp(T2)).
+:- func *(T)       = regexp  <= regexp(T).
 
+    % Some instances of typeclass regexp(T)
+    %
+:- instance regexp(regexp).
+:- instance regexp(char).
+:- instance regexp(string).
+
     % Some basic non-primitive regexps.
     %
-:- func str(string) = regexp.   % str("abc") = atom(a) >> atom(b) >> atom(c)
-:- func any(string) = regexp.   % any("abc") = atom(a) \/ atom(b) \/ atom(c)
-:- func anybut(string) = regexp.% anybut("abc") is complement of any("abc")
-:- func opt(regexp) = regexp.   % opt(R)     = R \/ null
-:- func plus(regexp) = regexp.  % plus(R)    = R \/ star(R)
+:- func any(string) = regexp.        % any("abc") = ('a') \/ ('b') \/ ('c')
+:- func anybut(string) = regexp.     % anybut("abc") is complement of any("abc")
+:- func ?(T) = regexp <= regexp(T).  % ?(R)     = R \/ null
+:- func +(T) = regexp <= regexp(T).  % +(R)    = R \/ *(R)
 
     % Some useful single-char regexps.
     %
@@ -95,34 +108,54 @@
 :- func upper = regexp.         % upper      = any("ABC...Z")
 :- func alpha = regexp.         % alpha      = lower \/ upper
 :- func alphanum = regexp.      % alphanum   = alpha \/ digit
-:- func identstart = regexp.    % identstart = alpha \/ str("_")
-:- func ident = regexp.         % ident      = alphanum \/ str("_")
-:- func nl = regexp.            % nl         = str("\n")
-:- func tab = regexp.           % tab        = str("\t")
-:- func spc = regexp.           % spc        = str(" ")
+:- func identstart = regexp.    % identstart = alpha \/ "_"
+:- func ident = regexp.         % ident      = alphanum \/ "_"
+:- func nl = regexp.            % nl         = re("\n")
+:- func tab = regexp.           % tab        = re("\t")
+:- func spc = regexp.           % spc        = re(" ")
 :- func wspc = regexp.          % wspc       = any(" \t\n\r\f\v")
 :- func dot = regexp.           % dot        = any("<except \n>")
 
     % Some useful compound regexps.
     %
-:- func nat = regexp.           % nat        = plus(digit)
-:- func signed_int = regexp.    % signed_int = opt(any("+-")) >> nat
+:- func nat = regexp.           % nat        = +(digit)
+:- func signed_int = regexp.    % signed_int = ?(any("+-")) ++ nat
 :- func real = regexp.          % real       = \d+((.\d+([eE]int)?)|[eE]int)
-:- func identifier = regexp.    % identifier = identstart >> star(ident)
-:- func whitespace = regexp.    % whitespace = star(wspc)
-:- func junk = regexp.          % junk       = star(dot)
-
+:- func identifier = regexp.    % identifier = identstart ++ *(ident)
+:- func whitespace = regexp.    % whitespace = *(wspc)
+:- func junk = regexp.          % junk       = *(dot)
+
+   % Utility predicate to create ignore_pred's.
+   % Use it in the form `ignore(my_token)' to ignore just `my_token'.
+   % 
+:- pred ignore(Token::in, Token::in) is semidet.
+
+   % Utility function to return noval tokens.
+   % Use it in the form `return(my_token)' inside a lexeme definition.
+   %
+:- func return(T,string) = T.
+
+   % Utility operator to create lexemes.
+:- func (T1 -> token_creator(Tok)) = pair(regexp,token_creator(Tok))
+	<= regexp(T1).
 
-
     % Construct a lexer from which we can generate running
     % instances.
     %
-:- func init(list(annotated_lexeme(Tok)), read_pred(Src)) = lexer(Tok, Src).
+:- func init(list(lexeme(Tok)), read_pred(Src)) = lexer(Tok, Src).
 :- mode init(in, in(read_pred)) = out(lexer) is det.
 
-    % Handy read predicates.
+    % Construct a lexer from which we can generate running
+    % instances. If we construct a lexer with init/4, we
+    % can additionally ignore specific tokens.
     %
-:- pred read_from_stdin(offset, read_result, io__state, io__state).
+:- func init(list(lexeme(Tok)), read_pred(Src), ignore_pred(Tok)) =
+	lexer(Tok, Src).
+:- mode init(in, in(read_pred), in(ignore_pred)) = out(lexer) is det.
+
+    % Handy read predicates.
+    %
+:- pred read_from_stdin(offset, read_result, io, io).
 :- mode read_from_stdin(in, out, di, uo) is det.
 
 :- pred read_from_string(offset, read_result, string, string).
@@ -135,7 +168,8 @@
 :- func start(lexer(Tok, Src), Src) = lexer_state(Tok, Src).
 :- mode start(in(lexer), di) = uo is det.
 
-:- pred read(lexer_result(Tok), lexer_state(Tok, Src), lexer_state(Tok, Src)).
+:- pred read(io__read_result(Tok), lexer_state(Tok, Src),
+	     lexer_state(Tok, Src)).
 :- mode read(out, di, uo) is det.
 
     % Stop a running instance of a lexer and retrieve the input source.
@@ -143,7 +177,7 @@
 :- func stop(lexer_state(_Tok, Src)) = Src.
 :- mode stop(di) = uo is det.
 
-    % Sometimes (e.g. when lexing the io__state) you want access to the
+    % Sometimes (e.g. when lexing the io__io) you want access to the
     % input stream without interrupting the lexing process.  This pred
     % provides that sort of access.
     %
@@ -170,39 +204,64 @@
 
 :- type lexer(Token, Source)
     --->    lexer(
-                lex_compiled_lexemes    :: list(live_lexeme(Token)),
-                lex_buf_read_pred       :: read_pred(Source)
-            ).
-
-
+		  lex_compiled_lexemes    :: list(live_lexeme(Token)),
+		  lex_ignore_pred         :: ignore_pred(Token),
+		  lex_buf_read_pred       :: read_pred(Source)
+		 ).
 
 :- type lexer_instance(Token, Source)
     --->    lexer_instance(
-                lexi_init_lexemes       :: list(live_lexeme(Token)),
-                lexi_live_lexemes       :: list(live_lexeme(Token)),
-                lexi_current_winner     :: winner(Token),
-                lexi_buf_state          :: buf_state(Source)
-            ).
-:- inst lexer_instance
-    ==      bound(lexer_instance(ground, ground, ground, buf_state)).
-
+			   init_lexemes       :: list(live_lexeme(Token)),
+			   live_lexemes       :: list(live_lexeme(Token)),
+			   current_winner     :: winner(Token),
+			   buf_state          :: buf_state(Source),
+			   ignore_pred        :: ignore_pred(Token)
+			  ).
 
+:- inst lexer_instance
+    --->      lexer_instance(
+			     live_lexeme_list, 
+                             live_lexeme_list, 
+	                     winner, 
+                             buf__buf_state,
+			     ignore_pred
+			    ).
 
 :- type live_lexeme(Token)
-    ==      compiled_lexeme(annotated_token(Token)).
+    ==      compiled_lexeme(Token).
+:- inst live_lexeme
+    ==      compiled_lexeme(token_creator).
+:- inst live_lexeme_list
+    ==      list__list_skel(live_lexeme).
 
 
 
 :- type winner(Token)
-    ==      maybe(pair(annotated_token(Token), offset)).
+    ==      maybe(pair(token_creator(Token), offset)).
+:- inst winner
+    ---> yes(bound(token_creator - ground))
+    ;    no.
 
 %----------------------------------------------------------------------------- %
+ignore(Tok,Tok).
 
-init(Lexemes, BufReadPred) = lexer(CompiledLexemes, BufReadPred) :-
-    CompiledLexemes = list__map(lexeme__compile_lexeme, Lexemes).
+%----------------------------------------------------------------------------- %
+return(Token, _) = Token.
 
 %----------------------------------------------------------------------------- %
+(R1 -> TC) = (re(R1) - TC).
 
+%----------------------------------------------------------------------------- %
+
+init(Lexemes, BufReadPred) = init(Lexemes, BufReadPred, DontIgnoreAnything) :-
+    DontIgnoreAnything = ( pred(_::in) is semidet :- semidet_fail ).
+
+init(Lexemes, BufReadPred, IgnorePred) =
+    lexer(CompiledLexemes, IgnorePred, BufReadPred) :-
+    CompiledLexemes = list__map(compile_lexeme, Lexemes).
+
+%----------------------------------------------------------------------------- %
+
 start(Lexer, Src) = LexerState :-
     init_lexer_instance(Lexer, Instance, Buf),
     LexerState = args_lexer_state(Instance, Buf, Src).
@@ -215,7 +274,9 @@
 init_lexer_instance(Lexer, Instance, Buf) :-
     buf__init(Lexer ^ lex_buf_read_pred, BufState, Buf),
     InitLexemes = Lexer ^ lex_compiled_lexemes,
-    Instance    = lexer_instance(InitLexemes, InitLexemes, no, BufState).
+    IgnorePred  = Lexer ^ lex_ignore_pred,
+    Instance    = lexer_instance(InitLexemes, InitLexemes, no,
+				 BufState, IgnorePred).
 
 %----------------------------------------------------------------------------- %
 
@@ -231,7 +292,7 @@
 
 
 
-:- pred read_0(lexer_result(Tok),
+:- pred read_0(io__read_result(Tok),
             lexer_instance(Tok, Src), lexer_instance(Tok, Src),
             buf, buf, Src, Src).
 :- mode read_0(out,
@@ -243,7 +304,7 @@
     %
 read_0(Result, Instance0, Instance, Buf0, Buf, Src0, Src) :-
 
-    BufState0    = Instance0 ^ lexi_buf_state,
+    BufState0    = Instance0 ^ buf_state,
 
     buf__read(BufReadResult, BufState0, BufState1, Buf0, Buf1, Src0, Src1),
     (
@@ -259,7 +320,7 @@
 
 %----------------------------------------------------------------------------- %
 
-:- pred process_char(lexer_result(Tok), char,
+:- pred process_char(io__read_result(Tok), char,
             lexer_instance(Tok, Src), lexer_instance(Tok, Src),
             buf_state(Src), buf, buf, Src, Src).
 :- mode process_char(out, in, in(lexer_instance), out(lexer_instance),
@@ -268,8 +329,8 @@
 process_char(Result, Char, Instance0, Instance,
         BufState, Buf0, Buf, Src0, Src) :-
 
-    LiveLexemes0 = Instance0 ^ lexi_live_lexemes,
-    Winner0      = Instance0 ^ lexi_current_winner,
+    LiveLexemes0 = Instance0 ^ live_lexemes,
+    Winner0      = Instance0 ^ current_winner,
 
     advance_live_lexemes(Char, buf__cursor_offset(BufState),
             LiveLexemes0, LiveLexemes, Winner0, Winner),
@@ -282,44 +343,44 @@
         LiveLexemes = [_ | _],          % Still some open possibilities.
 
         Instance1 = (((Instance0
-                            ^ lexi_live_lexemes   := LiveLexemes)
-                            ^ lexi_current_winner := Winner)
-                            ^ lexi_buf_state      := BufState),
+                            ^ live_lexemes   := LiveLexemes)
+                            ^ current_winner := Winner)
+                            ^ buf_state      := BufState),
         read_0(Result, Instance1, Instance, Buf0, Buf, Src0, Src)
     ).
 
 %----------------------------------------------------------------------------- %
 
-:- pred process_any_winner(lexer_result(Tok), winner(Tok),
+:- pred process_any_winner(io__read_result(Tok), winner(Tok),
             lexer_instance(Tok, Src), lexer_instance(Tok, Src), 
             buf_state(Src), buf, buf, Src, Src).
-:- mode process_any_winner(out, in,
+:- mode process_any_winner(out, in(winner),
             in(lexer_instance), out(lexer_instance),
             in(buf_state), array_di, array_uo, di, uo) is det.
 
-process_any_winner(Result, yes(ATok - Offset), Instance0, Instance,
+process_any_winner(Result, yes(TokenCreator - Offset), Instance0, Instance,
         BufState0, Buf0, Buf, Src0, Src) :-
 
     BufState1 = buf__rewind_cursor(Offset, BufState0),
     Instance1 = ((( Instance0
-                        ^ lexi_live_lexemes   := Instance0 ^ lexi_init_lexemes)
-                        ^ lexi_current_winner := no)
-                        ^ lexi_buf_state      := buf__commit(BufState1)),
+                        ^ live_lexemes   := Instance0 ^ init_lexemes)
+                        ^ current_winner := no)
+                        ^ buf_state      := buf__commit(BufState1)),
     (
-        ATok     = noval(Token),
-        Result   = ok(Token),
-        Instance = Instance1,
-        Buf      = Buf0,
-        Src      = Src0
-    ;
-        ATok     = value(Token),
-        Result   = ok(Token, buf__string_to_cursor(BufState1, Buf)),
-        Instance = Instance1,
-        Buf      = Buf0,
-        Src      = Src0
-    ;
-        ATok     = ignore,
-        read_0(Result, Instance1, Instance, Buf0, Buf, Src0, Src)
+      if
+
+         get_token_from_buffer(BufState1, Buf0, Instance0,
+			       TokenCreator, Token)
+      then
+    
+         Result   = ok(Token),
+         Instance = Instance1,
+         Buf      = Buf0,
+         Src      = Src0
+    
+      else
+    
+         read_0(Result, Instance1, Instance, Buf0, Buf, Src0, Src)
     ).
 
 process_any_winner(Result, no, Instance0, Instance,
@@ -327,16 +388,16 @@
 
     Start     = buf__start_offset(BufState0),
     BufState1 = buf__rewind_cursor(Start + 1, BufState0),
-    Result    = error(Start),
+    Result    = error("input not matched by any regexp", Start),
     Instance  = ((( Instance0
-                        ^ lexi_live_lexemes   :=
-                                Instance0 ^ lexi_init_lexemes)
-                        ^ lexi_current_winner := no)
-                        ^ lexi_buf_state      := buf__commit(BufState1)).
+                        ^ live_lexemes   :=
+                                Instance0 ^ init_lexemes)
+                        ^ current_winner := no)
+                        ^ buf_state      := buf__commit(BufState1)).
 
 %----------------------------------------------------------------------------- %
 
-:- pred process_eof(lexer_result(Tok),
+:- pred process_eof(io__read_result(Tok),
             lexer_instance(Tok, Src), lexer_instance(Tok, Src),
             buf_state(Src), buf).
 :- mode process_eof(out, in(lexer_instance), out(lexer_instance),
@@ -346,44 +407,63 @@
 
     ( if
 
-        live_lexeme_in_accepting_state(Instance0 ^ lexi_live_lexemes, ATok)
+        live_lexeme_in_accepting_state(Instance0 ^ live_lexemes, TokenCreator)
 
       then
 
             % Return the token and set things up so that we return
             % eof next.
-        (
-            ATok   = noval(Token),
-            Result = ok(Token)
-        ;
-            ATok   = value(Token),
-            Result = ok(Token, buf__string_to_cursor(BufState, Buf))
-        ;
-            ATok   = ignore,
-            Result = eof
-        )
+            %
+	    (
+	      if
+	    
+	         get_token_from_buffer(BufState, Buf, Instance0,
+				       TokenCreator, Token)
+	      then
+	    
+	         Result = ok(Token)
+	    
+	      else
 
+	         Result = eof
+	    )
+
       else
 
         Result     = eof
     ),
     Instance  = ((Instance0
-                        ^ lexi_live_lexemes := [])
-                        ^ lexi_buf_state    := buf__commit(BufState)).
+                        ^ live_lexemes := [])
+                        ^ buf_state    := buf__commit(BufState)).
+
+%----------------------------------------------------------------------------- %
+
+:- pred get_token_from_buffer(buf_state(Src), buf, lexer_instance(Tok, Src),
+			      token_creator(Tok), Tok).
+:- mode get_token_from_buffer(in(buf_state), array_ui, in(lexer_instance),
+			      in(token_creator), out) is semidet.
+
+get_token_from_buffer(BufState, Buf, Instance, TokenCreator, Token) :-
+    Match     = buf__string_to_cursor(BufState, Buf),
+    Token     = TokenCreator(Match),
+    IgnorePred = Instance ^ ignore_pred,
+    \+ IgnorePred(Token).
 
 %----------------------------------------------------------------------------- %
 
 :- pred advance_live_lexemes(char, offset,
             list(live_lexeme(Token)), list(live_lexeme(Token)),
             winner(Token), winner(Token)).
-:- mode advance_live_lexemes(in, in, in, out, in, out) is det.
+:- mode advance_live_lexemes(in, in, in(live_lexeme_list), 
+                             out(live_lexeme_list), 
+			     in(winner), out(winner)) is det.
 
 advance_live_lexemes(_Char, _Offset, [], [], Winner, Winner).
 
 advance_live_lexemes(Char, Offset, [L0 | Ls0], Ls, Winner0, Winner) :-
 
-    State0        = L0 ^ clxm_state,
-    ATok          = L0 ^ clxm_token,
+    State0        = L0 ^ state,
+    ATok          = L0 ^ token,
 
     ( if next_state(L0, State0, Char, State, IsAccepting) then
 
@@ -395,7 +475,7 @@
             Winner1     = yes(ATok - Offset)
         ),
         advance_live_lexemes(Char, Offset, Ls0, Ls1, Winner1, Winner),
-        Ls = [( L0 ^ clxm_state := State ) | Ls1]
+        Ls = [( L0 ^ state := State ) | Ls1]
 
       else
 
@@ -405,15 +485,17 @@
 %----------------------------------------------------------------------------- %
 
 :- pred live_lexeme_in_accepting_state(list(live_lexeme(Tok)),
-                annotated_token(Tok)).
-:- mode live_lexeme_in_accepting_state(in, out) is semidet.
+                token_creator(Tok)).
+:- mode live_lexeme_in_accepting_state(in(live_lexeme_list), 
+	                  out(token_creator)) is semidet.
 
 live_lexeme_in_accepting_state([L | Ls], Token) :-
     ( if in_accepting_state(L)
-      then Token = L ^ clxm_token
+      then Token = L ^ token
       else live_lexeme_in_accepting_state(Ls, Token)
     ).
 
+
 %----------------------------------------------------------------------------- %
 %----------------------------------------------------------------------------- %
 
@@ -423,10 +505,10 @@
 
 :- type lexer_state(Tok, Src)
     --->    lexer_state(
-                lxr_instance            :: lexer_instance(Tok, Src),
-                lxr_buf                 :: buf,
-                lxr_src                 :: Src
-            ).
+			run                 :: lexer_instance(Tok, Src),
+			buf                 :: buf,
+			src                 :: Src
+		       ).
 
 %------------------------------------------------------------------------------%
 
@@ -471,17 +553,49 @@
         Result = eof
     ).
 
+
 %----------------------------------------------------------------------------- %
-% Some basic non-primitive regexps.
+% The type of regular expressions.
+:- type regexp
+    --->    eps                    % The empty regexp
+    ;       atom(char)             % Match a single char
+    ;       conc(regexp,regexp)    % Concatenation
+    ;       alt(regexp, regexp)    % Alternation
+    ;       star(regexp)           % Kleene closure
+    .
 
-str(S) = R :-
+%----------------------------------------------------------------------------- %
+% Some instances of typeclass regexp(T)
+:- instance regexp(regexp) where [
+	  re(RE) = RE
+].
+
+:- instance regexp(char) where [
+	  re(C) = atom(C)
+].
+
+:- instance regexp(string) where [
+	  re(S) =  R :-
     ( if S = "" then
         R = null
       else
         L = string__length(S),
         C = string__index_det(S, L - 1),
-        R = str_foldr(func(Cx, Rx) = (atom(Cx) >> Rx), S, atom(C), L - 2)
-    ).
+        R = str_foldr(func(Cx, Rx) = (Cx ++ Rx), S, re(C), L - 2)
+    )
+].
+
+
+%----------------------------------------------------------------------------- %
+% Basic primitive regexps.
+null = eps.
+R1 ++ R2   = conc(re(R1), re(R2)).
+R1 \/ R2   = alt(re(R1), re(R2)).
+(R1 or R2) = alt(re(R1), re(R2)).
+*(R1)      = star(re(R1)).
+
+%----------------------------------------------------------------------------- %
+% Some basic non-primitive regexps.
 
 any(S) = R :-
     ( if S = "" then
@@ -489,7 +603,7 @@
       else
         L = string__length(S),
         C = string__index_det(S, L - 1),
-        R = str_foldr(func(Cx, Rx) = (atom(Cx) \/ Rx), S, atom(C), L - 2)
+        R = str_foldr(func(Cx, Rx) = (Cx \/ Rx), S, re(C), L - 2)
     ).
 
 anybut(S0) = R :-
@@ -511,9 +625,9 @@
                else str_foldr(Fn, S, Fn(string__index_det(S, I), X), I - 1)
     ).
 
-opt(R)  = (R \/ null).
+?(R)  = (R \/ null).
 
-plus(R) = (R >> star(R)).
++(R) = (R ++ *(R)).
 
 %----------------------------------------------------------------------------- %
 % Some useful single-char regexps.
@@ -535,24 +649,24 @@
 dot        = anybut("\n").
 alpha      = (lower \/ upper).
 alphanum   = (alpha \/ digit).
-identstart = (alpha \/ atom('_')).
-ident      = (alphanum \/ atom('_')).
-nl         = atom('\n').
-tab        = atom('\t').
-spc        = atom(' ').
+identstart = (alpha \/ ('_')).
+ident      = (alphanum \/ ('_')).
+nl         = re('\n').
+tab        = re('\t').
+spc        = re(' ').
 
 %----------------------------------------------------------------------------- %
 % Some useful compound regexps.
 
-nat        = plus(digit).
-signed_int = (opt(any("+-")) >> nat).
-real       = (signed_int >> (
-                (atom('.') >> nat >> opt(any("eE") >> signed_int)) \/
-                (any("eE") >> signed_int)
+nat        = +(digit).
+signed_int = (?(any("+-")) ++ nat).
+real       = (signed_int ++ (
+                ('.' ++ nat ++ ?(any("eE") ++ signed_int)) \/
+                (any("eE") ++ signed_int)
              )).
-identifier = (identstart >> star(ident)).
-whitespace = star(wspc).
-junk       = star(dot).
+identifier = (identstart ++ *(ident)).
+whitespace = *(wspc).
+junk       = *(dot).
 
 %----------------------------------------------------------------------------- %
 %----------------------------------------------------------------------------- %
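
To make the new lexeme interface concrete for reviewers, here is a minimal
sketch of a lexeme list written against the revised API (the token type
`tok' and the function name are invented for illustration; the combinators
`++', `*', `+' and the `-' wrapper are the ones introduced by this diff):

```mercury
:- type tok
    --->    word(string)
    ;       num(int).

:- func example_lexemes = list(lexeme(tok)).

example_lexemes = [
        % `++' is concatenation, `*' Kleene closure; `-' pairs a
        % regexp with a token creator applied to the matched string.
    (identstart ++ *(ident)) - (func(Match) = word(Match)),
        % `+' is one-or-more repetitions.
    (+(digit))               - (func(Match) = num(string__det_to_int(Match)))
].
```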
Index: lex.regexp.m
===================================================================
RCS file: /home/mercury1/repository/mercury/extras/lex/lex.regexp.m,v
retrieving revision 1.1
diff -u -r1.1 lex.regexp.m
--- lex.regexp.m	2001/02/21 16:29:37	1.1
+++ lex.regexp.m	2001/08/14 10:25:54
@@ -1,15 +1,19 @@
 %----------------------------------------------------------------------------- %
-% lex.regexp.m
-% Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
-% Fri Aug 18 06:43:09 BST 2000
 % vim: ts=4 sw=4 et tw=0 wm=0 ff=unix
-%
-% Converts basic regular expressions into non-deterministic finite
-% automata (NFAs).
 %
+% lex.regexp.m
+% Fri Aug 18 06:43:09 BST 2000
+% Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
 %   THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
 %   BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
 %   BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
+% Thu Jul 26 07:45:47 UTC 2001
+% Copyright (C) 2001 The Rationalizer Intelligent Software AG
+%   The changes made by Rationalizer are contributed under the terms 
+%   of the GNU Lesser General Public License.
+%
+% Converts basic regular expressions into non-deterministic finite
+% automata (NFAs).
 %
 %----------------------------------------------------------------------------- %
 
@@ -53,16 +57,16 @@
 
     % The primitive regexps.
 
-compile(X, null,       Y, [null(X, Y)]) --> [].
+compile(X, eps,        Y, [null(X, Y)]) --> [].
 
 compile(X, atom(C),    Y, [trans(X, C, Y)]) --> [].
 
-compile(X, (RA >> RB), Y, TsA ++ TsB) -->
+compile(X, conc(RA,RB), Y, TsA ++ TsB) -->
     counter__allocate(Z),
     compile(X, RA, Z, TsA),
     compile(Z, RB, Y, TsB).
 
-compile(X, (RA \/ RB), Y, TsA ++ TsB) -->
+compile(X, alt(RA,RB), Y, TsA ++ TsB) -->
     compile(X, RA, Y, TsA),
     compile(X, RB, Y, TsB).
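
For reviewers checking the lex.regexp.m changes, it may help to see how a
combinator expression unfolds into the now-hidden term type before
compile/6 runs (a hypothetical trace, derived from the instances above):

```mercury
    % re("ab" ++ ('c' or 'd'))
    %
    % unfolds, via the string and char instances of regexp(T), to
    %
    %   conc(conc(atom('a'), atom('b')),
    %        alt(atom('c'), atom('d')))
    %
    % which compile/6 then translates into null/trans NFA edges.
```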
 
Index: samples/demo.m
===================================================================
RCS file: /home/mercury1/repository/mercury/extras/lex/samples/demo.m,v
retrieving revision 1.1
diff -u -r1.1 demo.m
--- samples/demo.m	2001/02/21 16:29:39	1.1
+++ samples/demo.m	2001/08/14 10:25:54
@@ -1,13 +1,18 @@
 %----------------------------------------------------------------------------- %
 % demo.m
-% Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
 % Sun Aug 20 18:11:42 BST 2000
-% vim: ts=4 sw=4 et tw=0 wm=0 ff=unix ft=mercury
-%
+% Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
 %   THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
 %   BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
 %   BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
+% Thu Jul 26 07:45:47 UTC 2001
+% Copyright (C) 2001 The Rationalizer Intelligent Software AG
+%   The changes made by Rationalizer are contributed under the terms 
+%   of the GNU General Public License - see the file COPYING in the
+%   Mercury Distribution.
 %
+% vim: ts=4 sw=4 et tw=0 wm=0 ff=unix ft=mercury
+%
 %----------------------------------------------------------------------------- %
 
 :- module demo.
@@ -23,7 +28,7 @@
 
 :- implementation.
 
-:- import_module char, string, exception, array, list.
+:- import_module char, string, exception, array, list, std_util, int.
 :- import_module lex.
 
 %----------------------------------------------------------------------------- %
@@ -40,7 +45,7 @@
 
 "),
 
-    { Lexer  = lex__init(lexemes, lex__read_from_stdin) },
+    { Lexer  = lex__init(lexemes, lex__read_from_stdin, ignore(space)) },
     call((pred(IO0::di, IO::uo) is det :-
             State0 = lex__start(Lexer, IO0),
             tokenise_stdin(State0, State),
@@ -66,48 +71,53 @@
 %----------------------------------------------------------------------------- %
 
 :- type token
-    --->    noun
-    ;       comment
-    ;       integer
-    ;       real
-    ;       verb
-    ;       conj
-    ;       prep
+    --->    noun(string)
+    ;       comment(string)
+    ;       integer(int)
+    ;       real(float)
+    ;       verb(string)
+    ;       conj(string)
+    ;       prep(string)
     ;       punc
+    ;       space
     .
 
-:- func lexemes = list(annotated_lexeme(token)).
+:- func lexemes = list(lexeme(token)).
 
 lexemes = [
-    lexeme(value(comment),
-			(atom('%') >> junk)),
-
-    lexeme(value(integer),
-			(signed_int)),
-
-    lexeme(value(real),
-            (real)),
-
-    lexeme(value(noun), str("cat")),
-    lexeme(value(noun), str("dog")),
-    lexeme(value(noun), str("rat")),
-    lexeme(value(noun), str("mat")),
-
-    lexeme(value(verb),
-			(str("sat") \/ str("caught") \/ str("chased"))),
+    ("%" ++ junk)        -  (func(Match) = comment(Match)) ,
+    (signed_int)         -  (func(Match) = integer(string__det_to_int(Match))),
+    (real)               -  (func(Match) = real(det_string_to_float(Match))) ,
+    re("cat")            -  (func(Match) = noun(Match)) ,
+    re("dog")            -  (func(Match) = noun(Match)) ,
+    % When `->' is used instead of `-', there is no need to call `re':
+    "rat"                -> (func(Match) = noun(Match)) ,
+    "mat"                -> (func(Match) = noun(Match)) ,
+    % Here we use `or':
+    ("sat" or "caught" or "chased") -
+	                    (func(Match) = verb(Match)) ,
+    ("and" or "then")    -
+	                    (func(Match) = conj(Match)) ,
+    % Here we use `\/', which is the same as `or'.  We would like to
+    % know which of the two you think looks nicer.
+    ("the" \/ "it" \/ "them" \/ "to" \/ "on") -
+	                    (func(Match) = prep(Match)) ,
+    any("~!@#$%^&*()_+`-={}|[]\\:"";'<>?,./") -
+                            return(punc) ,
+	
+    whitespace           -  return(space)
+].
 
-    lexeme(value(conj),
-			(str("and") \/ str("then"))),
+:- func det_string_to_float(string) = float.
 
-    lexeme(value(prep),
-			(str("the") \/ str("it") \/ str("them") \/ str("to") \/ str("on"))),
+:- import_module require.
 
-    lexeme(noval(punc),
-			(any("~!@#$%^&*()_+`-={}|[]\\:"";'<>?,./"))),
+det_string_to_float(String) = Float :-
+    ( if string__to_float(String, Float0) then
+        Float = Float0
+      else
+        error("Floating point number overflow")
+    ).
 
-    lexeme(ignore,
-			whitespace)
-].
-
 %----------------------------------------------------------------------------- %
 %----------------------------------------------------------------------------- %
+
+
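
One consequence of tokens now carrying typed values is that the consumer
needs no string postprocessing.  A hypothetical helper over the demo's
token type shows the idea (the predicate name and the message strings are
invented for illustration):

```mercury
:- pred describe(token, string).
:- mode describe(in, out) is det.

    % Each token constructor already holds its value in the right type,
    % so the consumer just pattern-matches.
describe(noun(S),    D) :- string__append("noun: ",    S, D).
describe(comment(S), D) :- string__append("comment: ", S, D).
describe(verb(S),    D) :- string__append("verb: ",    S, D).
describe(conj(S),    D) :- string__append("conj: ",    S, D).
describe(prep(S),    D) :- string__append("prep: ",    S, D).
describe(integer(N), D) :-
    string__int_to_string(N, NS),
    string__append("integer: ", NS, D).
describe(real(F),    D) :-
    string__float_to_string(F, FS),
    string__append("real: ", FS, D).
describe(punc,  "punctuation").
describe(space, "whitespace").
```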