[m-rev.] For review: lex and moose changed

Holger Krug hkrug at rationalizer.com
Mon Aug 6 02:55:04 AEST 2001


Here is the implementation of what was discussed in this thread last
week.

On the following points I took the liberty of deviating from what was
discussed:

* I did not change the operators `\/' and `/\' to `or' and `++',
  because in my view the operators `\/' and `/\' greatly enhance
  the readability of regular expressions.
* I did not remove the prefix `lxr_'. It's necessary, because one
  field is named `lxr_instance'.
* I changed the determinism of `lex__read/3' to `cc_multi'. This was
  necessary to allow the user to throw exceptions within the
  token functions. According to Fergus' recommendations in
  mercury-users/mercury-users.0011/006{2,4}.html it would not be sound
  to use `promise_only_solution' in a situation like this, because the
  functions throwing the exceptions to be caught are defined by the
  user.
* The tokens to be ignored are given by a predicate of type `pred(Token)',
  not, as proposed by Ralph, by a value of type `maybe(Token)' (see the
  sketch below).
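
As a minimal sketch of this interface (along the lines of the README
example in the appended diff; the `token' type and its constructor
`space' are hypothetical):

    :- pred ignore_space(token::in) is semidet.
    ignore_space(space).

    Lexer = lex__init(Lexemes, lex__read_from_stdin, ignore_space)

Any token for which `ignore_space' succeeds is then skipped silently
by `lex__read/3'.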

The CVS log message should read as follows (a short illustrative
sketch of the new lexeme style follows the message):

* the value contained in a token may now belong to an arbitrary
  Mercury type, not just string as before
* lexemes may now contain a function to compute an arbitrary token
  value from the matched string
* defining lexemes is now slightly more difficult for the user,
  because the user has to supply higher-order functions
* the advantage is that no postprocessing is needed to evaluate
  the string, hence the interface to the consumer of the tokens is 
  cleaner
* determinism of lex__read/3 changed from `det' to `cc_multi' to allow
  the user to throw a string-typed exception inside a token function,
  from which an error token is formed. The exception is caught in
  the body of lex__process_any_winner/9. Following
  mercury-users/mercury-users.0011/0062.html and
  mercury-users/mercury-users.0011/0064.html it would not be sound to
  use `promise_only_solution', because the exception is user-defined
  and not controlled inside lex. The user might well provide a token
  function which throws more than one string-typed exception.
* function lex__init/3 added, which takes as its third argument a
  predicate that succeeds on the tokens which are to be ignored
* typeclass regexp(T) added; the definition of the regexp type is now
  hidden from the user
* function `->' added, which is a useful wrapper when defining lexemes
* for more information about these changes see the mercury-reviews
  thread starting at mercury-reviews/mercury-reviews.0107/0378.html and 
  continued at mercury-reviews/mercury-reviews.0108/0028.html
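
To illustrate the new lexeme style (including `->' and a token function
that throws), here is a short sketch in the spirit of samples/demo.m
from the diff; the `word' token type and the message text are invented
for the example:

    :- type token ---> word(string) ; punc.

    lexemes = [
        identifier    -  (func(Match) = word(Match)),
        % with `->' no explicit call to `re' is needed:
        "rat"         -> (func(Match) = word(Match)),
        % a string-typed exception, caught by lex:
        re("pragma")  -  (func(Match) = word(Match) :-
                             throw("no pragmas, please"))
    ].

If the last lexeme matches, lex__read/3 catches the string exception
and returns `error(Message, Offset)' instead of ok(Token).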

The diff is appended.

The corresponding diff for moose will follow soon. I still have to
change the determinism to `cc_multi'.

-- 
Holger Krug
hkrug at rationalizer.com
-------------- next part --------------
Index: README
===================================================================
RCS file: /home/mercury1/repository/mercury/extras/lex/README,v
retrieving revision 1.1
diff -u -r1.1 README
--- README	2001/02/21 16:29:36	1.1
+++ README	2001/08/05 16:36:40
@@ -1,12 +1,14 @@
 lex 1.0 (very alpha)
 Fri Aug 25 17:54:28  2000
 Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
-
     THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
     BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
     BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
-
-
+Sun Aug  5 16:15:27 UTC 2001
+Copyright (C) 2001 The Rationalizer Intelligent Software AG
+    The changes made by Rationalizer are contributed under the terms 
+    of the GNU Free Documentation License - see the file COPYING in the
+    Mercury Distribution.
 
 This package defines a lexer for Mercury.  There is plenty of scope
 for optimization, however it is reasonably efficient and does provide
@@ -23,25 +25,34 @@
 
     :- type token
         --->    comment
-        ;       id
-        ;       num.
+        ;       id(string)
+        ;       num(int)
+	;       space.
 
 3. Set up a list of annotated_lexemes.
 
     Lexemes = [
-        lexeme(noval(comment),      (atom('%'), star(dot))),
-        lexeme(value(id),           identifier),
-        lexeme(ignore,              whitespace)
+	('%' >> *(dot))         -  return(comment),
+	identifier              -  (func(Match) = id(Match)),
+	signed_int              -  string__det_to_int,
+	whitespace              -  space
     ]
 
-noval tokens are simply identified;
-value tokens are identified and returned with the string matched;
-ignore regexps are simply passed over.
+Lexemes are pairs. The first entry is a regular expression; the
+second is a token_creator, i.e. a function that computes a token
+from the string matched by the regular expression forming the
+first part of the lexeme.
 
 4. Set up a lexer with an appropriate read predicate (see the buf module).
 
     Lexer = lex__init(Lexemes, lex__read_from_stdin)
 
+    or:
+
+    :- pred ignore_space(token::in) is semidet.
+    ignore_space(space).
+    Lexer = lex__init(Lexemes, lex__read_from_stdin, ignore_space)
+     
 5. Obtain a live lexer state.
 
     State0 = lex__start(Lexer, IO0)
@@ -49,12 +60,21 @@
 6. Use it to lex the input stream.
 
     lex__read(Result, State0, State1),
-    ( Result = ok(NoValToken), ...
-    ; Result = ok(ValueToken, String), ...
-    ; Result = error(OffsetInInputStream), ...
+    ( Result = ok(Token), ...
+    ; Result = error(Message, OffsetInInputStream), ...
     ; Result = eof, ...
     )
 
+    Attention: The result type of lex__read is io__read_result(token).
+    io__read_result is documented in the library file io.m as:
+    :- type io__read_result(T)      --->    ok(T)
+                                    ;       eof
+                                    ;       error(string, int).
+                                            % error message, line number
+    In contrast to this, the `int' that lex returns in the case of an
+    error is not the line number but the character offset.
+    Hence be careful when processing lex errors.
+
 7. If you need to manipulate the source object, you can.
 
     lex__manipulate_source(io__print("Not finished yet?"), State1, State2)
@@ -69,7 +89,12 @@
 and the option to write out a compilable source file for the lexer.
 
 
+OPPORTUNITIES FOR MODULARIZATION
 
+1. Remove the regexp functionality from lex and put it into a
+   distinct regexp library.
+
+
 OPPORTUNITIES FOR OPTIMIZATION
 
 1. Move from chars to bytes.
@@ -79,3 +104,18 @@
 is decided by the first byte read in on a lexing pass.
 4. Implement state machine minimization (may or may not be worthwhile.)
 
+
+FEATURES TO ADD:
+
+1. Symbol table management (additional parameters for the user-defined
+   predicates containing the symbol table before and after processing
+   a lexeme)
+2. func (string) = regexp, where the function parameter contains a
+   regexp definition in a form like that used in Perl, awk, etc.
+3. line# as part of the offset
+4. the lexer-result type is changed to:
+   ok(token, token_position)
+   where token_position contains information like: offset, line nr, offset
+   in line, length in chars, length in lines, etc.
+5. interface to editors (emacs, vim) to allow moving the editor
+   cursor to the lexer position
Index: lex.buf.m
===================================================================
RCS file: /home/mercury1/repository/mercury/extras/lex/lex.buf.m,v
retrieving revision 1.1
diff -u -r1.1 lex.buf.m
--- lex.buf.m	2001/02/21 16:29:36	1.1
+++ lex.buf.m	2001/08/05 16:36:40
@@ -94,7 +94,7 @@
 
 :- interface.
 
-:- import_module int, array, char, bool, string, io.
+:- import_module array, char, bool, string.
 
 
 
@@ -171,7 +171,6 @@
 :- implementation.
 
 :- import_module exception.
-
 
 
     % The amount the buffer is grown by if (a) more space is
Index: lex.lexeme.m
===================================================================
RCS file: /home/mercury1/repository/mercury/extras/lex/lex.lexeme.m,v
retrieving revision 1.1
diff -u -r1.1 lex.lexeme.m
--- lex.lexeme.m	2001/02/21 16:29:36	1.1
+++ lex.lexeme.m	2001/08/05 16:36:40
@@ -1,16 +1,21 @@
 %-----------------------------------------------------------------------------
 % lex.lexeme.m
-% Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
 % Sat Aug 19 08:22:32 BST 2000
+% Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
+%   THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
+%   BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
+%   BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
+% Thu Jul 26 07:45:47 UTC 2001
+% Copyright (C) 2001 The Rationalizer Intelligent Software AG
+%   The changes made by Rationalizer are contributed under the terms 
+%   of the GNU Lesser General Public License.
+%
 % vim: ts=4 sw=4 et tw=0 wm=0 ff=unix
 %
 % A lexeme combines a token with a regexp.  The lexer compiles
 % lexemes and returns the longest successful parse in the input
 % stream or an error if no match occurs.
 %
-%   THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
-%   BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
-%   BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
 %
 %----------------------------------------------------------------------------- %
 
@@ -22,15 +27,17 @@
 
 :- type compiled_lexeme(T)
     --->    compiled_lexeme(
-                clxm_token              :: T,
-                clxm_state              :: state_no,
-                clxm_transition_map     :: transition_map
+                token              :: token_creator(T),
+                state              :: state_no,
+                transition_map     :: transition_map
             ).
+:- inst compiled_lexeme(Inst) --->
+            lex__lexeme__compiled_lexeme(Inst,ground,ground).
 
 :- type transition_map
     --->    transition_map(
-                trm_accepting_states    :: bitmap,
-                trm_rows                :: array(row)
+                accepting_states    :: bitmap,
+                rows                :: array(row)
             ).
 
     % A transition row is an array of byte_transitions.
@@ -59,12 +66,12 @@
     % an accepting state_no.
     %
 :- pred next_state(compiled_lexeme(T), state_no, char, state_no, bool).
-:- mode next_state(in, in, in, out, out) is semidet.
+:- mode next_state(in(live_lexeme), in, in, out, out) is semidet.
 
     % Succeeds iff a compiled_lexeme is in an accepting state_no.
     %
-:- pred in_accepting_state(compiled_lexeme(T)).
-:- mode in_accepting_state(in) is semidet.
+:- pred in_accepting_state(live_lexeme(T)).
+:- mode in_accepting_state(in(live_lexeme)) is semidet.
 
 %----------------------------------------------------------------------------- %
 %----------------------------------------------------------------------------- %
@@ -77,7 +84,7 @@
 %------------------------------------------------------------------------------%
 
 compile_lexeme(Lexeme) = CompiledLexeme :-
-    Lexeme         = lexeme(Token, RegExp),
+    Lexeme         = (RegExp - TokenCreator),
     NFA            = remove_null_transitions(regexp_to_NFA(RegExp)),
     DFA            = convert_NFA_to_DFA(NFA),
     StartState     = DFA ^ smc_start_state,
@@ -87,7 +94,7 @@
     Accepting      = set_accepting_states(StopStates, bitmap__new(N, no)),
     Rows           = array(set_up_rows(0, N, Transitions)),
     TransitionMap  = transition_map(Accepting, Rows),
-    CompiledLexeme = compiled_lexeme(Token, StartState, TransitionMap).
+    CompiledLexeme = compiled_lexeme(TokenCreator, StartState, TransitionMap).
 
 %------------------------------------------------------------------------------%
 
@@ -157,9 +164,9 @@
 %------------------------------------------------------------------------------%
 
 next_state(CLXM, CurrentState, Char, NextState, IsAccepting) :-
-    Rows            = CLXM ^ clxm_transition_map ^ trm_rows,
-    AcceptingStates = CLXM ^ clxm_transition_map ^ trm_accepting_states,
-    find_next_state(char__to_int(Char), Rows ^ elem(CurrentState), NextState),
+    Rows            = CLXM ^ transition_map ^ rows,
+    AcceptingStates = CLXM ^ transition_map ^ accepting_states,
+    lex__lexeme__find_next_state(char__to_int(Char), Rows ^ elem(CurrentState), NextState),
     IsAccepting     = bitmap__get(AcceptingStates, NextState).
 
 %------------------------------------------------------------------------------%
@@ -170,7 +177,7 @@
 find_next_state(Byte, ByteTransitions, State) :-
     Lo = array__min(ByteTransitions),
     Hi = array__max(ByteTransitions),
-    find_next_state_0(Lo, Hi, Byte, ByteTransitions, State).
+    lex__lexeme__find_next_state_0(Lo, Hi, Byte, ByteTransitions, State).
 
 
 
@@ -182,14 +189,14 @@
     ByteTransition = ByteTransitions ^ elem(Lo),
     ( if ByteTransition ^ btr_byte = Byte
       then State = ByteTransition ^ btr_state
-      else find_next_state_0(Lo + 1, Hi, Byte, ByteTransitions, State)
+      else lex__lexeme__find_next_state_0(Lo + 1, Hi, Byte, ByteTransitions, State)
     ).
 
 %------------------------------------------------------------------------------%
 
 in_accepting_state(CLXM) :-
     bitmap__is_set(
-        CLXM ^ clxm_transition_map ^ trm_accepting_states, CLXM ^ clxm_state
+        CLXM ^ transition_map ^ accepting_states, CLXM ^ state
     ).
 
 %------------------------------------------------------------------------------%
Index: lex.m
===================================================================
RCS file: /home/mercury1/repository/mercury/extras/lex/lex.m,v
retrieving revision 1.1
diff -u -r1.1 lex.m
--- lex.m	2001/02/21 16:29:36	1.1
+++ lex.m	2001/08/05 16:36:40
@@ -2,15 +2,19 @@
 % lex.m
 % Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
 % Sun Aug 20 09:08:46 BST 2000
+%   THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
+%   BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
+%   BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
+% Thu Jul 26 07:45:47 UTC 2001
+% Copyright (C) 2001 The Rationalizer Intelligent Software AG
+%   The changes made by Rationalizer are contributed under the terms 
+%   of the GNU Lesser General Public License.
+%
 % vim: ts=4 sw=4 et tw=0 wm=0 ff=unix
 %
 % This module puts everything together, compiling a list of lexemes
 % into state machines and turning the input stream into a token stream.
 %
-%   THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
-%   BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
-%   BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
-%
 %----------------------------------------------------------------------------- %
 
 :- module lex.
@@ -18,36 +22,19 @@
 :- interface.
 
 :- import_module std_util, string, char, list, io.
-
-
 
-:- type annotated_lexeme(Token)
-    ==      lexeme(annotated_token(Token)).
+:- type token_creator(Token) == (func(string)=Token).
+:- inst token_creator == (func(in)=out is det).
 
-:- type lexeme(Token)
-    --->    lexeme(
-                lxm_token               :: Token,
-                lxm_regexp              :: regexp
-            ).
+:- type lexeme(Token) == pair(regexp, token_creator(Token)).
 
-:- type annotated_token(Token)
-    --->    noval(Token)                % Just return ok(Token) on success.
-    ;       value(Token)                % Return ok(Token, String) on success.
-    ;       ignore.                     % Just skip over these tokens.
+:- inst lexeme(Inst) ---> ground - Inst.
 
 :- type lexer(Token, Source).
-:- inst lexer
-    ==      bound(lexer(ground, read_pred)).
-
-:- type lexer_state(Token, Source).
+:- inst lexer  ---> lex__lexer(ground, lex__ignore_pred, lex__read_pred).
 
-:- type lexer_result(Token)
-    --->    ok(Token)                   % Noval token matched.
-    ;       ok(Token, string)           % Value token matched.
-    ;       eof                         % End of input.
-    ;       error(int).                 % No matches for string at this offset.
 
-
+:- type lexer_state(Token, Source).
 
 :- type offset
     ==      int.                        % Byte offset into the source data.
@@ -65,28 +52,44 @@
     ==      pred(offset, read_result, T, T).
 :- inst read_pred
     ==      ( pred(in, out, di, uo) is det ).
-
 
+    % ignore_pred(Token): if it succeeds, Token must be ignored
+    %
+:- type ignore_pred(Tok)
+    ==      pred(Tok).
+:- inst ignore_pred
+    ==      ( pred(in) is semidet ).
 
     % The type of regular expressions.
     %
-:- type regexp
-    --->    null                % The empty regexp
-    ;       atom(char)          % Match a single char
-    ;       (regexp >> regexp)  % Concatenation
-    ;       (regexp \/ regexp)  % Alternation
-    ;       star(regexp)        % Kleene closure
-    .
+:- type regexp.
 
+    % The typeclass for types having a natural converter to regexp's
+    %
+:- typeclass regexp(T) where [
+           func re(T) = regexp
+].
+
+    % Handling regexp's based on the typeclass regexp(T)
+    %
+:- func null = regexp.
+:- func T1 >> T2 = regexp  <= (regexp(T1), regexp(T2)). 
+:- func T1 \/ T2 = regexp  <= (regexp(T1), regexp(T2)).
+:- func *(T1)    = regexp  <= regexp(T1).
 
+    % Some instances of typeclass regexp(T)
+    %
+:- instance regexp(regexp).
+:- instance regexp(char).
+:- instance regexp(string).
 
     % Some basic non-primitive regexps.
     %
-:- func str(string) = regexp.   % str("abc") = atom(a) >> atom(b) >> atom(c)
-:- func any(string) = regexp.   % any("abc") = atom(a) \/ atom(b) \/ atom(c)
+:- func str(string) = regexp.   % str("abc") = re(a) >> re(b) >> re(c)
+:- func any(string) = regexp.   % any("abc") = re(a) \/ re(b) \/ re(c)
 :- func anybut(string) = regexp.% anybut("abc") is complement of any("abc")
-:- func opt(regexp) = regexp.   % opt(R)     = R \/ null
-:- func plus(regexp) = regexp.  % plus(R)    = R \/ star(R)
+:- func ?(regexp) = regexp.   % ?(R)     = R \/ null
+:- func +(regexp) = regexp.  % +(R)    = R \/ *(R)
 
     % Some useful single-char regexps.
     %
@@ -105,23 +108,38 @@
 
     % Some useful compound regexps.
     %
-:- func nat = regexp.           % nat        = plus(digit)
-:- func signed_int = regexp.    % signed_int = opt(any("+-")) >> nat
+:- func nat = regexp.           % nat        = +(digit)
+:- func signed_int = regexp.    % signed_int = ?(any("+-")) >> nat
 :- func real = regexp.          % real       = \d+((.\d+([eE]int)?)|[eE]int)
-:- func identifier = regexp.    % identifier = identstart >> star(ident)
-:- func whitespace = regexp.    % whitespace = star(wspc)
-:- func junk = regexp.          % junk       = star(dot)
+:- func identifier = regexp.    % identifier = identstart >> *(ident)
+:- func whitespace = regexp.    % whitespace = *(wspc)
+:- func junk = regexp.          % junk       = *(dot)
 
 
+   % Utility function to return noval tokens.
+   % Use it in the form `return(my_token)' inside a lexeme definition.
+:- func return(T,S) = T.
+
+   % Utility operator to create lexemes.
+:- func (T1 -> token_creator(Tok)) = pair(regexp,token_creator(Tok))
+	<= regexp(T1).
 
     % Construct a lexer from which we can generate running
     % instances.
     %
-:- func init(list(annotated_lexeme(Tok)), read_pred(Src)) = lexer(Tok, Src).
+:- func init(list(lexeme(Tok)), read_pred(Src)) = lexer(Tok, Src).
 :- mode init(in, in(read_pred)) = out(lexer) is det.
 
-    % Handy read predicates.
+    % Construct a lexer from which we can generate running
+    % instances. If we construct a lexer with init/4, we
+    % can additionally ignore specific tokens.
     %
+:- func init(list(lexeme(Tok)), read_pred(Src), ignore_pred(Tok)) =
+	lexer(Tok, Src).
+:- mode init(in, in(read_pred), in(ignore_pred)) = out(lexer) is det.
+
+				% Handy read predicates.
+    %
 :- pred read_from_stdin(offset, read_result, io__state, io__state).
 :- mode read_from_stdin(in, out, di, uo) is det.
 
@@ -135,8 +153,8 @@
 :- func start(lexer(Tok, Src), Src) = lexer_state(Tok, Src).
 :- mode start(in(lexer), di) = uo is det.
 
-:- pred read(lexer_result(Tok), lexer_state(Tok, Src), lexer_state(Tok, Src)).
-:- mode read(out, di, uo) is det.
+:- pred read(io__read_result(Tok), lexer_state(Tok, Src), lexer_state(Tok, Src)).
+:- mode read(out, di, uo) is cc_multi.
 
     % Stop a running instance of a lexer and retrieve the input source.
     %
@@ -171,6 +189,7 @@
 :- type lexer(Token, Source)
     --->    lexer(
                 lex_compiled_lexemes    :: list(live_lexeme(Token)),
+		lex_ignore_pred         :: ignore_pred(Token),
                 lex_buf_read_pred       :: read_pred(Source)
             ).
 
@@ -178,27 +197,45 @@
 
 :- type lexer_instance(Token, Source)
     --->    lexer_instance(
-                lexi_init_lexemes       :: list(live_lexeme(Token)),
-                lexi_live_lexemes       :: list(live_lexeme(Token)),
-                lexi_current_winner     :: winner(Token),
-                lexi_buf_state          :: buf_state(Source)
+                init_lexemes       :: list(live_lexeme(Token)),
+                live_lexemes       :: list(live_lexeme(Token)),
+                current_winner     :: winner(Token),
+                buf_state          :: buf_state(Source),
+	        ignore_pred        :: ignore_pred(Token)
             ).
 :- inst lexer_instance
-    ==      bound(lexer_instance(ground, ground, ground, buf_state)).
-
-
+    --->      lex__lexer_instance(lex__live_lexeme_list, 
+                                  lex__live_lexeme_list, 
+	                          lex__winner, 
+                                  lex__buf__buf_state,
+				  lex__ignore_pred).
 
 :- type live_lexeme(Token)
-    ==      compiled_lexeme(annotated_token(Token)).
+    ==      compiled_lexeme(Token).
+:- inst live_lexeme == compiled_lexeme(token_creator).
+:- inst live_lexeme_list == list__list_skel(live_lexeme).
 
 
 
 :- type winner(Token)
-    ==      maybe(pair(annotated_token(Token), offset)).
+    ==      maybe(pair(token_creator(Token), offset)).
+:- inst winner ---> yes(bound(token_creator - ground))
+               ;    no.
+
+%----------------------------------------------------------------------------- %
+return(Token, _) = Token.
+
+%----------------------------------------------------------------------------- %
+(R1 -> TC) = (re(R1) - TC).
 
 %----------------------------------------------------------------------------- %
+
+init(Lexemes, BufReadPred) =
+    init(Lexemes, BufReadPred, (pred(_Tok::in) is semidet :-
+			        std_util__semidet_fail)).
 
-init(Lexemes, BufReadPred) = lexer(CompiledLexemes, BufReadPred) :-
+init(Lexemes, BufReadPred, IgnorePred) =
+    lexer(CompiledLexemes, IgnorePred, BufReadPred) :-
     CompiledLexemes = list__map(lexeme__compile_lexeme, Lexemes).
 
 %----------------------------------------------------------------------------- %
@@ -215,7 +252,9 @@
 init_lexer_instance(Lexer, Instance, Buf) :-
     buf__init(Lexer ^ lex_buf_read_pred, BufState, Buf),
     InitLexemes = Lexer ^ lex_compiled_lexemes,
-    Instance    = lexer_instance(InitLexemes, InitLexemes, no, BufState).
+    IgnorePred  = Lexer ^ lex_ignore_pred,
+    Instance    = lexer_instance(InitLexemes, InitLexemes, no,
+				 BufState, IgnorePred).
 
 %----------------------------------------------------------------------------- %
 
@@ -231,19 +270,19 @@
 
 
 
-:- pred read_0(lexer_result(Tok),
+:- pred read_0(io__read_result(Tok),
             lexer_instance(Tok, Src), lexer_instance(Tok, Src),
             buf, buf, Src, Src).
 :- mode read_0(out,
             in(lexer_instance), out(lexer_instance),
-            array_di, array_uo, di, uo) is det.
+            array_di, array_uo, di, uo) is cc_multi.
 
     % Basically, just read chars from the buf and advance the live lexemes
     % until we have a winner or hit an error (no parse).
     %
 read_0(Result, Instance0, Instance, Buf0, Buf, Src0, Src) :-
 
-    BufState0    = Instance0 ^ lexi_buf_state,
+    BufState0    = Instance0 ^ buf_state,
 
     buf__read(BufReadResult, BufState0, BufState1, Buf0, Buf1, Src0, Src1),
     (
@@ -259,17 +298,17 @@
 
 %----------------------------------------------------------------------------- %
 
-:- pred process_char(lexer_result(Tok), char,
+:- pred process_char(io__read_result(Tok), char,
             lexer_instance(Tok, Src), lexer_instance(Tok, Src),
             buf_state(Src), buf, buf, Src, Src).
 :- mode process_char(out, in, in(lexer_instance), out(lexer_instance),
-            in(buf_state), array_di, array_uo, di, uo) is det.
+            in(buf_state), array_di, array_uo, di, uo) is cc_multi.
 
 process_char(Result, Char, Instance0, Instance,
         BufState, Buf0, Buf, Src0, Src) :-
 
-    LiveLexemes0 = Instance0 ^ lexi_live_lexemes,
-    Winner0      = Instance0 ^ lexi_current_winner,
+    LiveLexemes0 = Instance0 ^ live_lexemes,
+    Winner0      = Instance0 ^ current_winner,
 
     advance_live_lexemes(Char, buf__cursor_offset(BufState),
             LiveLexemes0, LiveLexemes, Winner0, Winner),
@@ -282,44 +321,57 @@
         LiveLexemes = [_ | _],          % Still some open possibilities.
 
         Instance1 = (((Instance0
-                            ^ lexi_live_lexemes   := LiveLexemes)
-                            ^ lexi_current_winner := Winner)
-                            ^ lexi_buf_state      := BufState),
+                            ^ live_lexemes   := LiveLexemes)
+                            ^ current_winner := Winner)
+                            ^ buf_state      := BufState),
         read_0(Result, Instance1, Instance, Buf0, Buf, Src0, Src)
     ).
 
 %----------------------------------------------------------------------------- %
 
-:- pred process_any_winner(lexer_result(Tok), winner(Tok),
+:- pred process_any_winner(io__read_result(Tok), winner(Tok),
             lexer_instance(Tok, Src), lexer_instance(Tok, Src), 
             buf_state(Src), buf, buf, Src, Src).
-:- mode process_any_winner(out, in,
+:- mode process_any_winner(out, in(winner),
             in(lexer_instance), out(lexer_instance),
-            in(buf_state), array_di, array_uo, di, uo) is det.
+            in(buf_state), array_di, array_uo, di, uo) is cc_multi.
 
-process_any_winner(Result, yes(ATok - Offset), Instance0, Instance,
+process_any_winner(Result, yes(TokenCreator - Offset), Instance0, Instance,
         BufState0, Buf0, Buf, Src0, Src) :-
 
     BufState1 = buf__rewind_cursor(Offset, BufState0),
     Instance1 = ((( Instance0
-                        ^ lexi_live_lexemes   := Instance0 ^ lexi_init_lexemes)
-                        ^ lexi_current_winner := no)
-                        ^ lexi_buf_state      := buf__commit(BufState1)),
+                        ^ live_lexemes   := Instance0 ^ init_lexemes)
+                        ^ current_winner := no)
+                        ^ buf_state      := buf__commit(BufState1)),
+    Match    = buf__string_to_cursor(BufState1, Buf0),
+    try((pred(Tok::out) is det :- Tok = TokenCreator(Match)), TryRes),
     (
-        ATok     = noval(Token),
-        Result   = ok(Token),
-        Instance = Instance1,
-        Buf      = Buf0,
-        Src      = Src0
-    ;
-        ATok     = value(Token),
-        Result   = ok(Token, buf__string_to_cursor(BufState1, Buf)),
-        Instance = Instance1,
-        Buf      = Buf0,
-        Src      = Src0
+      TryRes = succeeded(TokenCreatorRes),
+      Token = TokenCreatorRes,
+      IgnorePred = Instance0 ^ ignore_pred,
+      (
+	IgnorePred(Token)
+      ->
+	read_0(Result, Instance1, Instance, Buf0, Buf, Src0, Src)
+      ;
+	Result = ok(Token),
+	Instance = Instance1,
+	Buf      = Buf0,
+	Src      = Src0
+      )
     ;
-        ATok     = ignore,
-        read_0(Result, Instance1, Instance, Buf0, Buf, Src0, Src)
+      TryRes = exception(Exc) ,
+      (
+	std_util__univ_to_type(Exc,Mes)
+      ->
+	Result = error(Mes, Offset)
+      ;
+	rethrow(TryRes)
+      ),
+      Instance = Instance1,
+      Buf      = Buf0,
+      Src      = Src0
     ).
 
 process_any_winner(Result, no, Instance0, Instance,
@@ -327,16 +379,16 @@
 
     Start     = buf__start_offset(BufState0),
     BufState1 = buf__rewind_cursor(Start + 1, BufState0),
-    Result    = error(Start),
+    Result    = error("input not matched by any regexp", Start),
     Instance  = ((( Instance0
-                        ^ lexi_live_lexemes   :=
-                                Instance0 ^ lexi_init_lexemes)
-                        ^ lexi_current_winner := no)
-                        ^ lexi_buf_state      := buf__commit(BufState1)).
+                        ^ live_lexemes   :=
+                                Instance0 ^ init_lexemes)
+                        ^ current_winner := no)
+                        ^ buf_state      := buf__commit(BufState1)).
 
 %----------------------------------------------------------------------------- %
 
-:- pred process_eof(lexer_result(Tok),
+:- pred process_eof(io__read_result(Tok),
             lexer_instance(Tok, Src), lexer_instance(Tok, Src),
             buf_state(Src), buf).
 :- mode process_eof(out, in(lexer_instance), out(lexer_instance),
@@ -346,44 +398,47 @@
 
     ( if
 
-        live_lexeme_in_accepting_state(Instance0 ^ lexi_live_lexemes, ATok)
+        live_lexeme_in_accepting_state(Instance0 ^ live_lexemes, TokenCreator)
 
       then
 
             % Return the token and set things up so that we return
             % eof next.
-        (
-            ATok   = noval(Token),
-            Result = ok(Token)
-        ;
-            ATok   = value(Token),
-            Result = ok(Token, buf__string_to_cursor(BufState, Buf))
-        ;
-            ATok   = ignore,
-            Result = eof
-        )
+            % handle exceptions
+            Match  = buf__string_to_cursor(BufState, Buf),
+            Token = TokenCreator(Match),
+            IgnorePred = Instance0 ^ ignore_pred,
+	    (
+	      IgnorePred(Token)
+	    ->
+	      Result = eof
+	    ;
+	      Result = ok(Token)
+	    )
 
       else
 
         Result     = eof
     ),
     Instance  = ((Instance0
-                        ^ lexi_live_lexemes := [])
-                        ^ lexi_buf_state    := buf__commit(BufState)).
+                        ^ live_lexemes := [])
+                        ^ buf_state    := buf__commit(BufState)).
 
 %----------------------------------------------------------------------------- %
 
 :- pred advance_live_lexemes(char, offset,
             list(live_lexeme(Token)), list(live_lexeme(Token)),
             winner(Token), winner(Token)).
-:- mode advance_live_lexemes(in, in, in, out, in, out) is det.
+:- mode advance_live_lexemes(in, in, in(live_lexeme_list), 
+                             out(live_lexeme_list), 
+			     in(winner), out(winner)) is det.
 
 advance_live_lexemes(_Char, _Offset, [], [], Winner, Winner).
 
 advance_live_lexemes(Char, Offset, [L0 | Ls0], Ls, Winner0, Winner) :-
 
-    State0        = L0 ^ clxm_state,
-    ATok          = L0 ^ clxm_token,
+    State0        = L0 ^ state,
+    ATok          = L0 ^ token,
 
     ( if next_state(L0, State0, Char, State, IsAccepting) then
 
@@ -395,7 +450,7 @@
             Winner1     = yes(ATok - Offset)
         ),
         advance_live_lexemes(Char, Offset, Ls0, Ls1, Winner1, Winner),
-        Ls = [( L0 ^ clxm_state := State ) | Ls1]
+        Ls = [( L0 ^ state := State ) | Ls1]
 
       else
 
@@ -405,15 +460,17 @@
 %----------------------------------------------------------------------------- %
 
 :- pred live_lexeme_in_accepting_state(list(live_lexeme(Tok)),
-                annotated_token(Tok)).
-:- mode live_lexeme_in_accepting_state(in, out) is semidet.
+                token_creator(Tok)).
+:- mode live_lexeme_in_accepting_state(in(live_lexeme_list), 
+	                  out(token_creator)) is semidet.
 
 live_lexeme_in_accepting_state([L | Ls], Token) :-
-    ( if in_accepting_state(L)
-      then Token = L ^ clxm_token
+    ( if lex__lexeme__in_accepting_state(L)
+      then Token = L ^ token
       else live_lexeme_in_accepting_state(Ls, Token)
     ).
 
+
 %----------------------------------------------------------------------------- %
 %----------------------------------------------------------------------------- %
 
@@ -471,6 +528,39 @@
         Result = eof
     ).
 
+
+%----------------------------------------------------------------------------- %
+% The type of regular expressions.
+:- type regexp
+    --->    re_null                   % The empty regexp
+    ;       re_atom(char)             % Match a single char
+    ;       re_concat(regexp,regexp)  % Concatenation
+    ;       re_or(regexp, regexp)     % Alternation
+    ;       re_star(regexp)           % Kleene closure
+    .
+
+%----------------------------------------------------------------------------- %
+% Some instances of typeclass regexp(T)
+:- instance regexp(regexp) where [
+	  re(RE) = RE
+].
+
+:- instance regexp(char) where [
+	  re(C) = re_atom(C)
+].
+
+:- instance regexp(string) where [
+	  re(S) = str(S)
+].
+
+
+%----------------------------------------------------------------------------- %
+% Basic primitive regexps.
+null = re_null.
+R1 >> R2 = re_concat(re(R1), re(R2)).
+R1 \/ R2 = re_or(re(R1), re(R2)).
+*(R1)    = re_star(re(R1)).
+
 %----------------------------------------------------------------------------- %
 % Some basic non-primitive regexps.
 
@@ -480,7 +570,7 @@
       else
         L = string__length(S),
         C = string__index_det(S, L - 1),
-        R = str_foldr(func(Cx, Rx) = (atom(Cx) >> Rx), S, atom(C), L - 2)
+        R = str_foldr(func(Cx, Rx) = (re(Cx) >> Rx), S, re(C), L - 2)
     ).
 
 any(S) = R :-
@@ -489,7 +579,7 @@
       else
         L = string__length(S),
         C = string__index_det(S, L - 1),
-        R = str_foldr(func(Cx, Rx) = (atom(Cx) \/ Rx), S, atom(C), L - 2)
+        R = str_foldr(func(Cx, Rx) = (re(Cx) \/ Rx), S, re(C), L - 2)
     ).
 
 anybut(S0) = R :-
@@ -511,9 +601,9 @@
                else str_foldr(Fn, S, Fn(string__index_det(S, I), X), I - 1)
     ).
 
-opt(R)  = (R \/ null).
+?(R)  = (R \/ null).
 
-plus(R) = (R >> star(R)).
++(R) = (R >> *(R)).
 
 %----------------------------------------------------------------------------- %
 % Some useful single-char regexps.
@@ -535,24 +625,24 @@
 dot        = anybut("\n").
 alpha      = (lower \/ upper).
 alphanum   = (alpha \/ digit).
-identstart = (alpha \/ atom('_')).
-ident      = (alphanum \/ atom('_')).
-nl         = atom('\n').
-tab        = atom('\t').
-spc        = atom(' ').
+identstart = (alpha \/ re('_')).
+ident      = (alphanum \/ re('_')).
+nl         = re('\n').
+tab        = re('\t').
+spc        = re(' ').
 
 %----------------------------------------------------------------------------- %
 % Some useful compound regexps.
 
-nat        = plus(digit).
-signed_int = (opt(any("+-")) >> nat).
+nat        = +(digit).
+signed_int = (?(any("+-")) >> nat).
 real       = (signed_int >> (
-                (atom('.') >> nat >> opt(any("eE") >> signed_int)) \/
+                (re('.') >> nat >> ?(any("eE") >> signed_int)) \/
                 (any("eE") >> signed_int)
              )).
-identifier = (identstart >> star(ident)).
-whitespace = star(wspc).
-junk       = star(dot).
+identifier = (identstart >> *(ident)).
+whitespace = *(wspc).
+junk       = *(dot).
 
 %----------------------------------------------------------------------------- %
 %----------------------------------------------------------------------------- %
Index: lex.regexp.m
===================================================================
RCS file: /home/mercury1/repository/mercury/extras/lex/lex.regexp.m,v
retrieving revision 1.1
diff -u -r1.1 lex.regexp.m
--- lex.regexp.m	2001/02/21 16:29:37	1.1
+++ lex.regexp.m	2001/08/05 16:36:40
@@ -1,16 +1,20 @@
 %----------------------------------------------------------------------------- %
 % lex.regexp.m
-% Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
 % Fri Aug 18 06:43:09 BST 2000
+% Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
+%   THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
+%   BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
+%   BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
+% Thu Jul 26 07:45:47 UTC 2001
+% Copyright (C) 2001 The Rationalizer Intelligent Software AG
+%   The changes made by Rationalizer are contributed under the terms 
+%   of the GNU Lesser General Public License.
+%
 % vim: ts=4 sw=4 et tw=0 wm=0 ff=unix
 %
 % Converts basic regular expressions into non-deterministic finite
 % automata (NFAs).
 %
-%   THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
-%   BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
-%   BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
-%
 %----------------------------------------------------------------------------- %
 
 :- module lex__regexp.
@@ -53,20 +57,20 @@
 
     % The primitive regexps.
 
-compile(X, null,       Y, [null(X, Y)]) --> [].
+compile(X, re_null,       Y, [null(X, Y)]) --> [].
 
-compile(X, atom(C),    Y, [trans(X, C, Y)]) --> [].
+compile(X, re_atom(C),    Y, [trans(X, C, Y)]) --> [].
 
-compile(X, (RA >> RB), Y, TsA ++ TsB) -->
+compile(X, re_concat(RA,RB), Y, TsA ++ TsB) -->
     counter__allocate(Z),
     compile(X, RA, Z, TsA),
     compile(Z, RB, Y, TsB).
 
-compile(X, (RA \/ RB), Y, TsA ++ TsB) -->
+compile(X, re_or(RA,RB), Y, TsA ++ TsB) -->
     compile(X, RA, Y, TsA),
     compile(X, RB, Y, TsB).
 
-compile(X, star(R),    Y, TsA ++ TsB) -->
+compile(X, re_star(R),    Y, TsA ++ TsB) -->
     compile(X, null, Y, TsA),
     compile(X, R,    X, TsB).
 
Index: samples/demo.m
===================================================================
RCS file: /home/mercury1/repository/mercury/extras/lex/samples/demo.m,v
retrieving revision 1.1
diff -u -r1.1 demo.m
--- samples/demo.m	2001/02/21 16:29:39	1.1
+++ samples/demo.m	2001/08/05 16:36:40
@@ -1,12 +1,17 @@
 %----------------------------------------------------------------------------- %
 % demo.m
-% Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
 % Sun Aug 20 18:11:42 BST 2000
-% vim: ts=4 sw=4 et tw=0 wm=0 ff=unix ft=mercury
-%
+% Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
 %   THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
 %   BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
 %   BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
+% Thu Jul 26 07:45:47 UTC 2001
+% Copyright (C) 2001 The Rationalizer Intelligent Software AG
+%   The changes made by Rationalizer are contributed under the terms 
+%   of the GNU General Public License - see the file COPYING in the
+%   Mercury Distribution.
+%
+% vim: ts=4 sw=4 et tw=0 wm=0 ff=unix ft=mercury
 %
 %----------------------------------------------------------------------------- %
 
@@ -16,14 +21,14 @@
 
 :- import_module io.
 
-:- pred main(io__state::di, io__state::uo) is det.
+:- pred main(io__state::di, io__state::uo) is cc_multi.
 
 %----------------------------------------------------------------------------- %
 %----------------------------------------------------------------------------- %
 
 :- implementation.
 
-:- import_module char, string, exception, array, list.
+:- import_module char, string, exception, array, list, std_util, int.
 :- import_module lex.
 
 %----------------------------------------------------------------------------- %
@@ -38,10 +43,18 @@
 
 Try me...
 
+I don't recognise any `pragma'; try it nevertheless, because I give
+you a nice error message. I don't recognise any `module' either,
+but that error message is not so nice.
+
+... Now I'm even able to count numbers. Great, isn't it? But not
+too big ones; anywhere between 10 and 20 I lose the thread. Do not
+expect too much of me.
+
 "),
 
-    { Lexer  = lex__init(lexemes, lex__read_from_stdin) },
-    call((pred(IO0::di, IO::uo) is det :-
+    { Lexer  = lex__init(lexemes, lex__read_from_stdin, ignore_space) },
+    call((pred(IO0::di, IO::uo) is cc_multi :-
             State0 = lex__start(Lexer, IO0),
             tokenise_stdin(State0, State),
             IO     = lex__stop(State)
@@ -49,9 +62,14 @@
 
 %----------------------------------------------------------------------------- %
 
+:- pred ignore_space(token::in) is semidet.
+ignore_space(space).
+
+%----------------------------------------------------------------------------- %
+
 :- pred tokenise_stdin(lexer_state(token, io__state),
                 lexer_state(token, io__state)).
-:- mode tokenise_stdin(di, uo) is det.
+:- mode tokenise_stdin(di, uo) is cc_multi.
 
 tokenise_stdin -->
     lex__read(Result),
@@ -66,48 +84,67 @@
 %----------------------------------------------------------------------------- %
 
 :- type token
-    --->    noun
-    ;       comment
-    ;       integer
-    ;       real
-    ;       verb
-    ;       conj
-    ;       prep
+    --->    noun(string)
+    ;       comment(string)
+    ;       integer(int)
+    ;       real(float)
+    ;       verb(string)
+    ;       conj(string)
+    ;       prep(string)
     ;       punc
+    ;       space
+    ;       small_num(int)
     .
 
-:- func lexemes = list(annotated_lexeme(token)).
+:- func lexemes = list(lexeme(token)).
 
 lexemes = [
-    lexeme(value(comment),
-			(atom('%') >> junk)),
-
-    lexeme(value(integer),
-			(signed_int)),
-
-    lexeme(value(real),
-            (real)),
-
-    lexeme(value(noun), str("cat")),
-    lexeme(value(noun), str("dog")),
-    lexeme(value(noun), str("rat")),
-    lexeme(value(noun), str("mat")),
-
-    lexeme(value(verb),
-			(str("sat") \/ str("caught") \/ str("chased"))),
-
-    lexeme(value(conj),
-			(str("and") \/ str("then"))),
+    ("%" >> junk)        -  (func(Match) = comment(Match)) ,
+    (signed_int)         -  (func(Match) = integer(string__det_to_int(Match))),
+    (real)               -  (func(Match) = real(det_string_to_float(Match))) ,
+    re("cat")            - (func(Match) = noun(Match)) ,
+    re("dog")            -  (func(Match) = noun(Match)) ,
+    % Using `->' instead of `-', it is not necessary to call `re':
+    "rat"                -> (func(Match) = noun(Match)) ,
+    "mat"                -> (func(Match) = noun(Match)) ,
+    ("sat" \/ "caught" \/ "chased") -
+	                    (func(Match) = verb(Match)) ,
+    ("and" \/ "then")    -
+	                    (func(Match) = conj(Match)) ,
+    ("the" \/ "it" \/ "them" \/ "to" \/ "on") -
+	                    (func(Match) = prep(Match)) ,
+    any("~!@#$%^&*()_+`-={}|[]\\:"";'<>?,./") -
+                            return(punc) ,
+	
+    % an exception of string type, which is caught by lex:
+    re("pragma")         -  (func(Match) = noun(Match) :-
+                            throw("No pragma, please, I am not Mercury!")),
+	
+    % another exception, not caught by lex because it has the wrong type:
+    re("module")        -  (func(Match) = noun(Match) :-
+                            throw(0815)),
+
+    whitespace           -  return(space),
+
+    nat                  -  (func(Match) = small_num(Num) :-
+			        Num = string__det_to_int(Match),
+				(
+				  Num < 14
+				;
+				  throw("Hey, this number was really too big.")
+				))
+].
 
-    lexeme(value(prep),
-			(str("the") \/ str("it") \/ str("them") \/ str("to") \/ str("on"))),
+:- func det_string_to_float(string) = float.
 
-    lexeme(noval(punc),
-			(any("~!@#$%^&*()_+`-={}|[]\\:"";'<>?,./"))),
+:- import_module require.
 
-    lexeme(ignore,
-			whitespace)
-].
+det_string_to_float(String) = Float :-
+	string__to_float(String,Float) 
+        ; 
+	error("Wrong regular expression chosen for token `real'").
 
 %----------------------------------------------------------------------------- %
 %----------------------------------------------------------------------------- %
+
+

