[m-rev.] lex modified
Holger Krug
hkrug at rationalizer.com
Thu Jul 26 18:19:05 AEST 2001
* the value carried by a token may now belong to an arbitrary
  Mercury type, not only string as before
* lexemes may now contain a function that computes an arbitrary token
  value from the matched string
* defining lexemes is now a little more involved, because the user
  has to supply higher-order functions
* the advantage is that no postprocessing is needed to evaluate the
  matched string, so the interface to the consumer of the tokens is
  cleaner (see the short example below)
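
As a concrete illustration, here is a minimal sketch of a new-style lexeme
list, adapted from the updated README and samples/demo.m (a sketch only: the
num/signed_int pairing is illustrative; `identifier', `signed_int',
`whitespace', `atom', `star' and `dot' are regexps lex already provides):

    :- type token
        --->    comment
        ;       id(string)
        ;       num(int).

    Lexemes = [
        lexeme(noval(comment),
            (atom('%'), star(dot))),
        lexeme(t(func(Match) = id(Match)),
            identifier),
        lexeme(t(func(Match) = num(string__det_to_int(Match))),
            signed_int),
        lexeme(ignore,
            whitespace)
    ]

The noval/1 wrapper still yields a fixed token, t/1 wraps a function from the
matched string to a token, and ignore simply skips the match.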
The problem was that I had to copy some predicates from lex.lexeme.m
into lex.m, because I could not otherwise work around the inst problem
mentioned in my mail "strange mmc behavior"; the copied declarations are
quoted below.
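
For reference, the predicates copied into lex.m are in_accepting_state/1,
next_state/5 and the find_next_state helpers; their declarations now use the
new `live_lexeme' subtype inst rather than plain `in':

    :- pred in_accepting_state(live_lexeme(T)).
    :- mode in_accepting_state(in(live_lexeme)) is semidet.

    :- pred next_state(live_lexeme(T), state_no, char, state_no, bool).
    :- mode next_state(in(live_lexeme), in, in, out, out) is semidet.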
--
Holger Krug
hkrug at rationalizer.com
-------------- next part --------------
Index: README
===================================================================
RCS file: /usr/local/dev/cvs/rat/rationality/logic/mercury/extras/lex/README,v
retrieving revision 1.1
retrieving revision 1.3
diff -u -r1.1 -r1.3
--- README 2001/07/13 09:33:50 1.1
+++ README 2001/07/26 08:11:44 1.3
@@ -1,11 +1,29 @@
lex 1.0 (very alpha)
Fri Aug 25 17:54:28 2000
Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
-
THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
-
+Thu Jul 26 07:45:47 UTC 2001
+Copyright (C) 2001 The Rationalizer Intelligent Software AG
+ The changes made by Rationalizer are contributed under the terms
+ of the GNU General Public License - see the file COPYING in the
+ Mercury Distribution.
+
+%----------------------------------------------------------------------------%
+%
+% 07/26/01 hkrug at rationalizer.com:
+%   * the value carried by a token may now belong to an arbitrary
+%     Mercury type, not only string as before
+%   * lexemes may now contain a function that computes an arbitrary
+%     token value from the matched string
+%   * defining lexemes is now a little more involved, because the
+%     user has to supply higher-order functions
+%   * the advantage is that no postprocessing is needed to evaluate
+%     the matched string, so the interface to the consumer of the
+%     tokens is cleaner
+%
+%----------------------------------------------------------------------------%
This package defines a lexer for Mercury. There is plenty of scope
@@ -23,20 +41,30 @@
:- type token
---> comment
- ; id
- ; num.
+ ; id(string)
+ ; num(int).
3. Set up a list of annotated_lexemes.
Lexemes = [
- lexeme(noval(comment), (atom('%'), star(dot))),
- lexeme(value(id), identifier),
- lexeme(ignore, whitespace)
+ lexeme( noval(comment),
+ (atom('%'), star(dot))
+ ),
+ lexeme( t( func(Match) = id(Match) ),
+ identifier
+ ),
+ lexeme( ignore ,
+ whitespace
+ )
]
-noval tokens are simply identified;
-value tokens are identified and returned with the string matched;
-ignore regexps are simply passed over.
+Lexemes are pairs. Their first entry is one of:
+* noval(FixedToken)
+* t(TokenCreator)
+* ignore
+where `FixedToken' is a fixed token value and `TokenCreator' is a function
+that computes a token from the string matched by the regular expression
+forming the second part of the lexeme.
4. Set up a lexer with an appropriate read predicate (see the buf module).
Index: lex.lexeme.m
===================================================================
RCS file: /usr/local/dev/cvs/rat/rationality/logic/mercury/extras/lex/lex.lexeme.m,v
retrieving revision 1.1
retrieving revision 1.2
diff -u -r1.1 -r1.2
--- lex.lexeme.m 2001/07/13 09:33:53 1.1
+++ lex.lexeme.m 2001/07/26 08:07:10 1.2
@@ -1,16 +1,22 @@
%-----------------------------------------------------------------------------
% lex.lexeme.m
-% Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
% Sat Aug 19 08:22:32 BST 2000
+% Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
+% THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
+% BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
+% BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
+% Thu Jul 26 07:45:47 UTC 2001
+% Copyright (C) 2001 The Rationalizer Intelligent Software AG
+% The changes made by Rationalizer are contributed under the terms
+% of the GNU General Public License - see the file COPYING in the
+% Mercury Distribution.
+%
% vim: ts=4 sw=4 et tw=0 wm=0 ff=unix
%
% A lexeme combines a token with a regexp. The lexer compiles
% lexemes and returns the longest successful parse in the input
% stream or an error if no match occurs.
%
-% THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
-% BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
-% BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
%
%----------------------------------------------------------------------------- %
@@ -26,6 +32,8 @@
clxm_state :: state_no,
clxm_transition_map :: transition_map
).
+:- inst compiled_lexeme(Inst) ==
+ bound(compiled_lexeme(Inst,ground,ground)).
:- type transition_map
---> transition_map(
@@ -59,12 +67,12 @@
% an accepting state_no.
%
:- pred next_state(compiled_lexeme(T), state_no, char, state_no, bool).
-:- mode next_state(in, in, in, out, out) is semidet.
+:- mode next_state(in(live_lexeme), in, in, out, out) is semidet.
% Succeeds iff a compiled_lexeme is in an accepting state_no.
%
-:- pred in_accepting_state(compiled_lexeme(T)).
-:- mode in_accepting_state(in) is semidet.
+:- pred in_accepting_state(live_lexeme(T)).
+:- mode in_accepting_state(in(live_lexeme)) is semidet.
%----------------------------------------------------------------------------- %
%----------------------------------------------------------------------------- %
@@ -159,7 +167,7 @@
next_state(CLXM, CurrentState, Char, NextState, IsAccepting) :-
Rows = CLXM ^ clxm_transition_map ^ trm_rows,
AcceptingStates = CLXM ^ clxm_transition_map ^ trm_accepting_states,
- find_next_state(char__to_int(Char), Rows ^ elem(CurrentState), NextState),
+ lex__lexeme__find_next_state(char__to_int(Char), Rows ^ elem(CurrentState), NextState),
IsAccepting = bitmap__get(AcceptingStates, NextState).
%------------------------------------------------------------------------------%
@@ -170,7 +178,7 @@
find_next_state(Byte, ByteTransitions, State) :-
Lo = array__min(ByteTransitions),
Hi = array__max(ByteTransitions),
- find_next_state_0(Lo, Hi, Byte, ByteTransitions, State).
+ lex__lexeme__find_next_state_0(Lo, Hi, Byte, ByteTransitions, State).
@@ -182,7 +190,7 @@
ByteTransition = ByteTransitions ^ elem(Lo),
( if ByteTransition ^ btr_byte = Byte
then State = ByteTransition ^ btr_state
- else find_next_state_0(Lo + 1, Hi, Byte, ByteTransitions, State)
+ else lex__lexeme__find_next_state_0(Lo + 1, Hi, Byte, ByteTransitions, State)
).
%------------------------------------------------------------------------------%
Index: lex.m
===================================================================
RCS file: /usr/local/dev/cvs/rat/rationality/logic/mercury/extras/lex/lex.m,v
retrieving revision 1.1
retrieving revision 1.2
diff -u -r1.1 -r1.2
--- lex.m 2001/07/13 09:33:53 1.1
+++ lex.m 2001/07/26 08:07:11 1.2
@@ -2,15 +2,20 @@
% lex.m
% Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
% Sun Aug 20 09:08:46 BST 2000
+% THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
+% BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
+% BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
+% Thu Jul 26 07:45:47 UTC 2001
+% Copyright (C) 2001 The Rationalizer Intelligent Software AG
+% The changes made by Rationalizer are contributed under the terms
+% of the GNU General Public License - see the file COPYING in the
+% Mercury Distribution.
+%
% vim: ts=4 sw=4 et tw=0 wm=0 ff=unix
%
% This module puts everything together, compiling a list of lexemes
% into state machines and turning the input stream into a token stream.
%
-% THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
-% BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
-% BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
-%
%----------------------------------------------------------------------------- %
:- module lex.
@@ -24,31 +29,36 @@
:- type annotated_lexeme(Token)
== lexeme(annotated_token(Token)).
+:- type token_creator(Token) == (func(string)=Token).
+:- inst token_creator == (func(in)=out is det).
+
:- type lexeme(Token)
---> lexeme(
lxm_token :: Token,
lxm_regexp :: regexp
).
+:- inst lexeme(Inst) == bound(lexeme(Inst,ground)).
:- type annotated_token(Token)
- ---> noval(Token) % Just return ok(Token) on success.
- ; value(Token) % Return ok(Token, String) on success.
- ; ignore. % Just skip over these tokens.
+ ---> t(token_creator(Token)) % Return ok(apply(TokenCreator, Match))
+ % on success.
+ ; noval(Token) % Just return ok(Token) on success.
+ ; ignore. % Just skip over these tokens.
+:- inst annotated_token == bound( t(token_creator); noval(ground); ignore).
+
:- type lexer(Token, Source).
-:- inst lexer
- == bound(lexer(ground, read_pred)).
+:- inst lexer == bound(lexer(ground, read_pred)).
+
:- type lexer_state(Token, Source).
:- type lexer_result(Token)
- ---> ok(Token) % Noval token matched.
- ; ok(Token, string) % Value token matched.
+ ---> ok(Token) % Token matched.
; eof % End of input.
; error(int). % No matches for string at this offset.
-
:- type offset
== int. % Byte offset into the source data.
@@ -184,17 +194,19 @@
lexi_buf_state :: buf_state(Source)
).
:- inst lexer_instance
- == bound(lexer_instance(ground, ground, ground, buf_state)).
-
-
+ == bound(lexer_instance(live_lexeme_list, live_lexeme_list,
+ winner, buf_state)).
:- type live_lexeme(Token)
== compiled_lexeme(annotated_token(Token)).
+:- inst live_lexeme == compiled_lexeme(annotated_token).
+:- inst live_lexeme_list == list__list_skel(live_lexeme).
:- type winner(Token)
== maybe(pair(annotated_token(Token), offset)).
+:- inst winner == bound( yes(bound(annotated_token - ground)); no).
%----------------------------------------------------------------------------- %
@@ -293,7 +305,7 @@
:- pred process_any_winner(lexer_result(Tok), winner(Tok),
lexer_instance(Tok, Src), lexer_instance(Tok, Src),
buf_state(Src), buf, buf, Src, Src).
-:- mode process_any_winner(out, in,
+:- mode process_any_winner(out, in(winner),
in(lexer_instance), out(lexer_instance),
in(buf_state), array_di, array_uo, di, uo) is det.
@@ -306,14 +318,15 @@
^ lexi_current_winner := no)
^ lexi_buf_state := buf__commit(BufState1)),
(
- ATok = noval(Token),
- Result = ok(Token),
+ ATok = t(TokenCreator),
+ Match = buf__string_to_cursor(BufState1, Buf),
+ Result = ok(apply(TokenCreator, Match)),
Instance = Instance1,
Buf = Buf0,
Src = Src0
;
- ATok = value(Token),
- Result = ok(Token, buf__string_to_cursor(BufState1, Buf)),
+ ATok = noval(Token),
+ Result = ok(Token),
Instance = Instance1,
Buf = Buf0,
Src = Src0
@@ -353,12 +366,13 @@
% Return the token and set things up so that we return
% eof next.
(
+ ATok = t(TokenCreator),
+ Match = buf__string_to_cursor(BufState, Buf),
+ Result = ok(apply(TokenCreator, Match))
+ ;
ATok = noval(Token),
Result = ok(Token)
;
- ATok = value(Token),
- Result = ok(Token, buf__string_to_cursor(BufState, Buf))
- ;
ATok = ignore,
Result = eof
)
@@ -376,7 +390,9 @@
:- pred advance_live_lexemes(char, offset,
list(live_lexeme(Token)), list(live_lexeme(Token)),
winner(Token), winner(Token)).
-:- mode advance_live_lexemes(in, in, in, out, in, out) is det.
+:- mode advance_live_lexemes(in, in, in(live_lexeme_list),
+ out(live_lexeme_list),
+ in(winner), out(winner)) is det.
advance_live_lexemes(_Char, _Offset, [], [], Winner, Winner).
@@ -385,7 +401,7 @@
State0 = L0 ^ clxm_state,
ATok = L0 ^ clxm_token,
- ( if next_state(L0, State0, Char, State, IsAccepting) then
+ ( if lex__next_state(L0, State0, Char, State, IsAccepting) then
(
IsAccepting = no,
@@ -406,14 +422,16 @@
:- pred live_lexeme_in_accepting_state(list(live_lexeme(Tok)),
annotated_token(Tok)).
-:- mode live_lexeme_in_accepting_state(in, out) is semidet.
+:- mode live_lexeme_in_accepting_state(in(live_lexeme_list),
+ out(annotated_token)) is semidet.
live_lexeme_in_accepting_state([L | Ls], Token) :-
- ( if in_accepting_state(L)
+ ( if lex__in_accepting_state(L)
then Token = L ^ clxm_token
else live_lexeme_in_accepting_state(Ls, Token)
).
+
%----------------------------------------------------------------------------- %
%----------------------------------------------------------------------------- %
@@ -556,3 +574,41 @@
%----------------------------------------------------------------------------- %
%----------------------------------------------------------------------------- %
+
+:- import_module bitmap.
+:- pred in_accepting_state(live_lexeme(T)).
+:- mode in_accepting_state(in(live_lexeme)) is semidet.
+in_accepting_state(CLXM) :-
+ bitmap__is_set(
+ CLXM ^ clxm_transition_map ^ trm_accepting_states, CLXM ^ clxm_state
+ ).
+
+
+:- pred next_state(live_lexeme(T), state_no, char, state_no, bool).
+:- mode next_state(in(live_lexeme), in, in, out, out) is semidet.
+
+next_state(CLXM, CurrentState, Char, NextState, IsAccepting) :-
+ Rows = CLXM ^ clxm_transition_map ^ trm_rows,
+ AcceptingStates = CLXM ^ clxm_transition_map ^ trm_accepting_states,
+ lex__find_next_state(char__to_int(Char), Rows ^ elem(CurrentState), NextState),
+ IsAccepting = bitmap__get(AcceptingStates, NextState).
+
+:- pred find_next_state(int, array(byte_transition), state_no).
+:- mode find_next_state(in, in, out) is semidet.
+
+find_next_state(Byte, ByteTransitions, State) :-
+ Lo = array__min(ByteTransitions),
+ Hi = array__max(ByteTransitions),
+ lex__find_next_state_0(Lo, Hi, Byte, ByteTransitions, State).
+
+
+:- pred find_next_state_0(int, int, int, array(byte_transition), state_no).
+:- mode find_next_state_0(in, in, in, in, out) is semidet.
+
+find_next_state_0(Lo, Hi, Byte, ByteTransitions, State) :-
+ Lo =< Hi,
+ ByteTransition = ByteTransitions ^ elem(Lo),
+ ( if ByteTransition ^ btr_byte = Byte
+ then State = ByteTransition ^ btr_state
+ else lex__find_next_state_0(Lo + 1, Hi, Byte, ByteTransitions, State)
+ ).
Index: samples/demo.m
===================================================================
RCS file: /usr/local/dev/cvs/rat/rationality/logic/mercury/extras/lex/samples/demo.m,v
retrieving revision 1.1
retrieving revision 1.2
diff -u -r1.1 -r1.2
--- samples/demo.m 2001/07/13 09:33:56 1.1
+++ samples/demo.m 2001/07/26 08:07:46 1.2
@@ -1,12 +1,17 @@
%----------------------------------------------------------------------------- %
% demo.m
-% Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
% Sun Aug 20 18:11:42 BST 2000
-% vim: ts=4 sw=4 et tw=0 wm=0 ff=unix ft=mercury
-%
+% Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
% THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
% BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
% BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
+% Thu Jul 26 07:45:47 UTC 2001
+% Copyright (C) 2001 The Rationalizer Intelligent Software AG
+% The changes made by Rationalizer are contributed under the terms
+% of the GNU General Public License - see the file COPYING in the
+% Mercury Distribution.
+%
+% vim: ts=4 sw=4 et tw=0 wm=0 ff=unix ft=mercury
%
%----------------------------------------------------------------------------- %
@@ -66,48 +71,58 @@
%----------------------------------------------------------------------------- %
:- type token
- ---> noun
- ; comment
- ; integer
- ; real
- ; verb
- ; conj
- ; prep
+ ---> noun(string)
+ ; comment(string)
+ ; integer(int)
+ ; real(float)
+ ; verb(string)
+ ; conj(string)
+ ; prep(string)
; punc
.
:- func lexemes = list(annotated_lexeme(token)).
lexemes = [
- lexeme(value(comment),
- (atom('%') >> junk)),
+% lexeme(ignore, (atom('%') >> junk)),
+ lexeme( t(func(Match) = comment(Match)), (atom('%') >> junk)),
+% lexeme( t(comment), (atom('%') >> junk)),
- lexeme(value(integer),
+ lexeme( t(func(Match) = integer(string__det_to_int(Match))),
(signed_int)),
- lexeme(value(real),
- (real)),
+ lexeme( t(func(Match) = real(det_string_to_float(Match))), (real)),
- lexeme(value(noun), str("cat")),
- lexeme(value(noun), str("dog")),
- lexeme(value(noun), str("rat")),
- lexeme(value(noun), str("mat")),
+ lexeme( t(func(Match) = noun(Match)), str("cat")),
+ lexeme( t(func(Match) = noun(Match)), str("dog")),
+ lexeme( t(func(Match) = noun(Match)), str("rat")),
+ lexeme( t(func(Match) = noun(Match)), str("mat")),
- lexeme(value(verb),
+ lexeme( t(func(Match) = verb(Match)),
(str("sat") \/ str("caught") \/ str("chased"))),
- lexeme(value(conj),
+ lexeme( t(func(Match) = conj(Match)),
(str("and") \/ str("then"))),
- lexeme(value(prep),
+ lexeme( t(func(Match) = prep(Match)),
(str("the") \/ str("it") \/ str("them") \/ str("to") \/ str("on"))),
lexeme(noval(punc),
(any("~!@#$%^&*()_+`-={}|[]\\:"";'<>?,./"))),
- lexeme(ignore,
- whitespace)
+ lexeme(ignore, whitespace)
].
+:- func det_string_to_float(string) = float.
+
+:- import_module require.
+
+det_string_to_float(String) = Float :-
+ string__to_float(String,Float)
+ ;
+ error("Wrong regular expression chosen for token `real'").
+
%----------------------------------------------------------------------------- %
%----------------------------------------------------------------------------- %
+
+