[m-rev.] lex modified
Holger Krug
hkrug at rationalizer.com
Thu Jul 26 18:19:05 AEST 2001
* the value carried by a token may now belong to an arbitrary
  Mercury type, not only string as before
* lexemes may now contain a function that computes an arbitrary token
  value from the matched string
* defining lexemes is now a little more involved, because the user
  has to supply higher-order functions
* the advantage is that no postprocessing is needed to evaluate the
  matched string, so the interface to the consumer of the tokens is
  cleaner (see the short example below)
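
As a concrete illustration, here is a minimal sketch of a new-style lexeme
list, adapted from the updated README and samples/demo.m (a sketch only: the
num/signed_int pairing is illustrative; `identifier', `signed_int',
`whitespace', `atom', `star' and `dot' are regexps lex already provides):

    :- type token
        --->    comment
        ;       id(string)
        ;       num(int).

    Lexemes = [
        lexeme(noval(comment),
            (atom('%'), star(dot))),
        lexeme(t(func(Match) = id(Match)),
            identifier),
        lexeme(t(func(Match) = num(string__det_to_int(Match))),
            signed_int),
        lexeme(ignore,
            whitespace)
    ]

The noval/1 wrapper still yields a fixed token, t/1 wraps a function from the
matched string to a token, and ignore simply skips the match.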
The problem was that I had to copy some predicates from lex.lexeme.m
into lex.m, because I could not otherwise work around the inst problem
mentioned in my mail "strange mmc behavior"; the copied declarations are
quoted below.
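
For reference, the predicates copied into lex.m are in_accepting_state/1,
next_state/5 and the find_next_state helpers; their declarations now use the
new `live_lexeme' subtype inst rather than plain `in':

    :- pred in_accepting_state(live_lexeme(T)).
    :- mode in_accepting_state(in(live_lexeme)) is semidet.

    :- pred next_state(live_lexeme(T), state_no, char, state_no, bool).
    :- mode next_state(in(live_lexeme), in, in, out, out) is semidet.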
--
Holger Krug
hkrug at rationalizer.com
-------------- next part --------------
Index: README
===================================================================
RCS file: /usr/local/dev/cvs/rat/rationality/logic/mercury/extras/lex/README,v
retrieving revision 1.1
retrieving revision 1.3
diff -u -r1.1 -r1.3
--- README 2001/07/13 09:33:50 1.1
+++ README 2001/07/26 08:11:44 1.3
@@ -1,11 +1,29 @@
lex 1.0 (very alpha)
Fri Aug 25 17:54:28 2000
Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
-
THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
-
+Thu Jul 26 07:45:47 UTC 2001
+Copyright (C) 2001 The Rationalizer Intelligent Software AG
+ The changes made by Rationalizer are contributed under the terms
+ of the GNU General Public License - see the file COPYING in the
+ Mercury Distribution.
+
+%----------------------------------------------------------------------------%
+%
+% 07/26/01 hkrug at rationalizer.com:
+%   * the value carried by a token may now belong to an arbitrary
+%     Mercury type, not only string as before
+%   * lexemes may now contain a function that computes an arbitrary
+%     token value from the matched string
+%   * defining lexemes is now a little more involved, because the
+%     user has to supply higher-order functions
+%   * the advantage is that no postprocessing is needed to evaluate
+%     the matched string, so the interface to the consumer of the
+%     tokens is cleaner
+%
+%----------------------------------------------------------------------------%
This package defines a lexer for Mercury. There is plenty of scope
@@ -23,20 +41,30 @@
:- type token
---> comment
- ; id
- ; num.
+ ; id(string)
+ ; num(int).
3. Set up a list of annotated_lexemes.
Lexemes = [
- lexeme(noval(comment), (atom('%'), star(dot))),
- lexeme(value(id), identifier),
- lexeme(ignore, whitespace)
+ lexeme( noval(comment),
+ (atom('%'), star(dot))
+ ),
+ lexeme( t( func(Match) = id(Match) ),
+ identifier
+ ),
+ lexeme( ignore ,
+ whitespace
+ )
]
-noval tokens are simply identified;
-value tokens are identified and returned with the string matched;
-ignore regexps are simply passed over.
+Lexemes are pairs. Their first entry is one of:
+* noval(FixedToken)
+* t(TokenCreator)
+* ignore
+where `FixedToken' is a fixed token value and `TokenCreator' is a function
+that computes a token from the string matched by the regular expression
+forming the second part of the lexeme.
4. Set up a lexer with an appropriate read predicate (see the buf module).
Index: lex.lexeme.m
===================================================================
RCS file: /usr/local/dev/cvs/rat/rationality/logic/mercury/extras/lex/lex.lexeme.m,v
retrieving revision 1.1
retrieving revision 1.2
diff -u -r1.1 -r1.2
--- lex.lexeme.m 2001/07/13 09:33:53 1.1
+++ lex.lexeme.m 2001/07/26 08:07:10 1.2
@@ -1,16 +1,22 @@
%-----------------------------------------------------------------------------
% lex.lexeme.m
-% Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
% Sat Aug 19 08:22:32 BST 2000
+% Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
+% THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
+% BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
+% BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
+% Thu Jul 26 07:45:47 UTC 2001
+% Copyright (C) 2001 The Rationalizer Intelligent Software AG
+% The changes made by Rationalizer are contributed under the terms
+% of the GNU General Public License - see the file COPYING in the
+% Mercury Distribution.
+%
% vim: ts=4 sw=4 et tw=0 wm=0 ff=unix
%
% A lexeme combines a token with a regexp. The lexer compiles
% lexemes and returns the longest successful parse in the input
% stream or an error if no match occurs.
%
-% THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
-% BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
-% BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
%
%----------------------------------------------------------------------------- %
@@ -26,6 +32,8 @@
clxm_state :: state_no,
clxm_transition_map :: transition_map
).
+:- inst compiled_lexeme(Inst) ==
+ bound(compiled_lexeme(Inst,ground,ground)).
:- type transition_map
---> transition_map(
@@ -59,12 +67,12 @@
% an accepting state_no.
%
:- pred next_state(compiled_lexeme(T), state_no, char, state_no, bool).
-:- mode next_state(in, in, in, out, out) is semidet.
+:- mode next_state(in(live_lexeme), in, in, out, out) is semidet.
% Succeeds iff a compiled_lexeme is in an accepting state_no.
%
-:- pred in_accepting_state(compiled_lexeme(T)).
-:- mode in_accepting_state(in) is semidet.
+:- pred in_accepting_state(live_lexeme(T)).
+:- mode in_accepting_state(in(live_lexeme)) is semidet.
%----------------------------------------------------------------------------- %
%----------------------------------------------------------------------------- %
@@ -159,7 +167,7 @@
next_state(CLXM, CurrentState, Char, NextState, IsAccepting) :-
Rows = CLXM ^ clxm_transition_map ^ trm_rows,
AcceptingStates = CLXM ^ clxm_transition_map ^ trm_accepting_states,
- find_next_state(char__to_int(Char), Rows ^ elem(CurrentState), NextState),
+ lex__lexeme__find_next_state(char__to_int(Char), Rows ^ elem(CurrentState), NextState),
IsAccepting = bitmap__get(AcceptingStates, NextState).
%------------------------------------------------------------------------------%
@@ -170,7 +178,7 @@
find_next_state(Byte, ByteTransitions, State) :-
Lo = array__min(ByteTransitions),
Hi = array__max(ByteTransitions),
- find_next_state_0(Lo, Hi, Byte, ByteTransitions, State).
+ lex__lexeme__find_next_state_0(Lo, Hi, Byte, ByteTransitions, State).
@@ -182,7 +190,7 @@
ByteTransition = ByteTransitions ^ elem(Lo),
( if ByteTransition ^ btr_byte = Byte
then State = ByteTransition ^ btr_state
- else find_next_state_0(Lo + 1, Hi, Byte, ByteTransitions, State)
+ else lex__lexeme__find_next_state_0(Lo + 1, Hi, Byte, ByteTransitions, State)
).
%------------------------------------------------------------------------------%
Index: lex.m
===================================================================
RCS file: /usr/local/dev/cvs/rat/rationality/logic/mercury/extras/lex/lex.m,v
retrieving revision 1.1
retrieving revision 1.2
diff -u -r1.1 -r1.2
--- lex.m 2001/07/13 09:33:53 1.1
+++ lex.m 2001/07/26 08:07:11 1.2
@@ -2,15 +2,20 @@
% lex.m
% Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
% Sun Aug 20 09:08:46 BST 2000
+% THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
+% BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
+% BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
+% Thu Jul 26 07:45:47 UTC 2001
+% Copyright (C) 2001 The Rationalizer Intelligent Software AG
+% The changes made by Rationalizer are contributed under the terms
+% of the GNU General Public License - see the file COPYING in the
+% Mercury Distribution.
+%
% vim: ts=4 sw=4 et tw=0 wm=0 ff=unix
%
% This module puts everything together, compiling a list of lexemes
% into state machines and turning the input stream into a token stream.
%
-% THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
-% BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
-% BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
-%
%----------------------------------------------------------------------------- %
:- module lex.
@@ -24,31 +29,36 @@
:- type annotated_lexeme(Token)
== lexeme(annotated_token(Token)).
+:- type token_creator(Token) == (func(string)=Token).
+:- inst token_creator == (func(in)=out is det).
+
:- type lexeme(Token)
---> lexeme(
lxm_token :: Token,
lxm_regexp :: regexp
).
+:- inst lexeme(Inst) == bound(lexeme(Inst,ground)).
:- type annotated_token(Token)
- ---> noval(Token) % Just return ok(Token) on success.
- ; value(Token) % Return ok(Token, String) on success.
- ; ignore. % Just skip over these tokens.
+ ---> t(token_creator(Token)) % Return ok(apply(TokenCreator, Match))
+ % on success.
+ ; noval(Token) % Just return ok(Token) on success.
+ ; ignore. % Just skip over these tokens.
+:- inst annotated_token == bound( t(token_creator); noval(ground); ignore).
+
:- type lexer(Token, Source).
-:- inst lexer
- == bound(lexer(ground, read_pred)).
+:- inst lexer == bound(lexer(ground, read_pred)).
+
:- type lexer_state(Token, Source).
:- type lexer_result(Token)
- ---> ok(Token) % Noval token matched.
- ; ok(Token, string) % Value token matched.
+ ---> ok(Token) % Token matched.
; eof % End of input.
; error(int). % No matches for string at this offset.
-
:- type offset
== int. % Byte offset into the source data.
@@ -184,17 +194,19 @@
lexi_buf_state :: buf_state(Source)
).
:- inst lexer_instance
- == bound(lexer_instance(ground, ground, ground, buf_state)).
-
-
+ == bound(lexer_instance(live_lexeme_list, live_lexeme_list,
+ winner, buf_state)).
:- type live_lexeme(Token)
== compiled_lexeme(annotated_token(Token)).
+:- inst live_lexeme == compiled_lexeme(annotated_token).
+:- inst live_lexeme_list == list__list_skel(live_lexeme).
:- type winner(Token)
== maybe(pair(annotated_token(Token), offset)).
+:- inst winner == bound( yes(bound(annotated_token - ground)); no).
%----------------------------------------------------------------------------- %
@@ -293,7 +305,7 @@
:- pred process_any_winner(lexer_result(Tok), winner(Tok),
lexer_instance(Tok, Src), lexer_instance(Tok, Src),
buf_state(Src), buf, buf, Src, Src).
-:- mode process_any_winner(out, in,
+:- mode process_any_winner(out, in(winner),
in(lexer_instance), out(lexer_instance),
in(buf_state), array_di, array_uo, di, uo) is det.
@@ -306,14 +318,15 @@
^ lexi_current_winner := no)
^ lexi_buf_state := buf__commit(BufState1)),
(
- ATok = noval(Token),
- Result = ok(Token),
+ ATok = t(TokenCreator),
+ Match = buf__string_to_cursor(BufState1, Buf),
+ Result = ok(apply(TokenCreator, Match)),
Instance = Instance1,
Buf = Buf0,
Src = Src0
;
- ATok = value(Token),
- Result = ok(Token, buf__string_to_cursor(BufState1, Buf)),
+ ATok = noval(Token),
+ Result = ok(Token),
Instance = Instance1,
Buf = Buf0,
Src = Src0
@@ -353,12 +366,13 @@
% Return the token and set things up so that we return
% eof next.
(
+ ATok = t(TokenCreator),
+ Match = buf__string_to_cursor(BufState, Buf),
+ Result = ok(apply(TokenCreator, Match))
+ ;
ATok = noval(Token),
Result = ok(Token)
;
- ATok = value(Token),
- Result = ok(Token, buf__string_to_cursor(BufState, Buf))
- ;
ATok = ignore,
Result = eof
)
@@ -376,7 +390,9 @@
:- pred advance_live_lexemes(char, offset,
list(live_lexeme(Token)), list(live_lexeme(Token)),
winner(Token), winner(Token)).
-:- mode advance_live_lexemes(in, in, in, out, in, out) is det.
+:- mode advance_live_lexemes(in, in, in(live_lexeme_list),
+ out(live_lexeme_list),
+ in(winner), out(winner)) is det.
advance_live_lexemes(_Char, _Offset, [], [], Winner, Winner).
@@ -385,7 +401,7 @@
State0 = L0 ^ clxm_state,
ATok = L0 ^ clxm_token,
- ( if next_state(L0, State0, Char, State, IsAccepting) then
+ ( if lex__next_state(L0, State0, Char, State, IsAccepting) then
(
IsAccepting = no,
@@ -406,14 +422,16 @@
:- pred live_lexeme_in_accepting_state(list(live_lexeme(Tok)),
annotated_token(Tok)).
-:- mode live_lexeme_in_accepting_state(in, out) is semidet.
+:- mode live_lexeme_in_accepting_state(in(live_lexeme_list),
+ out(annotated_token)) is semidet.
live_lexeme_in_accepting_state([L | Ls], Token) :-
- ( if in_accepting_state(L)
+ ( if lex__in_accepting_state(L)
then Token = L ^ clxm_token
else live_lexeme_in_accepting_state(Ls, Token)
).
+
%----------------------------------------------------------------------------- %
%----------------------------------------------------------------------------- %
@@ -556,3 +574,41 @@
%----------------------------------------------------------------------------- %
%----------------------------------------------------------------------------- %
+
+:- import_module bitmap.
+:- pred in_accepting_state(live_lexeme(T)).
+:- mode in_accepting_state(in(live_lexeme)) is semidet.
+in_accepting_state(CLXM) :-
+ bitmap__is_set(
+ CLXM ^ clxm_transition_map ^ trm_accepting_states, CLXM ^ clxm_state
+ ).
+
+
+:- pred next_state(live_lexeme(T), state_no, char, state_no, bool).
+:- mode next_state(in(live_lexeme), in, in, out, out) is semidet.
+
+next_state(CLXM, CurrentState, Char, NextState, IsAccepting) :-
+ Rows = CLXM ^ clxm_transition_map ^ trm_rows,
+ AcceptingStates = CLXM ^ clxm_transition_map ^ trm_accepting_states,
+ lex__find_next_state(char__to_int(Char), Rows ^ elem(CurrentState), NextState),
+ IsAccepting = bitmap__get(AcceptingStates, NextState).
+
+:- pred find_next_state(int, array(byte_transition), state_no).
+:- mode find_next_state(in, in, out) is semidet.
+
+find_next_state(Byte, ByteTransitions, State) :-
+ Lo = array__min(ByteTransitions),
+ Hi = array__max(ByteTransitions),
+ lex__find_next_state_0(Lo, Hi, Byte, ByteTransitions, State).
+
+
+:- pred find_next_state_0(int, int, int, array(byte_transition), state_no).
+:- mode find_next_state_0(in, in, in, in, out) is semidet.
+
+find_next_state_0(Lo, Hi, Byte, ByteTransitions, State) :-
+ Lo =< Hi,
+ ByteTransition = ByteTransitions ^ elem(Lo),
+ ( if ByteTransition ^ btr_byte = Byte
+ then State = ByteTransition ^ btr_state
+ else lex__find_next_state_0(Lo + 1, Hi, Byte, ByteTransitions, State)
+ ).
Index: samples/demo.m
===================================================================
RCS file: /usr/local/dev/cvs/rat/rationality/logic/mercury/extras/lex/samples/demo.m,v
retrieving revision 1.1
retrieving revision 1.2
diff -u -r1.1 -r1.2
--- samples/demo.m 2001/07/13 09:33:56 1.1
+++ samples/demo.m 2001/07/26 08:07:46 1.2
@@ -1,12 +1,17 @@
%----------------------------------------------------------------------------- %
% demo.m
-% Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
% Sun Aug 20 18:11:42 BST 2000
-% vim: ts=4 sw=4 et tw=0 wm=0 ff=unix ft=mercury
-%
+% Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
% THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
% BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
% BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
+% Thu Jul 26 07:45:47 UTC 2001
+% Copyright (C) 2001 The Rationalizer Intelligent Software AG
+% The changes made by Rationalizer are contributed under the terms
+% of the GNU General Public License - see the file COPYING in the
+% Mercury Distribution.
+%
+% vim: ts=4 sw=4 et tw=0 wm=0 ff=unix ft=mercury
%
%----------------------------------------------------------------------------- %
@@ -66,48 +71,58 @@
%----------------------------------------------------------------------------- %
:- type token
- ---> noun
- ; comment
- ; integer
- ; real
- ; verb
- ; conj
- ; prep
+ ---> noun(string)
+ ; comment(string)
+ ; integer(int)
+ ; real(float)
+ ; verb(string)
+ ; conj(string)
+ ; prep(string)
; punc
.
:- func lexemes = list(annotated_lexeme(token)).
lexemes = [
- lexeme(value(comment),
- (atom('%') >> junk)),
+% lexeme(ignore, (atom('%') >> junk)),
+ lexeme( t(func(Match) = comment(Match)), (atom('%') >> junk)),
+% lexeme( t(comment), (atom('%') >> junk)),
- lexeme(value(integer),
+ lexeme( t(func(Match) = integer(string__det_to_int(Match))),
(signed_int)),
- lexeme(value(real),
- (real)),
+ lexeme( t(func(Match) = real(det_string_to_float(Match))), (real)),
- lexeme(value(noun), str("cat")),
- lexeme(value(noun), str("dog")),
- lexeme(value(noun), str("rat")),
- lexeme(value(noun), str("mat")),
+ lexeme( t(func(Match) = noun(Match)), str("cat")),
+ lexeme( t(func(Match) = noun(Match)), str("dog")),
+ lexeme( t(func(Match) = noun(Match)), str("rat")),
+ lexeme( t(func(Match) = noun(Match)), str("mat")),
- lexeme(value(verb),
+ lexeme( t(func(Match) = verb(Match)),
(str("sat") \/ str("caught") \/ str("chased"))),
- lexeme(value(conj),
+ lexeme( t(func(Match) = conj(Match)),
(str("and") \/ str("then"))),
- lexeme(value(prep),
+ lexeme( t(func(Match) = prep(Match)),
(str("the") \/ str("it") \/ str("them") \/ str("to") \/ str("on"))),
lexeme(noval(punc),
(any("~!@#$%^&*()_+`-={}|[]\\:"";'<>?,./"))),
- lexeme(ignore,
- whitespace)
+ lexeme(ignore, whitespace)
].
+:- func det_string_to_float(string) = float.
+
+:- import_module require.
+
+det_string_to_float(String) = Float :-
+ string__to_float(String,Float)
+ ;
+ error("Wrong regular expression chosen for token `real'").
+
%----------------------------------------------------------------------------- %
%----------------------------------------------------------------------------- %
+
+