[m-rev.] For review: lex and moose changed
Holger Krug
hkrug at rationalizer.com
Mon Aug 6 02:55:04 AEST 2001
Here comes the implementation of what was discussed in this thread last
week.
In the following points I took the liberty to deviate from what was
discussed:
* I did not change the operators `\/' and `/\' to `or' and `++',
because in my view the operators `\/' and `/\' greatly enhance
the readability of regular expressions.
* I did not remove the prefix `lxr_'. It is necessary because one
field is named `lxr_instance'.
* I changed the determinism of `lex__read/3' to `cc_multi'. This was
necessary to allow the user to throw exceptions within the
token functions. According to Fergus' recommendations in
mercury-users/mercury-users.0011/006{2,4}.html it would not be sound
to use `promise_only_solution' in a situation like this, because the
functions throwing the exceptions to be caught are defined by the
user.
* The tokens to be ignored can be given by a predicate of type `pred(Token)',
not, as proposed by Ralph, by a value of type `maybe(Token)'. (See the
sketch below.)
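For illustration, here is a sketch of the new interface, assembled from
the example in the updated README below (the `token' type and the
`ignore_space' predicate are taken from there):

    :- type token
        --->    comment
        ;       id(string)
        ;       num(int)
        ;       space.

    :- pred ignore_space(token::in) is semidet.
    ignore_space(space).

    Lexemes = [
        ('%' >> *(dot)) - return(comment),
        identifier      - (func(Match) = id(Match)),
        signed_int      - (func(Match) = num(string__det_to_int(Match))),
        whitespace      - return(space)
    ],
    Lexer = lex__init(Lexemes, lex__read_from_stdin, ignore_space)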
The CVS log message should be like:
* the value contained in a token may now belong to an arbitrary
Mercury type, not only `string' as before
* lexemes may now contain a function to compute an arbitrary token
value from the matched string
* the lexeme definition by the user is now more involved,
because the user has to supply higher-order functions
* the advantage is that no postprocessing is needed to evaluate
the string, hence the interface to the consumer of the tokens is
cleaner
* determinism of lex__read/3 changed from `det' to `cc_multi' to allow
the user to throw a string-typed exception inside a token function,
from which an error token is formed. The exception is caught in
the body of lex__process_any_winner/9. Following
mercury-users/mercury-users.0011/0062.html and
mercury-users/mercury-users.0011/0064.html it would not be sound to use
`promise_only_solution', because the exception is user defined and
not controlled inside lex. The user might well provide a token
function which throws more than one string-typed exception. (A short
sketch of this follows after this list.)
* function lex__init/3 added, which takes as third argument a predicate
succeeding on exactly those tokens which are to be ignored
* typeclass regexp(T) added; the definition of the type `regexp' is now
hidden from the user
* function `->' added, which is a useful wrapper when defining lexemes
* for more information about these changes see the mercury-reviews
thread starting at mercury-reviews/mercury-reviews.0107/0378.html and
continued at mercury-reviews/mercury-reviews.0108/0028.html
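To make the exception story concrete, here is a sketch along the lines
of samples/demo.m below (the `noun' token, the lexeme and the read loop
are taken from the diff):

    % A token function may throw a string-typed exception; lex catches
    % it in process_any_winner and turns it into an error/2 result:
    re("pragma") - (func(Match) = noun(Match) :-
        throw("No pragma, please, I am not Mercury !")),

    % The consumer sees it as an ordinary lexing error:
    lex__read(Result, State0, State1),
    (   Result = ok(Token), ...
    ;   Result = error(Message, Offset), ...
    ;   Result = eof, ...
    )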
The diff is appended.
The corresponding diff to moose follows soon. I still have to change
the determinism to `cc_multi'.
--
Holger Krug
hkrug at rationalizer.com
-------------- next part --------------
Index: README
===================================================================
RCS file: /home/mercury1/repository/mercury/extras/lex/README,v
retrieving revision 1.1
diff -u -r1.1 README
--- README 2001/02/21 16:29:36 1.1
+++ README 2001/08/05 16:36:40
@@ -1,12 +1,14 @@
lex 1.0 (very alpha)
Fri Aug 25 17:54:28 2000
Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
-
THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
-
-
+Sun Aug 5 16:15:27 UTC 2001
+Copyright (C) 2001 The Rationalizer Intelligent Software AG
+ The changes made by Rationalizer are contributed under the terms
+ of the GNU Free Documentation License - see the file COPYING in the
+ Mercury Distribution.
This package defines a lexer for Mercury. There is plenty of scope
for optimization, however it is reasonably efficient and does provide
@@ -23,25 +25,34 @@
:- type token
---> comment
- ; id
- ; num.
+ ; id(string)
+ ; num(int)
+ ; space.
3. Set up a list of annotated_lexemes.
Lexemes = [
- lexeme(noval(comment), (atom('%'), star(dot))),
- lexeme(value(id), identifier),
- lexeme(ignore, whitespace)
+ ('%' >> *(dot)) - return(comment),
+ identifier - (func(Match) = id(Match)),
+ signed_int - (func(Match) = num(string__det_to_int(Match))),
+ whitespace - return(space)
]
-noval tokens are simply identified;
-value tokens are identified and returned with the string matched;
-ignore regexps are simply passed over.
+Lexemes are pairs: the first entry is a regular expression, and the
+second entry is a token_creator, i.e. a function that computes a token
+from the string matched by the regular expression forming the first
+part of the lexeme.
4. Set up a lexer with an appropriate read predicate (see the buf module).
Lexer = lex__init(Lexemes, lex__read_from_stdin)
+ or:
+
+ :- pred ignore_space(token::in) is semidet.
+ ignore_space(space).
+ Lexer = lex__init(Lexemes, lex__read_from_stdin, ignore_space)
+
5. Obtain a live lexer state.
State0 = lex__start(Lexer, IO0)
@@ -49,12 +60,21 @@
6. Use it to lex the input stream.
lex__read(Result, State0, State1),
- ( Result = ok(NoValToken), ...
- ; Result = ok(ValueToken, String), ...
- ; Result = error(OffsetInInputStream), ...
+ ( Result = ok(Token), ...
+ ; Result = error(Message, OffsetInInputStream), ...
; Result = eof, ...
)
+ Attention: The result type of lex__read is io__read_result(token).
+ io__read_result is documented in the library file io.m as:
+ :- type io__read_result(T) ---> ok(T)
+ ; eof
+ ; error(string, int).
+ % error message, line number
+ In contrast to this, the `int' lex returns in the case of an error
+ is not a line number but a character offset.
+ Hence be careful when processing lex errors.
+
7. If you need to manipulate the source object, you can.
lex__manipulate_source(io__print("Not finished yet?"), State1, State2)
@@ -69,7 +89,12 @@
and the option to write out a compilable source file for the lexer.
+OPPORTUNITIES FOR MODULARIZATION
+1. Remove the regexp functionality from lex and put it into a distinct
+ regexp library.
+
+
OPPORTUNITIES FOR OPTIMIZATION
1. Move from chars to bytes.
@@ -79,3 +104,18 @@
is decided by the first byte read in on a lexing pass.
4. Implement state machine minimization (may or may not be worthwhile.)
+
+FEATURES TO ADD:
+
+1. Symbol table management (additional parameters for the user-defined
+ predicates containing the symbol table before and after processing
+ a lexeme)
+2. func (string) = regexp, where the function parameter contains a
+ regexp definition in a form like that used in languages such as
+ Perl, awk, etc.
+3. line number as part of the offset
+4. the lexer-result type is changed to:
+ ok(token, token_position)
+ where token_position contains information like: offset, line number,
+ offset in line, length in chars, length in lines, etc.
+5. interface to editors (emacs, vim) to allow moving the editor cursor
+ to the lexer position
Index: lex.buf.m
===================================================================
RCS file: /home/mercury1/repository/mercury/extras/lex/lex.buf.m,v
retrieving revision 1.1
diff -u -r1.1 lex.buf.m
--- lex.buf.m 2001/02/21 16:29:36 1.1
+++ lex.buf.m 2001/08/05 16:36:40
@@ -94,7 +94,7 @@
:- interface.
-:- import_module int, array, char, bool, string, io.
+:- import_module array, char, bool, string.
@@ -171,7 +171,6 @@
:- implementation.
:- import_module exception.
-
% The amount the buffer is grown by if (a) more space is
Index: lex.lexeme.m
===================================================================
RCS file: /home/mercury1/repository/mercury/extras/lex/lex.lexeme.m,v
retrieving revision 1.1
diff -u -r1.1 lex.lexeme.m
--- lex.lexeme.m 2001/02/21 16:29:36 1.1
+++ lex.lexeme.m 2001/08/05 16:36:40
@@ -1,16 +1,21 @@
%-----------------------------------------------------------------------------
% lex.lexeme.m
-% Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
% Sat Aug 19 08:22:32 BST 2000
+% Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
+% THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
+% BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
+% BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
+% Thu Jul 26 07:45:47 UTC 2001
+% Copyright (C) 2001 The Rationalizer Intelligent Software AG
+% The changes made by Rationalizer are contributed under the terms
+% of the GNU Lesser General Public License.
+%
% vim: ts=4 sw=4 et tw=0 wm=0 ff=unix
%
% A lexeme combines a token with a regexp. The lexer compiles
% lexemes and returns the longest successful parse in the input
% stream or an error if no match occurs.
%
-% THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
-% BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
-% BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
%
%----------------------------------------------------------------------------- %
@@ -22,15 +27,17 @@
:- type compiled_lexeme(T)
---> compiled_lexeme(
- clxm_token :: T,
- clxm_state :: state_no,
- clxm_transition_map :: transition_map
+ token :: token_creator(T),
+ state :: state_no,
+ transition_map :: transition_map
).
+:- inst compiled_lexeme(Inst) --->
lex__lexeme__compiled_lexeme(Inst, ground, ground).
:- type transition_map
---> transition_map(
- trm_accepting_states :: bitmap,
- trm_rows :: array(row)
+ accepting_states :: bitmap,
+ rows :: array(row)
).
% A transition row is an array of byte_transitions.
@@ -59,12 +66,12 @@
% an accepting state_no.
%
:- pred next_state(compiled_lexeme(T), state_no, char, state_no, bool).
-:- mode next_state(in, in, in, out, out) is semidet.
+:- mode next_state(in(live_lexeme), in, in, out, out) is semidet.
% Succeeds iff a compiled_lexeme is in an accepting state_no.
%
-:- pred in_accepting_state(compiled_lexeme(T)).
-:- mode in_accepting_state(in) is semidet.
+:- pred in_accepting_state(live_lexeme(T)).
+:- mode in_accepting_state(in(live_lexeme)) is semidet.
%----------------------------------------------------------------------------- %
%----------------------------------------------------------------------------- %
@@ -77,7 +84,7 @@
%------------------------------------------------------------------------------%
compile_lexeme(Lexeme) = CompiledLexeme :-
- Lexeme = lexeme(Token, RegExp),
+ Lexeme = (RegExp - TokenCreator),
NFA = remove_null_transitions(regexp_to_NFA(RegExp)),
DFA = convert_NFA_to_DFA(NFA),
StartState = DFA ^ smc_start_state,
@@ -87,7 +94,7 @@
Accepting = set_accepting_states(StopStates, bitmap__new(N, no)),
Rows = array(set_up_rows(0, N, Transitions)),
TransitionMap = transition_map(Accepting, Rows),
- CompiledLexeme = compiled_lexeme(Token, StartState, TransitionMap).
+ CompiledLexeme = compiled_lexeme(TokenCreator, StartState, TransitionMap).
%------------------------------------------------------------------------------%
@@ -157,9 +164,9 @@
%------------------------------------------------------------------------------%
next_state(CLXM, CurrentState, Char, NextState, IsAccepting) :-
- Rows = CLXM ^ clxm_transition_map ^ trm_rows,
- AcceptingStates = CLXM ^ clxm_transition_map ^ trm_accepting_states,
- find_next_state(char__to_int(Char), Rows ^ elem(CurrentState), NextState),
+ Rows = CLXM ^ transition_map ^ rows,
+ AcceptingStates = CLXM ^ transition_map ^ accepting_states,
+ lex__lexeme__find_next_state(char__to_int(Char), Rows ^ elem(CurrentState), NextState),
IsAccepting = bitmap__get(AcceptingStates, NextState).
%------------------------------------------------------------------------------%
@@ -170,7 +177,7 @@
find_next_state(Byte, ByteTransitions, State) :-
Lo = array__min(ByteTransitions),
Hi = array__max(ByteTransitions),
- find_next_state_0(Lo, Hi, Byte, ByteTransitions, State).
+ lex__lexeme__find_next_state_0(Lo, Hi, Byte, ByteTransitions, State).
@@ -182,14 +189,14 @@
ByteTransition = ByteTransitions ^ elem(Lo),
( if ByteTransition ^ btr_byte = Byte
then State = ByteTransition ^ btr_state
- else find_next_state_0(Lo + 1, Hi, Byte, ByteTransitions, State)
+ else lex__lexeme__find_next_state_0(Lo + 1, Hi, Byte, ByteTransitions, State)
).
%------------------------------------------------------------------------------%
in_accepting_state(CLXM) :-
bitmap__is_set(
- CLXM ^ clxm_transition_map ^ trm_accepting_states, CLXM ^ clxm_state
+ CLXM ^ transition_map ^ accepting_states, CLXM ^ state
).
%------------------------------------------------------------------------------%
Index: lex.m
===================================================================
RCS file: /home/mercury1/repository/mercury/extras/lex/lex.m,v
retrieving revision 1.1
diff -u -r1.1 lex.m
--- lex.m 2001/02/21 16:29:36 1.1
+++ lex.m 2001/08/05 16:36:40
@@ -2,15 +2,19 @@
% lex.m
% Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
% Sun Aug 20 09:08:46 BST 2000
+% THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
+% BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
+% BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
+% Thu Jul 26 07:45:47 UTC 2001
+% Copyright (C) 2001 The Rationalizer Intelligent Software AG
+% The changes made by Rationalizer are contributed under the terms
+% of the GNU Lesser General Public License.
+%
% vim: ts=4 sw=4 et tw=0 wm=0 ff=unix
%
% This module puts everything together, compiling a list of lexemes
% into state machines and turning the input stream into a token stream.
%
-% THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
-% BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
-% BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
-%
%----------------------------------------------------------------------------- %
:- module lex.
@@ -18,36 +22,19 @@
:- interface.
:- import_module std_util, string, char, list, io.
-
-
-:- type annotated_lexeme(Token)
- == lexeme(annotated_token(Token)).
+:- type token_creator(Token) == (func(string)=Token).
+:- inst token_creator == (func(in)=out is det).
-:- type lexeme(Token)
- ---> lexeme(
- lxm_token :: Token,
- lxm_regexp :: regexp
- ).
+:- type lexeme(Token) == pair(regexp, token_creator(Token)).
-:- type annotated_token(Token)
- ---> noval(Token) % Just return ok(Token) on success.
- ; value(Token) % Return ok(Token, String) on success.
- ; ignore. % Just skip over these tokens.
+:- inst lexeme(Inst) ---> ground - Inst.
:- type lexer(Token, Source).
-:- inst lexer
- == bound(lexer(ground, read_pred)).
-
-:- type lexer_state(Token, Source).
+:- inst lexer ---> lex__lexer(ground, lex__ignore_pred, lex__read_pred).
-:- type lexer_result(Token)
- ---> ok(Token) % Noval token matched.
- ; ok(Token, string) % Value token matched.
- ; eof % End of input.
- ; error(int). % No matches for string at this offset.
-
+:- type lexer_state(Token, Source).
:- type offset
== int. % Byte offset into the source data.
@@ -65,28 +52,44 @@
== pred(offset, read_result, T, T).
:- inst read_pred
== ( pred(in, out, di, uo) is det ).
-
+ % ignore_pred(Token): if it succeeds, Token must be ignored
+ %
+:- type ignore_pred(Tok)
+ == pred(Tok).
+:- inst ignore_pred
+ == ( pred(in) is semidet ).
% The type of regular expressions.
%
-:- type regexp
- ---> null % The empty regexp
- ; atom(char) % Match a single char
- ; (regexp >> regexp) % Concatenation
- ; (regexp \/ regexp) % Alternation
- ; star(regexp) % Kleene closure
- .
+:- type regexp.
+ % The typeclass for types having a natural converter to regexp's
+ %
+:- typeclass regexp(T) where [
+ func re(T) = regexp
+].
+
+ % Handling regexp's based on the typeclass regexp(T)
+ %
+:- func null = regexp.
+:- func T1 >> T2 = regexp <= (regexp(T1), regexp(T2)).
+:- func T1 \/ T2 = regexp <= (regexp(T1), regexp(T2)).
+:- func *(T1) = regexp <= regexp(T1).
+ % Some instances of typeclass regexp(T)
+ %
+:- instance regexp(regexp).
+:- instance regexp(char).
+:- instance regexp(string).
% Some basic non-primitive regexps.
%
-:- func str(string) = regexp. % str("abc") = atom(a) >> atom(b) >> atom(c)
-:- func any(string) = regexp. % any("abc") = atom(a) \/ atom(b) \/ atom(c)
+:- func str(string) = regexp. % str("abc") = re(a) >> re(b) >> re(c)
+:- func any(string) = regexp. % any("abc") = re(a) \/ re(b) \/ re(c)
:- func anybut(string) = regexp.% anybut("abc") is complement of any("abc")
-:- func opt(regexp) = regexp. % opt(R) = R \/ null
-:- func plus(regexp) = regexp. % plus(R) = R \/ star(R)
+:- func ?(regexp) = regexp. % ?(R) = R \/ null
+:- func +(regexp) = regexp. % +(R) = R \/ *(R)
% Some useful single-char regexps.
%
@@ -105,23 +108,38 @@
% Some useful compound regexps.
%
-:- func nat = regexp. % nat = plus(digit)
-:- func signed_int = regexp. % signed_int = opt(any("+-")) >> nat
+:- func nat = regexp. % nat = +(digit)
+:- func signed_int = regexp. % signed_int = ?(any("+-")) >> nat
:- func real = regexp. % real = \d+((.\d+([eE]int)?)|[eE]int)
-:- func identifier = regexp. % identifier = identstart >> star(ident)
-:- func whitespace = regexp. % whitespace = star(wspc)
-:- func junk = regexp. % junk = star(dot)
+:- func identifier = regexp. % identifier = identstart >> *(ident)
+:- func whitespace = regexp. % whitespace = *(wspc)
+:- func junk = regexp. % junk = *(dot)
+ % Utility function to return noval tokens.
+ % Use it in the form `return(my_token)' inside a lexeme definition.
+:- func return(T,S) = T.
+
+ % Utility operator to create lexemes.
+:- func (T1 -> token_creator(Tok)) = pair(regexp,token_creator(Tok))
+ <= regexp(T1).
% Construct a lexer from which we can generate running
% instances.
%
-:- func init(list(annotated_lexeme(Tok)), read_pred(Src)) = lexer(Tok, Src).
+:- func init(list(lexeme(Tok)), read_pred(Src)) = lexer(Tok, Src).
:- mode init(in, in(read_pred)) = out(lexer) is det.
- % Handy read predicates.
+ % Construct a lexer from which we can generate running
+ % instances. If we construct a lexer with init/4, we
+ % can additionally ignore specific tokens.
%
+:- func init(list(lexeme(Tok)), read_pred(Src), ignore_pred(Tok)) =
+ lexer(Tok, Src).
+:- mode init(in, in(read_pred), in(ignore_pred)) = out(lexer) is det.
+
+ % Handy read predicates.
+ %
:- pred read_from_stdin(offset, read_result, io__state, io__state).
:- mode read_from_stdin(in, out, di, uo) is det.
@@ -135,8 +153,8 @@
:- func start(lexer(Tok, Src), Src) = lexer_state(Tok, Src).
:- mode start(in(lexer), di) = uo is det.
-:- pred read(lexer_result(Tok), lexer_state(Tok, Src), lexer_state(Tok, Src)).
-:- mode read(out, di, uo) is det.
+:- pred read(io__read_result(Tok), lexer_state(Tok, Src), lexer_state(Tok, Src)).
+:- mode read(out, di, uo) is cc_multi.
% Stop a running instance of a lexer and retrieve the input source.
%
@@ -171,6 +189,7 @@
:- type lexer(Token, Source)
---> lexer(
lex_compiled_lexemes :: list(live_lexeme(Token)),
+ lex_ignore_pred :: ignore_pred(Token),
lex_buf_read_pred :: read_pred(Source)
).
@@ -178,27 +197,45 @@
:- type lexer_instance(Token, Source)
---> lexer_instance(
- lexi_init_lexemes :: list(live_lexeme(Token)),
- lexi_live_lexemes :: list(live_lexeme(Token)),
- lexi_current_winner :: winner(Token),
- lexi_buf_state :: buf_state(Source)
+ init_lexemes :: list(live_lexeme(Token)),
+ live_lexemes :: list(live_lexeme(Token)),
+ current_winner :: winner(Token),
+ buf_state :: buf_state(Source),
+ ignore_pred :: ignore_pred(Token)
).
:- inst lexer_instance
- == bound(lexer_instance(ground, ground, ground, buf_state)).
-
-
+ ---> lex__lexer_instance(lex__live_lexeme_list,
+ lex__live_lexeme_list,
+ lex__winner,
+ lex__buf__buf_state,
+ lex__ignore_pred).
:- type live_lexeme(Token)
- == compiled_lexeme(annotated_token(Token)).
+ == compiled_lexeme(Token).
+:- inst live_lexeme == compiled_lexeme(token_creator).
+:- inst live_lexeme_list == list__list_skel(live_lexeme).
:- type winner(Token)
- == maybe(pair(annotated_token(Token), offset)).
+ == maybe(pair(token_creator(Token), offset)).
+:- inst winner ---> yes(bound(token_creator - ground))
+ ; no.
+
+%----------------------------------------------------------------------------- %
+return(Token, _) = Token.
+
+%----------------------------------------------------------------------------- %
+(R1 -> TC) = (re(R1) - TC).
%----------------------------------------------------------------------------- %
+
+init(Lexemes, BufReadPred) =
+ init(Lexemes, BufReadPred, (pred(_Tok::in) is semidet :-
+ std_util__semidet_fail)).
-init(Lexemes, BufReadPred) = lexer(CompiledLexemes, BufReadPred) :-
+init(Lexemes, BufReadPred, IgnorePred) =
+ lexer(CompiledLexemes, IgnorePred, BufReadPred) :-
CompiledLexemes = list__map(lexeme__compile_lexeme, Lexemes).
%----------------------------------------------------------------------------- %
@@ -215,7 +252,9 @@
init_lexer_instance(Lexer, Instance, Buf) :-
buf__init(Lexer ^ lex_buf_read_pred, BufState, Buf),
InitLexemes = Lexer ^ lex_compiled_lexemes,
- Instance = lexer_instance(InitLexemes, InitLexemes, no, BufState).
+ IgnorePred = Lexer ^ lex_ignore_pred,
+ Instance = lexer_instance(InitLexemes, InitLexemes, no,
+ BufState, IgnorePred).
%----------------------------------------------------------------------------- %
@@ -231,19 +270,19 @@
-:- pred read_0(lexer_result(Tok),
+:- pred read_0(io__read_result(Tok),
lexer_instance(Tok, Src), lexer_instance(Tok, Src),
buf, buf, Src, Src).
:- mode read_0(out,
in(lexer_instance), out(lexer_instance),
- array_di, array_uo, di, uo) is det.
+ array_di, array_uo, di, uo) is cc_multi.
% Basically, just read chars from the buf and advance the live lexemes
% until we have a winner or hit an error (no parse).
%
read_0(Result, Instance0, Instance, Buf0, Buf, Src0, Src) :-
- BufState0 = Instance0 ^ lexi_buf_state,
+ BufState0 = Instance0 ^ buf_state,
buf__read(BufReadResult, BufState0, BufState1, Buf0, Buf1, Src0, Src1),
(
@@ -259,17 +298,17 @@
%----------------------------------------------------------------------------- %
-:- pred process_char(lexer_result(Tok), char,
+:- pred process_char(io__read_result(Tok), char,
lexer_instance(Tok, Src), lexer_instance(Tok, Src),
buf_state(Src), buf, buf, Src, Src).
:- mode process_char(out, in, in(lexer_instance), out(lexer_instance),
- in(buf_state), array_di, array_uo, di, uo) is det.
+ in(buf_state), array_di, array_uo, di, uo) is cc_multi.
process_char(Result, Char, Instance0, Instance,
BufState, Buf0, Buf, Src0, Src) :-
- LiveLexemes0 = Instance0 ^ lexi_live_lexemes,
- Winner0 = Instance0 ^ lexi_current_winner,
+ LiveLexemes0 = Instance0 ^ live_lexemes,
+ Winner0 = Instance0 ^ current_winner,
advance_live_lexemes(Char, buf__cursor_offset(BufState),
LiveLexemes0, LiveLexemes, Winner0, Winner),
@@ -282,44 +321,57 @@
LiveLexemes = [_ | _], % Still some open possibilities.
Instance1 = (((Instance0
- ^ lexi_live_lexemes := LiveLexemes)
- ^ lexi_current_winner := Winner)
- ^ lexi_buf_state := BufState),
+ ^ live_lexemes := LiveLexemes)
+ ^ current_winner := Winner)
+ ^ buf_state := BufState),
read_0(Result, Instance1, Instance, Buf0, Buf, Src0, Src)
).
%----------------------------------------------------------------------------- %
-:- pred process_any_winner(lexer_result(Tok), winner(Tok),
+:- pred process_any_winner(io__read_result(Tok), winner(Tok),
lexer_instance(Tok, Src), lexer_instance(Tok, Src),
buf_state(Src), buf, buf, Src, Src).
-:- mode process_any_winner(out, in,
+:- mode process_any_winner(out, in(winner),
in(lexer_instance), out(lexer_instance),
- in(buf_state), array_di, array_uo, di, uo) is det.
+ in(buf_state), array_di, array_uo, di, uo) is cc_multi.
-process_any_winner(Result, yes(ATok - Offset), Instance0, Instance,
+process_any_winner(Result, yes(TokenCreator - Offset), Instance0, Instance,
BufState0, Buf0, Buf, Src0, Src) :-
BufState1 = buf__rewind_cursor(Offset, BufState0),
Instance1 = ((( Instance0
- ^ lexi_live_lexemes := Instance0 ^ lexi_init_lexemes)
- ^ lexi_current_winner := no)
- ^ lexi_buf_state := buf__commit(BufState1)),
+ ^ live_lexemes := Instance0 ^ init_lexemes)
+ ^ current_winner := no)
+ ^ buf_state := buf__commit(BufState1)),
+ Match = buf__string_to_cursor(BufState1, Buf0),
+ try((pred(Tok::out) is det :- Tok = TokenCreator(Match)), TryRes),
(
- ATok = noval(Token),
- Result = ok(Token),
- Instance = Instance1,
- Buf = Buf0,
- Src = Src0
- ;
- ATok = value(Token),
- Result = ok(Token, buf__string_to_cursor(BufState1, Buf)),
- Instance = Instance1,
- Buf = Buf0,
- Src = Src0
+ TryRes = succeeded(TokenCreatorRes),
+ Token = TokenCreatorRes,
+ IgnorePred = Instance0 ^ ignore_pred,
+ (
+ IgnorePred(Token)
+ ->
+ read_0(Result, Instance1, Instance, Buf0, Buf, Src0, Src)
+ ;
+ Result = ok(Token),
+ Instance = Instance1,
+ Buf = Buf0,
+ Src = Src0
+ )
;
- ATok = ignore,
- read_0(Result, Instance1, Instance, Buf0, Buf, Src0, Src)
+ TryRes = exception(Exc),
+ (
+ std_util__univ_to_type(Exc,Mes)
+ ->
+ Result = error(Mes, Offset)
+ ;
+ rethrow(TryRes)
+ ),
+ Instance = Instance1,
+ Buf = Buf0,
+ Src = Src0
).
process_any_winner(Result, no, Instance0, Instance,
@@ -327,16 +379,16 @@
Start = buf__start_offset(BufState0),
BufState1 = buf__rewind_cursor(Start + 1, BufState0),
- Result = error(Start),
+ Result = error("input not matched by any regexp", Start),
Instance = ((( Instance0
- ^ lexi_live_lexemes :=
- Instance0 ^ lexi_init_lexemes)
- ^ lexi_current_winner := no)
- ^ lexi_buf_state := buf__commit(BufState1)).
+ ^ live_lexemes :=
+ Instance0 ^ init_lexemes)
+ ^ current_winner := no)
+ ^ buf_state := buf__commit(BufState1)).
%----------------------------------------------------------------------------- %
-:- pred process_eof(lexer_result(Tok),
+:- pred process_eof(io__read_result(Tok),
lexer_instance(Tok, Src), lexer_instance(Tok, Src),
buf_state(Src), buf).
:- mode process_eof(out, in(lexer_instance), out(lexer_instance),
@@ -346,44 +398,47 @@
( if
- live_lexeme_in_accepting_state(Instance0 ^ lexi_live_lexemes, ATok)
+ live_lexeme_in_accepting_state(Instance0 ^ live_lexemes, TokenCreator)
then
% Return the token and set things up so that we return
% eof next.
- (
- ATok = noval(Token),
- Result = ok(Token)
- ;
- ATok = value(Token),
- Result = ok(Token, buf__string_to_cursor(BufState, Buf))
- ;
- ATok = ignore,
- Result = eof
- )
+ % XXX exceptions thrown by the token creator are not caught here
+ Match = buf__string_to_cursor(BufState, Buf),
+ Token = TokenCreator(Match),
+ IgnorePred = Instance0 ^ ignore_pred,
+ (
+ IgnorePred(Token)
+ ->
+ Result = eof
+ ;
+ Result = ok(Token)
+ )
else
Result = eof
),
Instance = ((Instance0
- ^ lexi_live_lexemes := [])
- ^ lexi_buf_state := buf__commit(BufState)).
+ ^ live_lexemes := [])
+ ^ buf_state := buf__commit(BufState)).
%----------------------------------------------------------------------------- %
:- pred advance_live_lexemes(char, offset,
list(live_lexeme(Token)), list(live_lexeme(Token)),
winner(Token), winner(Token)).
-:- mode advance_live_lexemes(in, in, in, out, in, out) is det.
+:- mode advance_live_lexemes(in, in, in(live_lexeme_list),
+ out(live_lexeme_list),
+ in(winner), out(winner)) is det.
advance_live_lexemes(_Char, _Offset, [], [], Winner, Winner).
advance_live_lexemes(Char, Offset, [L0 | Ls0], Ls, Winner0, Winner) :-
- State0 = L0 ^ clxm_state,
- ATok = L0 ^ clxm_token,
+ State0 = L0 ^ state,
+ ATok = L0 ^ token,
( if next_state(L0, State0, Char, State, IsAccepting) then
@@ -395,7 +450,7 @@
Winner1 = yes(ATok - Offset)
),
advance_live_lexemes(Char, Offset, Ls0, Ls1, Winner1, Winner),
- Ls = [( L0 ^ clxm_state := State ) | Ls1]
+ Ls = [( L0 ^ state := State ) | Ls1]
else
@@ -405,15 +460,17 @@
%----------------------------------------------------------------------------- %
:- pred live_lexeme_in_accepting_state(list(live_lexeme(Tok)),
- annotated_token(Tok)).
-:- mode live_lexeme_in_accepting_state(in, out) is semidet.
+ token_creator(Tok)).
+:- mode live_lexeme_in_accepting_state(in(live_lexeme_list),
+ out(token_creator)) is semidet.
live_lexeme_in_accepting_state([L | Ls], Token) :-
- ( if in_accepting_state(L)
- then Token = L ^ clxm_token
+ ( if lex__lexeme__in_accepting_state(L)
+ then Token = L ^ token
else live_lexeme_in_accepting_state(Ls, Token)
).
+
%----------------------------------------------------------------------------- %
%----------------------------------------------------------------------------- %
@@ -471,6 +528,39 @@
Result = eof
).
+
+%----------------------------------------------------------------------------- %
+% The type of regular expressions.
+:- type regexp
+ ---> re_null % The empty regexp
+ ; re_atom(char) % Match a single char
+ ; re_concat(regexp,regexp) % Concatenation
+ ; re_or(regexp, regexp) % Alternation
+ ; re_star(regexp) % Kleene closure
+ .
+
+%----------------------------------------------------------------------------- %
+% Some instances of typeclass regexp(T)
+:- instance regexp(regexp) where [
+ re(RE) = RE
+].
+
+:- instance regexp(char) where [
+ re(C) = re_atom(C)
+].
+
+:- instance regexp(string) where [
+ re(S) = str(S)
+].
+
+
+%----------------------------------------------------------------------------- %
+% Basic primitive regexps.
+null = re_null.
+R1 >> R2 = re_concat(re(R1), re(R2)).
+R1 \/ R2 = re_or(re(R1), re(R2)).
+*(R1) = re_star(re(R1)).
+
%----------------------------------------------------------------------------- %
% Some basic non-primitive regexps.
@@ -480,7 +570,7 @@
else
L = string__length(S),
C = string__index_det(S, L - 1),
- R = str_foldr(func(Cx, Rx) = (atom(Cx) >> Rx), S, atom(C), L - 2)
+ R = str_foldr(func(Cx, Rx) = (re(Cx) >> Rx), S, re(C), L - 2)
).
any(S) = R :-
@@ -489,7 +579,7 @@
else
L = string__length(S),
C = string__index_det(S, L - 1),
- R = str_foldr(func(Cx, Rx) = (atom(Cx) \/ Rx), S, atom(C), L - 2)
+ R = str_foldr(func(Cx, Rx) = (re(Cx) \/ Rx), S, re(C), L - 2)
).
anybut(S0) = R :-
@@ -511,9 +601,9 @@
else str_foldr(Fn, S, Fn(string__index_det(S, I), X), I - 1)
).
-opt(R) = (R \/ null).
+?(R) = (R \/ null).
-plus(R) = (R >> star(R)).
++(R) = (R >> *(R)).
%----------------------------------------------------------------------------- %
% Some useful single-char regexps.
@@ -535,24 +625,24 @@
dot = anybut("\n").
alpha = (lower \/ upper).
alphanum = (alpha \/ digit).
-identstart = (alpha \/ atom('_')).
-ident = (alphanum \/ atom('_')).
-nl = atom('\n').
-tab = atom('\t').
-spc = atom(' ').
+identstart = (alpha \/ re('_')).
+ident = (alphanum \/ re('_')).
+nl = re('\n').
+tab = re('\t').
+spc = re(' ').
%----------------------------------------------------------------------------- %
% Some useful compound regexps.
-nat = plus(digit).
-signed_int = (opt(any("+-")) >> nat).
+nat = +(digit).
+signed_int = (?(any("+-")) >> nat).
real = (signed_int >> (
- (atom('.') >> nat >> opt(any("eE") >> signed_int)) \/
+ (re('.') >> nat >> ?(any("eE") >> signed_int)) \/
(any("eE") >> signed_int)
)).
-identifier = (identstart >> star(ident)).
-whitespace = star(wspc).
-junk = star(dot).
+identifier = (identstart >> *(ident)).
+whitespace = *(wspc).
+junk = *(dot).
%----------------------------------------------------------------------------- %
%----------------------------------------------------------------------------- %
Index: lex.regexp.m
===================================================================
RCS file: /home/mercury1/repository/mercury/extras/lex/lex.regexp.m,v
retrieving revision 1.1
diff -u -r1.1 lex.regexp.m
--- lex.regexp.m 2001/02/21 16:29:37 1.1
+++ lex.regexp.m 2001/08/05 16:36:40
@@ -1,16 +1,20 @@
%----------------------------------------------------------------------------- %
% lex.regexp.m
-% Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
% Fri Aug 18 06:43:09 BST 2000
+% Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
+% THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
+% BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
+% BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
+% Thu Jul 26 07:45:47 UTC 2001
+% Copyright (C) 2001 The Rationalizer Intelligent Software AG
+% The changes made by Rationalizer are contributed under the terms
+% of the GNU Lesser General Public License.
+%
% vim: ts=4 sw=4 et tw=0 wm=0 ff=unix
%
% Converts basic regular expressions into non-deterministic finite
% automata (NFAs).
%
-% THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
-% BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
-% BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
-%
%----------------------------------------------------------------------------- %
:- module lex__regexp.
@@ -53,20 +57,20 @@
% The primitive regexps.
-compile(X, null, Y, [null(X, Y)]) --> [].
+compile(X, re_null, Y, [null(X, Y)]) --> [].
-compile(X, atom(C), Y, [trans(X, C, Y)]) --> [].
+compile(X, re_atom(C), Y, [trans(X, C, Y)]) --> [].
-compile(X, (RA >> RB), Y, TsA ++ TsB) -->
+compile(X, re_concat(RA,RB), Y, TsA ++ TsB) -->
counter__allocate(Z),
compile(X, RA, Z, TsA),
compile(Z, RB, Y, TsB).
-compile(X, (RA \/ RB), Y, TsA ++ TsB) -->
+compile(X, re_or(RA,RB), Y, TsA ++ TsB) -->
compile(X, RA, Y, TsA),
compile(X, RB, Y, TsB).
-compile(X, star(R), Y, TsA ++ TsB) -->
+compile(X, re_star(R), Y, TsA ++ TsB) -->
compile(X, null, Y, TsA),
compile(X, R, X, TsB).
Index: samples/demo.m
===================================================================
RCS file: /home/mercury1/repository/mercury/extras/lex/samples/demo.m,v
retrieving revision 1.1
diff -u -r1.1 demo.m
--- samples/demo.m 2001/02/21 16:29:39 1.1
+++ samples/demo.m 2001/08/05 16:36:40
@@ -1,12 +1,17 @@
%----------------------------------------------------------------------------- %
% demo.m
-% Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
% Sun Aug 20 18:11:42 BST 2000
-% vim: ts=4 sw=4 et tw=0 wm=0 ff=unix ft=mercury
-%
+% Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
% THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
% BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
% BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
+% Thu Jul 26 07:45:47 UTC 2001
+% Copyright (C) 2001 The Rationalizer Intelligent Software AG
+% The changes made by Rationalizer are contributed under the terms
+% of the GNU General Public License - see the file COPYING in the
+% Mercury Distribution.
+%
+% vim: ts=4 sw=4 et tw=0 wm=0 ff=unix ft=mercury
%
%----------------------------------------------------------------------------- %
@@ -16,14 +21,14 @@
:- import_module io.
-:- pred main(io__state::di, io__state::uo) is det.
+:- pred main(io__state::di, io__state::uo) is cc_multi.
%----------------------------------------------------------------------------- %
%----------------------------------------------------------------------------- %
:- implementation.
-:- import_module char, string, exception, array, list.
+:- import_module char, string, exception, array, list, std_util, int.
:- import_module lex.
%----------------------------------------------------------------------------- %
@@ -38,10 +43,18 @@
Try me...
+I don't recognise any `pragma', but try it nevertheless, because I give
+you a nice error message. Mind you, I don't recognise any `module'
+either, but that error message is not so nice.
+
+... Now I'm even able to count numbers. It's great, isn't it? But not
+too big ones; somewhere between 10 and 20 I lose the thread. Do not
+expect too much of me.
+
"),
- { Lexer = lex__init(lexemes, lex__read_from_stdin) },
- call((pred(IO0::di, IO::uo) is det :-
+ { Lexer = lex__init(lexemes, lex__read_from_stdin, ignore_space) },
+ call((pred(IO0::di, IO::uo) is cc_multi :-
State0 = lex__start(Lexer, IO0),
tokenise_stdin(State0, State),
IO = lex__stop(State)
@@ -49,9 +62,14 @@
%----------------------------------------------------------------------------- %
+:- pred ignore_space(token::in) is semidet.
+ignore_space(space).
+
+%----------------------------------------------------------------------------- %
+
:- pred tokenise_stdin(lexer_state(token, io__state),
lexer_state(token, io__state)).
-:- mode tokenise_stdin(di, uo) is det.
+:- mode tokenise_stdin(di, uo) is cc_multi.
tokenise_stdin -->
lex__read(Result),
@@ -66,48 +84,67 @@
%----------------------------------------------------------------------------- %
:- type token
- ---> noun
- ; comment
- ; integer
- ; real
- ; verb
- ; conj
- ; prep
+ ---> noun(string)
+ ; comment(string)
+ ; integer(int)
+ ; real(float)
+ ; verb(string)
+ ; conj(string)
+ ; prep(string)
; punc
+ ; space
+ ; small_num(int)
.
-:- func lexemes = list(annotated_lexeme(token)).
+:- func lexemes = list(lexeme(token)).
lexemes = [
- lexeme(value(comment),
- (atom('%') >> junk)),
-
- lexeme(value(integer),
- (signed_int)),
-
- lexeme(value(real),
- (real)),
-
- lexeme(value(noun), str("cat")),
- lexeme(value(noun), str("dog")),
- lexeme(value(noun), str("rat")),
- lexeme(value(noun), str("mat")),
-
- lexeme(value(verb),
- (str("sat") \/ str("caught") \/ str("chased"))),
-
- lexeme(value(conj),
- (str("and") \/ str("then"))),
+ ("%" >> junk) - (func(Match) = comment(Match)) ,
+ (signed_int) - (func(Match) = integer(string__det_to_int(Match))),
+ (real) - (func(Match) = real(det_string_to_float(Match))) ,
+ re("cat") - (func(Match) = noun(Match)) ,
+ re("dog") - (func(Match) = noun(Match)) ,
+ % Using `->' instead of `-', it is not necessary to call `re':
+ "rat" -> (func(Match) = noun(Match)) ,
+ "mat" -> (func(Match) = noun(Match)) ,
+ ("sat" \/ "caught" \/ "chased") -
+ (func(Match) = verb(Match)) ,
+ ("and" \/ "then") -
+ (func(Match) = conj(Match)) ,
+ ("the" \/ "it" \/ "them" \/ "to" \/ "on") -
+ (func(Match) = prep(Match)) ,
+ any("~!@#$%^&*()_+`-={}|[]\\:"";'<>?,./") -
+ return(punc) ,
+
+ % an exception of string type, which is caught by lex:
+ re("pragma") - (func(Match) = noun(Match) :-
+ throw("No pragma, please, I am not Mercury !")),
+
+ % another exception, not caught by lex because it has the wrong type:
+ re("module") - (func(Match) = noun(Match) :-
+ throw(0815)),
+
+ whitespace - return(space),
+
+ nat - (func(Match) = small_num(Num) :-
+ Num = string__det_to_int(Match),
+ (
+ Num < 14
+ ;
+ throw("Hey, this number was really to big.")
+ ))
+].
- lexeme(value(prep),
- (str("the") \/ str("it") \/ str("them") \/ str("to") \/ str("on"))),
+:- func det_string_to_float(string) = float.
- lexeme(noval(punc),
- (any("~!@#$%^&*()_+`-={}|[]\\:"";'<>?,./"))),
+:- import_module require.
- lexeme(ignore,
- whitespace)
-].
+det_string_to_float(String) = Float :-
+ string__to_float(String, Float)
+ ;
+ error("Wrong regular expression chosen for token `real'").
%----------------------------------------------------------------------------- %
%----------------------------------------------------------------------------- %
+
+