[m-rev.] diff: enhancements of lex

Ralph Becket rbeck at microsoft.com
Tue Aug 14 22:43:52 AEST 2001


> From: Holger Krug [mailto:hkrug at rationalizer.com]
> Sent: 14 August 2001 11:38
> 
> For review by Ralph Beckett.
s/Beckett/Becket/ :)

> If some cosmetic changes should still remain, it would be nice if
> you could make them on your copy of the code, commit to the repository,
> and inform me by sending a relative diff.

All but one or two comments are cosmetic and I'd be happy to
make those changes after you've checked in your code.

The things that genuinely need fixing are one or two errors in comments
and a small code error in demo.m.

Other than that this looks fine.

- Ralph

> 
> Thanks, Holger
> 
> Summary of changes:
> 
> * the value contained in a token may now belong to an arbitrary
>   Mercury type, not only string as before
> * lexemes now may contain a function to compute an arbitrary token
>   value from the string matched
> * defining lexemes is now a little more involved, because the user
>   has to supply higher-order functions
> * the advantage is that no postprocessing is needed to evaluate
>   the string, hence the interface to the consumer of the tokens is
>   cleaner
> * predicate `lex__init/3' added which has as third argument a predicate
>   succeeding on tokens which have to be ignored
> * utility predicate `ignore/2' added which simplifies the definitions
>   of ignore predicates needed in calls to `lex__init/3'
> * typeclass regexp(T) added; the definition of type regexp is now hidden
>   from the user
> * function `->' added, which is a useful wrapper when defining lexemes
> * for more information about these changes see the mercury-reviews
>   thread starting at mercury-reviews/mercury-reviews.0107/0378.html and
>   continued at mercury-reviews/mercury-reviews.0108/0028.html
> * common code factored out into predicates process_any_winner and process_eof
> 
> Index: README
> ===================================================================
> RCS file: /home/mercury1/repository/mercury/extras/lex/README,v
> retrieving revision 1.1
> diff -u -r1.1 README
> --- README	2001/02/21 16:29:36	1.1
> +++ README	2001/08/14 10:25:53
> @@ -1,12 +1,14 @@
>  lex 1.0 (very alpha)
>  Fri Aug 25 17:54:28  2000
>  Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
> -
>      THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
>      BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
>      BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
> -
> -
> +Sun Aug  5 16:15:27 UTC 2001
> +Copyright (C) 2001 The Rationalizer Intelligent Software AG
> +    The changes made by Rationalizer are contributed under the terms
> +    of the GNU Free Documentation License - see the file COPYING in the
> +    Mercury Distribution.
> 
>  This package defines a lexer for Mercury.  There is plenty of scope
>  for optimization, however it is reasonably efficient and does provide
> @@ -23,25 +25,32 @@
> 
>      :- type token
>          --->    comment
> -        ;       id
> -        ;       num.
> +        ;       id(string)
> +        ;       num(int)
> +	;       space.

[Cosmetic: mixing tabs with spaces.]

> 
>  3. Set up a list of annotated_lexemes.
> 
>      Lexemes = [
> -        lexeme(noval(comment),      (atom('%'), star(dot))),
> -        lexeme(value(id),           identifier),
> -        lexeme(ignore,              whitespace)
> +	('%' >> *(dot))         -  return(comment),
> +	identifier              -  (func(Match) = id(Match)),
> +	signed_int              -  string__det_to_int,
> +	whitespace              -  return(space)
>      ]

This isn't quite right.  I think you mean

( "%" ++ *(dot)          -> return(comment) ),
( identifier             -> (func(Id) = id(Id)) ),
( signed_int             -> (func(N)  = num(string__det_to_int(N))) ),
( whitespace             -> return(space) )

Let me know if you think this is cosmetic and I'll change it
later.

> 
> -noval tokens are simply identified;
> -value tokens are identified and returned with the string matched;
> -ignore regexps are simply passed over.
> 
> +A lexeme is a pair (RegExp - TokFn) where RegExp is a
> +regular expression and TokFn is a token_creator function mapping
> +the string matched by RegExp to a token value.
> +
>  4. Set up a lexer with an appropriate read predicate (see the buf module).
> 
>      Lexer = lex__init(Lexemes, lex__read_from_stdin)
> 
> +    or:
> +
> +    Lexer = lex__init(Lexemes, lex__read_from_stdin, ignore(space))
> +
>  5. Obtain a live lexer state.
> 
>      State0 = lex__start(Lexer, IO0)
> @@ -49,12 +58,21 @@
>  6. Use it to lex the input stream.
> 
>      lex__read(Result, State0, State1),
> -    ( Result = ok(NoValToken), ...
> -    ; Result = ok(ValueToken, String), ...
> -    ; Result = error(OffsetInInputStream), ...
> +    ( Result = ok(Token), ...
> +    ; Result = error(Message, OffsetInInputStream), ...
>      ; Result = eof, ...
>      )
> 
> +    NOTE: The result type of lex__read is io__read_result(token).
> +    io__read_result is documented in the library file io.m as:
> +    :- type io__read_result(T)      --->    ok(T)
> +                                    ;       eof
> +                                    ;       error(string, int).
> +                                            % error message, line number
> +    In contrast to this the `int' lex returns in the case of an error
> +    does not correspond to the line number but to the character offset.
> +    Hence be careful when processing lex errors.
> +
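
A small consumer sketch in the README might help drive this point home,
just to reinforce that the int is a byte offset rather than a line
number.  Something along these lines (io__format and the s/i poly
types are from the standard io module):

    ( Result = error(Msg, Offset),
      io__format("lex error: %s at byte offset %d\n",
            [s(Msg), i(Offset)], IO0, IO)
    )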
>  7. If you need to manipulate the source object, you can.
> 
>      lex__manipulate_source(io__print("Not finished yet?"), State1, State2)
> @@ -69,7 +87,12 @@
>  and the option to write out a compilable source file for the lexer.
> 
> 
> +OPPORTUNITIES FOR MODULARIZATION
> 
> +1. Remove regexp functionality from lex.m and lex.regexp.m and put it into
> +   a distinct regexp library.
> +
> +
>  OPPORTUNITIES FOR OPTIMIZATION
> 
>  1. Move from chars to bytes.
> @@ -78,4 +101,16 @@
>  3. Implement the first-byte optimization whereby the set of `live lexemes'
>  is decided by the first byte read in on a lexing pass.
>  4. Implement state machine minimization (may or may not be worthwhile.)
> +
> +
> +FEATURES TO ADD:
> +
> +1. Symbol table management (additional parameters for the user-defined
> +   predicates containing the symbol table before and after processing
> +   a lexeme)
> +2. func (string) = regexp, where the function parameter contains a
> +   regexp definition in a form like that used in Perl, awk, etc.
> +3. line# as part of the offset
> +4. extend the lexer interface somehow to get more detailed information
> +   about the token resp. error position
> 
> Index: lex.buf.m
> ===================================================================
> RCS file: /home/mercury1/repository/mercury/extras/lex/lex.buf.m,v
> retrieving revision 1.1
> diff -u -r1.1 lex.buf.m
> --- lex.buf.m	2001/02/21 16:29:36	1.1
> +++ lex.buf.m	2001/08/14 10:25:53
> @@ -1,8 +1,9 @@
>  %----------------------------------------------------------------------------- %
> +% vim: ts=4 sw=4 et tw=0 wm=0 ff=unix
> +%
>  % lex.buf.m
>  % Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
>  % Sat Aug 19 16:56:30 BST 2000
> -% vim: ts=4 sw=4 et tw=0 wm=0 ff=unix
>  %
>  %   THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
>  %   BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
> @@ -94,7 +95,7 @@
> 
>  :- interface.
> 
> -:- import_module int, array, char, bool, string, io.
> +:- import_module array, char, bool, string.
> 
> 
> 
> @@ -171,7 +172,6 @@
>  :- implementation.
> 
>  :- import_module exception.
> -
> 
> 
>      % The amount the buffer is grown by if (a) more space is
> Index: lex.convert_NFA_to_DFA.m
> ===================================================================
> RCS file: /home/mercury1/repository/mercury/extras/lex/lex.convert_NFA_to_DFA.m,v
> retrieving revision 1.1
> diff -u -r1.1 lex.convert_NFA_to_DFA.m
> --- lex.convert_NFA_to_DFA.m	2001/02/21 16:29:36	1.1
> +++ lex.convert_NFA_to_DFA.m	2001/08/14 10:25:53
> @@ -1,8 +1,10 @@
>  %-----------------------------------------------------------------------------
> +%
> +% vim: ts=4 sw=4 et tw=0 wm=0 ff=unix
> +%
>  % lex.convert_NFA_to_DFA.m
>  % Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
>  % Fri Aug 18 12:30:25 BST 2000
> -% vim: ts=4 sw=4 et tw=0 wm=0 ff=unix
>  %
>  % Powerset construction used to transform NFAs into DFAs.
>  %
> Index: lex.lexeme.m
> ===================================================================
> RCS file: /home/mercury1/repository/mercury/extras/lex/lex.lexeme.m,v
> retrieving revision 1.1
> diff -u -r1.1 lex.lexeme.m
> --- lex.lexeme.m	2001/02/21 16:29:36	1.1
> +++ lex.lexeme.m	2001/08/14 10:25:53
> @@ -1,16 +1,22 @@
>  %-----------------------------------------------------------------------------
> +%
> +% vim: ts=4 sw=4 et tw=0 wm=0 ff=unix
> +%
>  % lex.lexeme.m
> -% Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
>  % Sat Aug 19 08:22:32 BST 2000
> -% vim: ts=4 sw=4 et tw=0 wm=0 ff=unix
> +% Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
> +%   THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
> +%   BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
> +%   BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
> +% Thu Jul 26 07:45:47 UTC 2001
> +% Copyright (C) 2001 The Rationalizer Intelligent Software AG
> +%   The changes made by Rationalizer are contributed under the terms
> +%   of the GNU Lesser General Public License.
>  %
>  % A lexeme combines a token with a regexp.  The lexer compiles
>  % lexemes and returns the longest successful parse in the input
>  % stream or an error if no match occurs.
>  %
> -%   THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
> -%   BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
> -%   BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
>  %
>  %----------------------------------------------------------------------------- %
> 
> @@ -22,15 +28,17 @@
> 
>  :- type compiled_lexeme(T)
>      --->    compiled_lexeme(
> -                clxm_token              :: T,
> -                clxm_state              :: state_no,
> -                clxm_transition_map     :: transition_map
> +                token              :: token_creator(T),
> +                state              :: state_no,
> +                transition_map     :: transition_map
>              ).
> +:- inst compiled_lexeme(Inst)
> +    --->    compiled_lexeme(Inst, ground, ground).

Why does this inst need to be parameterized?  I would have
thought this would be sufficient:

:- inst compiled_lexeme
    --->    compiled_lexeme(func(in) = out is det, ground, ground).
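
For that matter, since lex.m already defines

:- inst token_creator
    ==      (func(in) = out is det).

the unparameterized version could simply reuse it:

:- inst compiled_lexeme
    --->    compiled_lexeme(token_creator, ground, ground).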

> 
>  :- type transition_map
>      --->    transition_map(
> -                trm_accepting_states    :: bitmap,
> -                trm_rows                :: array(row)
> +                accepting_states    :: bitmap,
> +                rows                :: array(row)
>              ).
> 
>      % A transition row is an array of byte_transitions.
> @@ -59,12 +67,12 @@
>      % an accepting state_no.
>      %
>  :- pred next_state(compiled_lexeme(T), state_no, char, state_no, bool).
> -:- mode next_state(in, in, in, out, out) is semidet.
> +:- mode next_state(in(live_lexeme), in, in, out, out) is semidet.
> 
>      % Succeeds iff a compiled_lexeme is in an accepting state_no.
>      %
> -:- pred in_accepting_state(compiled_lexeme(T)).
> -:- mode in_accepting_state(in) is semidet.
> +:- pred in_accepting_state(live_lexeme(T)).
> +:- mode in_accepting_state(in(live_lexeme)) is semidet.
> 
>  %----------------------------------------------------------------------------- %
>  %----------------------------------------------------------------------------- %
> @@ -77,7 +85,7 @@
>  %-----------------------------------------------------------------------------%
> 
>  compile_lexeme(Lexeme) = CompiledLexeme :-
> -    Lexeme         = lexeme(Token, RegExp),
> +    Lexeme         = (RegExp - TokenCreator),
>      NFA            = remove_null_transitions(regexp_to_NFA(RegExp)),
>      DFA            = convert_NFA_to_DFA(NFA),
>      StartState     = DFA ^ smc_start_state,
> @@ -87,7 +95,7 @@
>      Accepting      = set_accepting_states(StopStates, bitmap__new(N, no)),
>      Rows           = array(set_up_rows(0, N, Transitions)),
>      TransitionMap  = transition_map(Accepting, Rows),
> -    CompiledLexeme = compiled_lexeme(Token, StartState, TransitionMap).
> +    CompiledLexeme = compiled_lexeme(TokenCreator, StartState, TransitionMap).
> 
>  %-----------------------------------------------------------------------------%
> 
> @@ -157,9 +165,11 @@
>  %-----------------------------------------------------------------------------%
> 
>  next_state(CLXM, CurrentState, Char, NextState, IsAccepting) :-
> -    Rows            = CLXM ^ clxm_transition_map ^ trm_rows,
> -    AcceptingStates = CLXM ^ clxm_transition_map ^ trm_accepting_states,
> -    find_next_state(char__to_int(Char), Rows ^ elem(CurrentState), NextState),
> +    Rows            = CLXM ^ transition_map ^ rows,
> +    AcceptingStates = CLXM ^ transition_map ^ accepting_states,
> +    find_next_state(char__to_int(Char),
> +				 Rows ^ elem(CurrentState),
> +				 NextState),
>      IsAccepting     = bitmap__get(AcceptingStates, NextState).
> 
>  %-----------------------------------------------------------------------------%
> @@ -189,7 +199,7 @@
> 
>  in_accepting_state(CLXM) :-
>      bitmap__is_set(
> -        CLXM ^ clxm_transition_map ^ trm_accepting_states, CLXM ^ clxm_state
> +        CLXM ^ transition_map ^ accepting_states, CLXM ^ state
>      ).
> 
>  %-----------------------------------------------------------------------------%
> Index: lex.m
> ===================================================================
> RCS file: /home/mercury1/repository/mercury/extras/lex/lex.m,v
> retrieving revision 1.1
> diff -u -r1.1 lex.m
> --- lex.m	2001/02/21 16:29:36	1.1
> +++ lex.m	2001/08/14 10:25:54
> @@ -1,15 +1,19 @@
>  %----------------------------------------------------------------------------- %
> +% vim: ts=4 sw=4 et tw=0 wm=0 ff=unix
> +%
>  % lex.m
>  % Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
>  % Sun Aug 20 09:08:46 BST 2000
> -% vim: ts=4 sw=4 et tw=0 wm=0 ff=unix
> -%
> -% This module puts everything together, compiling a list of lexemes
> -% into state machines and turning the input stream into a token stream.
> -%
>  %   THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
>  %   BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
>  %   BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
> +% Thu Jul 26 07:45:47 UTC 2001
> +% Copyright (C) 2001 The Rationalizer Intelligent Software AG
> +%   The changes made by Rationalizer are contributed under the terms
> +%   of the GNU Lesser General Public License.
> +%
> +% This module puts everything together, compiling a list of lexemes
> +% into state machines and turning the input stream into a token stream.
>  %
>  %----------------------------------------------------------------------------- %
> 
> @@ -18,37 +22,28 @@
>  :- interface.
> 
>  :- import_module std_util, string, char, list, io.
> -
> -
> 
> -:- type annotated_lexeme(Token)
> -    ==      lexeme(annotated_token(Token)).
> +:- type token_creator(Token)
> +    ==                        (func(string)=Token).
> +:- inst token_creator
> +    ==                        (func(in)=out is det).

Cosmetic: s/)=/) = /

> 
>  :- type lexeme(Token)
> -    --->    lexeme(
> -                lxm_token               :: Token,
> -                lxm_regexp              :: regexp
> -            ).
> -
> -:- type annotated_token(Token)
> -    --->    noval(Token)                % Just return ok(Token) on success.
> -    ;       value(Token)                % Return ok(Token, String) on success.
> -    ;       ignore.                     % Just skip over these tokens.
> +    ==                        pair(regexp, token_creator(Token)).
> 
> +:- inst lexeme(Inst)
> +    ---> (ground - Inst).
> +
>  :- type lexer(Token, Source).
>  :- inst lexer
> -    ==      bound(lexer(ground, read_pred)).
> +    ---> lexer(
> +	       ground,
> +	       ignore_pred,
> +	       read_pred
> +	      ).

Cosmetic: could put all this on one line.

> 
>  :- type lexer_state(Token, Source).
> 
> -:- type lexer_result(Token)
> -    --->    ok(Token)                   % Noval token matched.
> -    ;       ok(Token, string)           % Value token matched.
> -    ;       eof                         % End of input.
> -    ;       error(int).                 % No matches for string at this offset.
> -
> -
> -
>  :- type offset
>      ==      int.                        % Byte offset into the source data.
> 
> @@ -66,27 +61,45 @@
>  :- inst read_pred
>      ==      ( pred(in, out, di, uo) is det ).
> 
> -
> +    % ignore_pred(Token): succeeds iff Token should be ignored.
> +    %
> +:- type ignore_pred(Tok)
> +    ==      pred(Tok).
> +:- inst ignore_pred
> +    ==      ( pred(in) is semidet ).
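
[Not a change request, just an observation: the README could mention
that an ignore_pred is something one can write by hand as a lambda,
e.g. (assuming a token type with a `space' constructor)

    IgnoreSpace = ( pred(T::in) is semidet :- T = space ),

although your ignore/2 utility below makes `ignore(space)' do the
same job more readably.]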
> 
>      % The type of regular expressions.
>      %
> -:- type regexp
> -    --->    null                % The empty regexp
> -    ;       atom(char)          % Match a single char
> -    ;       (regexp >> regexp)  % Concatenation
> -    ;       (regexp \/ regexp)  % Alternation
> -    ;       star(regexp)        % Kleene closure
> -    .
> +:- type regexp.
> 
> +    % The typeclass for types having a natural converter to regexp's
> +    %
> +:- typeclass regexp(T) where [
> +           func re(T) = regexp
> +].
> 
> +    % Handling regexp's based on the typeclass regexp(T)
> +    %
> +:- func null       = regexp.
> +:- func T1 ++ T2   = regexp  <= (regexp(T1), regexp(T2)).
> +    % one of the following two functions will be declared deprecated
> +    % later on; we just have to decide which one
> +:- func T1 \/ T2   = regexp  <= (regexp(T1), regexp(T2)).
> +:- func (T1 or T2) = regexp  <= (regexp(T1), regexp(T2)).
> +:- func *(T)       = regexp  <= regexp(T).
> 
> +    % Some instances of typeclass regexp(T)
> +    %
> +:- instance regexp(regexp).
> +:- instance regexp(char).
> +:- instance regexp(string).
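
[A nice consequence of these instances is that chars and strings can
appear directly in regexps.  E.g. something like

    Keyword = ("if" or "then" or "else"),
    Ident   = identstart ++ *(ident)

both have type regexp via the re/1 conversions.  Probably worth a
sentence in the README.]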
> +
>      % Some basic non-primitive regexps.
>      %
> -:- func str(string) = regexp.   % str("abc") = atom(a) >> atom(b) >> atom(c)
> -:- func any(string) = regexp.   % any("abc") = atom(a) \/ atom(b) \/ atom(c)
> -:- func anybut(string) = regexp.% anybut("abc") is complement of any("abc")
> -:- func opt(regexp) = regexp.   % opt(R)     = R \/ null
> -:- func plus(regexp) = regexp.  % plus(R)    = R \/ star(R)
> +:- func any(string) = regexp.        % any("abc") = ('a') \/ ('b') \/ ('c')
> +:- func anybut(string) = regexp.     % anybut("abc") is complement of any("abc")
> +:- func ?(T) = regexp <= regexp(T).  % ?(R)     = R \/ null
> +:- func +(T) = regexp <= regexp(T).  % +(R)    = R ++ *(R)
> 
>      % Some useful single-char regexps.
>      %
> @@ -95,34 +108,54 @@
>  :- func upper = regexp.         % upper      = any("ABC...Z")
>  :- func alpha = regexp.         % alpha      = lower \/ upper
>  :- func alphanum = regexp.      % alphanum   = alpha \/ digit
> -:- func identstart = regexp.    % identstart = alpha \/ str("_")
> -:- func ident = regexp.         % ident      = alphanum \/ str("_")
> -:- func nl = regexp.            % nl         = str("\n")
> -:- func tab = regexp.           % tab        = str("\t")
> -:- func spc = regexp.           % spc        = str(" ")
> +:- func identstart = regexp.    % identstart = alpha \/ "_"
> +:- func ident = regexp.         % ident      = alphanum \/ "_"
> +:- func nl = regexp.            % nl         = re("\n")
> +:- func tab = regexp.           % tab        = re("\t")
> +:- func spc = regexp.           % spc        = re(" ")
>  :- func wspc = regexp.          % wspc       = any(" \t\n\r\f\v")
>  :- func dot = regexp.           % dot        = any("<except \n>")

Cosmetic: s/any("<except \n>")/anybut("\n")/

> 
>      % Some useful compound regexps.
>      %
> -:- func nat = regexp.           % nat        = plus(digit)
> -:- func signed_int = regexp.    % signed_int = opt(any("+-")) >> nat
> +:- func nat = regexp.           % nat        = +(digit)
> +:- func signed_int = regexp.    % signed_int = ?(any("+-")) ++ nat
>  :- func real = regexp.          % real       = \d+((.\d+([eE]int)?)|[eE]int)
> -:- func identifier = regexp.    % identifier = identstart >> star(ident)
> -:- func whitespace = regexp.    % whitespace = star(wspc)
> -:- func junk = regexp.          % junk       = star(dot)
> -
> +:- func identifier = regexp.    % identifier = identstart ++ *(ident)
> +:- func whitespace = regexp.    % whitespace = *(wspc)
> +:- func junk = regexp.          % junk       = *(dot)
> +
> +   % Utility predicate to create ignore_pred's.
> +   % Use it in the form `ignore(my_token)' to ignore just `my_token'.
> +   %
> +:- pred ignore(Token::in, Token::in) is semidet.
> +
> +   % Utility function to return noval tokens.
> +   % Use it in the form `return(my_token)' inside a lexeme definition.
> +   %
> +:- func return(T,string) = T.
> +
> +   % Utility operator to create lexemes.
> +:- func (T1 -> token_creator(Tok)) = pair(regexp,token_creator(Tok))
> +	<= regexp(T1).
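
[It may be worth documenting that return(Tok) is just the
token_creator that discards the matched string - i.e. return(punc)
behaves like (func(_) = punc) - so a complete lexeme can be written
as, say,

    ( "%" ++ junk -> return(comment) )

assuming a nullary `comment' constructor as in the README example.]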
> 
> -
>      % Construct a lexer from which we can generate running
>      % instances.
>      %
> -:- func init(list(annotated_lexeme(Tok)), read_pred(Src)) = lexer(Tok, Src).
> +:- func init(list(lexeme(Tok)), read_pred(Src)) = lexer(Tok, Src).
>  :- mode init(in, in(read_pred)) = out(lexer) is det.
> 
> -    % Handy read predicates.
> +    % Construct a lexer from which we can generate running
> +    % instances. If we construct a lexer with init/4, we
> +    % can additionally ignore specific tokens.
>      %
> -:- pred read_from_stdin(offset, read_result, io__state, io__state).
> +:- func init(list(lexeme(Tok)), read_pred(Src), ignore_pred(Tok)) =
> +	lexer(Tok, Src).
> +:- mode init(in, in(read_pred), in(ignore_pred)) = out(lexer) is det.
> +
> +				% Handy read predicates.
> +    %
> +:- pred read_from_stdin(offset, read_result, io, io).
>  :- mode read_from_stdin(in, out, di, uo) is det.
> 
>  :- pred read_from_string(offset, read_result, string, string).
> @@ -135,7 +168,8 @@
>  :- func start(lexer(Tok, Src), Src) = lexer_state(Tok, Src).
>  :- mode start(in(lexer), di) = uo is det.
> 
> -:- pred read(lexer_result(Tok), lexer_state(Tok, Src), lexer_state(Tok, Src)).
> +:- pred read(io__read_result(Tok), lexer_state(Tok, Src),
> +	     lexer_state(Tok, Src)).
>  :- mode read(out, di, uo) is det.
> 
>      % Stop a running instance of a lexer and retrieve the input source.
> @@ -143,7 +177,7 @@
>  :- func stop(lexer_state(_Tok, Src)) = Src.
>  :- mode stop(di) = uo is det.
> 
> -    % Sometimes (e.g. when lexing the io__state) you want access to the
> +    % Sometimes (e.g. when lexing the io__io) you want access to the
>      % input stream without interrupting the lexing process.  This pred
>      % provides that sort of access.
>      %
> @@ -170,39 +204,64 @@
> 
>  :- type lexer(Token, Source)
>      --->    lexer(
> -                lex_compiled_lexemes    :: list(live_lexeme(Token)),
> -                lex_buf_read_pred       :: read_pred(Source)
> -            ).
> -
> -
> +		  lex_compiled_lexemes    :: list(live_lexeme(Token)),
> +		  lex_ignore_pred         :: ignore_pred(Token),
> +		  lex_buf_read_pred       :: read_pred(Source)
> +		 ).

Cosmetic: is the `lex_' prefix on the field names necessary?

> 
>  :- type lexer_instance(Token, Source)
>      --->    lexer_instance(
> -                lexi_init_lexemes       :: list(live_lexeme(Token)),
> -                lexi_live_lexemes       :: list(live_lexeme(Token)),
> -                lexi_current_winner     :: winner(Token),
> -                lexi_buf_state          :: buf_state(Source)
> -            ).
> -:- inst lexer_instance
> -    ==      bound(lexer_instance(ground, ground, ground, buf_state)).
> -
> +			   init_lexemes       :: list(live_lexeme(Token)),
> +			   live_lexemes       :: list(live_lexeme(Token)),
> +			   current_winner     :: winner(Token),
> +			   buf_state          :: buf_state(Source),
> +			   ignore_pred        :: ignore_pred(Token)
> +			  ).
> 
> +:- inst lexer_instance
> +    --->      lexer_instance(
> +			     live_lexeme_list,
> +                             live_lexeme_list,
> +	                     winner,
> +                             buf__buf_state,
> +			     ignore_pred
> +			    ).

Cosmetic: mixed spaces and tabs.

> 
>  :- type live_lexeme(Token)
> -    ==      compiled_lexeme(annotated_token(Token)).
> +    ==      compiled_lexeme(Token).
> +:- inst live_lexeme
> +    ==      compiled_lexeme(token_creator).
> +:- inst live_lexeme_list
> +    ==      list__list_skel(live_lexeme).
> 
> 
> 
>  :- type winner(Token)
> -    ==      maybe(pair(annotated_token(Token), offset)).
> +    ==      maybe(pair(token_creator(Token), offset)).
> +:- inst winner
> +    ---> yes(bound(token_creator - ground))
> +    ;    no.

Cosmetic: s/bound(token_creator - ground)/pair(token_creator, ground)/

The inst pair/2 is defined in std_util.
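
That is, the inst would become

:- inst winner
    --->    yes(pair(token_creator, ground))
    ;       no.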

> 
>  %----------------------------------------------------------------------------- %
> +ignore(Tok,Tok).
> 
> -init(Lexemes, BufReadPred) = lexer(CompiledLexemes, BufReadPred) :-
> -    CompiledLexemes = list__map(lexeme__compile_lexeme, Lexemes).
> +%----------------------------------------------------------------------------- %
> +return(Token, _) = Token.
> 
>  %----------------------------------------------------------------------------- %
> +(R1 -> TC) = (re(R1) - TC).

Cosmetic: to be consistent with the formatting in the rest of the 
file, you should add a blank line after the line of dashes.


> 
> +%----------------------------------------------------------------------------- %
> +
> +init(Lexemes, BufReadPred) = init(Lexemes, BufReadPred, DontIgnoreAnything) :-
> +    DontIgnoreAnything = ( pred(_::in) is semidet :- semidet_fail ).
> +
> +init(Lexemes, BufReadPred, IgnorePred) =
> +    lexer(CompiledLexemes, IgnorePred, BufReadPred) :-
> +    CompiledLexemes = list__map(compile_lexeme, Lexemes).
> +
> +%----------------------------------------------------------------------------- %
> +
>  start(Lexer, Src) = LexerState :-
>      init_lexer_instance(Lexer, Instance, Buf),
>      LexerState = args_lexer_state(Instance, Buf, Src).
> @@ -215,7 +274,9 @@
>  init_lexer_instance(Lexer, Instance, Buf) :-
>      buf__init(Lexer ^ lex_buf_read_pred, BufState, Buf),
>      InitLexemes = Lexer ^ lex_compiled_lexemes,
> -    Instance    = lexer_instance(InitLexemes, InitLexemes, no, BufState).
> +    IgnorePred  = Lexer ^ lex_ignore_pred,
> +    Instance    = lexer_instance(InitLexemes, InitLexemes, no,
> +				 BufState, IgnorePred).
> 
>  %----------------------------------------------------------------------------- %
> 
> @@ -231,7 +292,7 @@
> 
> 
> 
> -:- pred read_0(lexer_result(Tok),
> +:- pred read_0(io__read_result(Tok),
>              lexer_instance(Tok, Src), lexer_instance(Tok, Src),
>              buf, buf, Src, Src).
>  :- mode read_0(out,
> @@ -243,7 +304,7 @@
>      %
>  read_0(Result, Instance0, Instance, Buf0, Buf, Src0, Src) :-
> 
> -    BufState0    = Instance0 ^ lexi_buf_state,
> +    BufState0    = Instance0 ^ buf_state,
> 
>      buf__read(BufReadResult, BufState0, BufState1, Buf0, Buf1, Src0, Src1),
>      (
> @@ -259,7 +320,7 @@
> 
>  %----------------------------------------------------------------------------- %
> 
> -:- pred process_char(lexer_result(Tok), char,
> +:- pred process_char(io__read_result(Tok), char,
>              lexer_instance(Tok, Src), lexer_instance(Tok, Src),
>              buf_state(Src), buf, buf, Src, Src).
>  :- mode process_char(out, in, in(lexer_instance), out(lexer_instance),
> @@ -268,8 +329,8 @@
>  process_char(Result, Char, Instance0, Instance,
>          BufState, Buf0, Buf, Src0, Src) :-
> 
> -    LiveLexemes0 = Instance0 ^ lexi_live_lexemes,
> -    Winner0      = Instance0 ^ lexi_current_winner,
> +    LiveLexemes0 = Instance0 ^ live_lexemes,
> +    Winner0      = Instance0 ^ current_winner,
> 
>      advance_live_lexemes(Char, buf__cursor_offset(BufState),
>              LiveLexemes0, LiveLexemes, Winner0, Winner),
> @@ -282,44 +343,44 @@
>          LiveLexemes = [_ | _],          % Still some open possibilities.
> 
>          Instance1 = (((Instance0
> -                            ^ lexi_live_lexemes   := LiveLexemes)
> -                            ^ lexi_current_winner := Winner)
> -                            ^ lexi_buf_state      := BufState),
> +                            ^ live_lexemes   := LiveLexemes)
> +                            ^ current_winner := Winner)
> +                            ^ buf_state      := BufState),
>          read_0(Result, Instance1, Instance, Buf0, Buf, Src0, Src)
>      ).
> 
>  %----------------------------------------------------------------------------- %
> 
> -:- pred process_any_winner(lexer_result(Tok), winner(Tok),
> +:- pred process_any_winner(io__read_result(Tok), winner(Tok),
>              lexer_instance(Tok, Src), lexer_instance(Tok, Src),
>              buf_state(Src), buf, buf, Src, Src).
> -:- mode process_any_winner(out, in,
> +:- mode process_any_winner(out, in(winner),
>              in(lexer_instance), out(lexer_instance),
>              in(buf_state), array_di, array_uo, di, uo) is det.
> 
> -process_any_winner(Result, yes(ATok - Offset), Instance0, Instance,
> +process_any_winner(Result, yes(TokenCreator - Offset), Instance0, Instance,
>          BufState0, Buf0, Buf, Src0, Src) :-
> 
>      BufState1 = buf__rewind_cursor(Offset, BufState0),
>      Instance1 = ((( Instance0
> -                        ^ lexi_live_lexemes   := Instance0 ^ lexi_init_lexemes)
> -                        ^ lexi_current_winner := no)
> -                        ^ lexi_buf_state      := buf__commit(BufState1)),
> +                        ^ live_lexemes   := Instance0 ^ init_lexemes)
> +                        ^ current_winner := no)
> +                        ^ buf_state      := buf__commit(BufState1)),
>      (
> -        ATok     = noval(Token),
> -        Result   = ok(Token),
> -        Instance = Instance1,
> -        Buf      = Buf0,
> -        Src      = Src0
> -    ;
> -        ATok     = value(Token),
> -        Result   = ok(Token, buf__string_to_cursor(BufState1, Buf)),
> -        Instance = Instance1,
> -        Buf      = Buf0,
> -        Src      = Src0
> -    ;
> -        ATok     = ignore,
> -        read_0(Result, Instance1, Instance, Buf0, Buf, Src0, Src)
> +      if
> +
> +         get_token_from_buffer(BufState1, Buf0, Instance0,
> +			       TokenCreator, Token)
> +      then
> +
> +         Result   = ok(Token),
> +         Instance = Instance1,
> +         Buf      = Buf0,
> +         Src      = Src0
> +
> +      else
> +
> +         read_0(Result, Instance1, Instance, Buf0, Buf, Src0, Src)
>      ).
> 
>  process_any_winner(Result, no, Instance0, Instance,
> @@ -327,16 +388,16 @@
> 
>      Start     = buf__start_offset(BufState0),
>      BufState1 = buf__rewind_cursor(Start + 1, BufState0),
> -    Result    = error(Start),
> +    Result    = error("input not matched by any regexp", Start),
>      Instance  = ((( Instance0
> -                        ^ lexi_live_lexemes   :=
> -                                Instance0 ^ lexi_init_lexemes)
> -                        ^ lexi_current_winner := no)
> -                        ^ lexi_buf_state      := buf__commit(BufState1)).
> +                        ^ live_lexemes   :=
> +                                Instance0 ^ init_lexemes)
> +                        ^ current_winner := no)
> +                        ^ buf_state      := buf__commit(BufState1)).
> 
>  %----------------------------------------------------------------------------- %
> 
> -:- pred process_eof(lexer_result(Tok),
> +:- pred process_eof(io__read_result(Tok),
>              lexer_instance(Tok, Src), lexer_instance(Tok, Src),
>              buf_state(Src), buf).
>  :- mode process_eof(out, in(lexer_instance), out(lexer_instance),
> @@ -346,44 +407,63 @@
> 
>      ( if
> 
> -        live_lexeme_in_accepting_state(Instance0 ^ lexi_live_lexemes, ATok)
> +        live_lexeme_in_accepting_state(Instance0 ^ live_lexemes, TokenCreator)
> 
>        then
> 
>              % Return the token and set things up so that we return
>              % eof next.
> -        (
> -            ATok   = noval(Token),
> -            Result = ok(Token)
> -        ;
> -            ATok   = value(Token),
> -            Result = ok(Token, buf__string_to_cursor(BufState, Buf))
> -        ;
> -            ATok   = ignore,
> -            Result = eof
> -        )
> +            %
> +	    (
> +	      if
> +
> +	         get_token_from_buffer(BufState, Buf, Instance0,
> +				       TokenCreator, Token)
> +	      then
> +
> +	         Result = ok(Token)
> +
> +	      else
> 
> +	         Result = eof
> +	    )
> +
>        else
> 
>          Result     = eof
>      ),
>      Instance  = ((Instance0
> -                        ^ lexi_live_lexemes := [])
> -                        ^ lexi_buf_state    :=
buf__commit(BufState)).
> +                        ^ live_lexemes := [])
> +                        ^ buf_state    := buf__commit(BufState)).
> +
> +%----------------------------------------------------------------------------- %
> +
> +:- pred get_token_from_buffer(buf_state(Src), buf, lexer_instance(Tok, Src),
> +			      token_creator(Tok), Tok).
> +:- mode get_token_from_buffer(in(buf_state), array_ui, in(lexer_instance),
> +			      in(token_creator), out) is semidet.
> +
> +get_token_from_buffer(BufState, Buf, Instance, TokenCreator, Token) :-
> +    Match     = buf__string_to_cursor(BufState, Buf),
> +    Token     = TokenCreator(Match),
> +    IgnorePred = Instance ^ ignore_pred,
> +    \+ IgnorePred(Token).

Cosmetic: I think I used `not' rather than `\+' in the rest of the file.

> 
>  %----------------------------------------------------------------------------- %
> 
>  :- pred advance_live_lexemes(char, offset,
>              list(live_lexeme(Token)), list(live_lexeme(Token)),
>              winner(Token), winner(Token)).
> -:- mode advance_live_lexemes(in, in, in, out, in, out) is det.
> +:- mode advance_live_lexemes(in, in, in(live_lexeme_list),
> +                             out(live_lexeme_list),
> +			     in(winner), out(winner)) is det.
> 
>  advance_live_lexemes(_Char, _Offset, [], [], Winner, Winner).
> 
>  advance_live_lexemes(Char, Offset, [L0 | Ls0], Ls, Winner0, Winner) :-
> 
> -    State0        = L0 ^ clxm_state,
> -    ATok          = L0 ^ clxm_token,
> +    State0        = L0 ^ state,
> +    ATok          = L0 ^ token,
> 
>      ( if next_state(L0, State0, Char, State, IsAccepting) then
> 
> @@ -395,7 +475,7 @@
>              Winner1     = yes(ATok - Offset)
>          ),
>          advance_live_lexemes(Char, Offset, Ls0, Ls1, Winner1, Winner),
> -        Ls = [( L0 ^ clxm_state := State ) | Ls1]
> +        Ls = [( L0 ^ state := State ) | Ls1]
> 
>        else
> 
> @@ -405,15 +485,17 @@
>  %----------------------------------------------------------------------------- %
> 
>  :- pred live_lexeme_in_accepting_state(list(live_lexeme(Tok)),
> -                annotated_token(Tok)).
> -:- mode live_lexeme_in_accepting_state(in, out) is semidet.
> +                token_creator(Tok)).
> +:- mode live_lexeme_in_accepting_state(in(live_lexeme_list),
> +	                  out(token_creator)) is semidet.
> 
>  live_lexeme_in_accepting_state([L | Ls], Token) :-
>      ( if in_accepting_state(L)
> -      then Token = L ^ clxm_token
> +      then Token = L ^ token
>        else live_lexeme_in_accepting_state(Ls, Token)
>      ).
> 
> +
>  %----------------------------------------------------------------------------- %
>  %----------------------------------------------------------------------------- %
> 
> @@ -423,10 +505,10 @@
> 
>  :- type lexer_state(Tok, Src)
>      --->    lexer_state(
> -                lxr_instance            :: lexer_instance(Tok, Src),
> -                lxr_buf                 :: buf,
> -                lxr_src                 :: Src
> -            ).
> +			run                 :: lexer_instance(Tok, Src),
> +			buf                 :: buf,
> +			src                 :: Src
> +		       ).

Cosmetic: mixed spaces and tabs.

> 
>  %-----------------------------------------------------------------------------%
> 
> @@ -471,17 +553,49 @@
>          Result = eof
>      ).
> 
> +
>  %----------------------------------------------------------------------------- %
> -% Some basic non-primitive regexps.
> +% The type of regular expressions.
> +:- type regexp
> +    --->    eps                    % The empty regexp
> +    ;       atom(char)             % Match a single char
> +    ;       conc(regexp,regexp)    % Concatenation
> +    ;       alt(regexp, regexp)    % Alternation
> +    ;       star(regexp)           % Kleene closure
> +    .
> 
> -str(S) = R :-
> +%----------------------------------------------------------------------------- %

Cosmetic: add a blank line here.

> +% Some instances of typeclass regexp(T)
> +:- instance regexp(regexp) where [
> +	  re(RE) = RE
> +].
> +
> +:- instance regexp(char) where [
> +	  re(C) = atom(C)
> +].
> +
> +:- instance regexp(string) where [
> +	  re(S) =  R :-
>      ( if S = "" then
>          R = null
>        else
>          L = string__length(S),
>          C = string__index_det(S, L - 1),
> -        R = str_foldr(func(Cx, Rx) = (atom(Cx) >> Rx), S, atom(C), L - 2)
> -    ).
> +        R = str_foldr(func(Cx, Rx) = (Cx ++ Rx), S, re(C), L - 2)
> +    )
> +].
> +
> +
> +%----------------------------------------------------------------------------- %
> +% Basic primitive regexps.
> +null = eps.
> +R1 ++ R2   = conc(re(R1), re(R2)).
> +R1 \/ R2   = alt(re(R1), re(R2)).
> +(R1 or R2) = alt(re(R1), re(R2)).
> +*(R1)      = star(re(R1)).
> +
> +%----------------------------------------------------------------------------- %
> +% Some basic non-primitive regexps.
> 
>  any(S) = R :-
>      ( if S = "" then
> @@ -489,7 +603,7 @@
>        else
>          L = string__length(S),
>          C = string__index_det(S, L - 1),
> -        R = str_foldr(func(Cx, Rx) = (atom(Cx) \/ Rx), S, atom(C), L - 2)
> +        R = str_foldr(func(Cx, Rx) = (Cx \/ Rx), S, re(C), L - 2)
>      ).
> 
>  anybut(S0) = R :-
> @@ -511,9 +625,9 @@
>                 else str_foldr(Fn, S, Fn(string__index_det(S, I), X), I - 1)
>      ).
> 
> -opt(R)  = (R \/ null).
> +?(R)  = (R \/ null).
> 
> -plus(R) = (R >> star(R)).
> ++(R) = (R ++ *(R)).
> 
>  %----------------------------------------------------------------------------- %
>  % Some useful single-char regexps.
> @@ -535,24 +649,24 @@
>  dot        = anybut("\n").
>  alpha      = (lower \/ upper).
>  alphanum   = (alpha \/ digit).
> -identstart = (alpha \/ atom('_')).
> -ident      = (alphanum \/ atom('_')).
> -nl         = atom('\n').
> -tab        = atom('\t').
> -spc        = atom(' ').
> +identstart = (alpha \/ ('_')).
> +ident      = (alphanum \/ ('_')).
> +nl         = re('\n').
> +tab        = re('\t').
> +spc        = re(' ').
> 
>  %----------------------------------------------------------------------------- %
>  % Some useful compound regexps.
> 
> -nat        = plus(digit).
> -signed_int = (opt(any("+-")) >> nat).
> -real       = (signed_int >> (
> -                (atom('.') >> nat >> opt(any("eE") >> signed_int)) \/
> -                (any("eE") >> signed_int)
> +nat        = +(digit).
> +signed_int = (?(any("+-")) ++ nat).
> +real       = (signed_int ++ (
> +                ('.' ++ nat ++ ?(any("eE") ++ signed_int)) \/
> +                (any("eE") ++ signed_int)

Cosmetic: now that chars are in the regexp typeclass we might write
the above more clearly as

signed_int = ?("+" or "-") ++ nat.
real       = signed_int ++ (
               ("." ++ nat ++ ?(("e" or "E") ++ signed_int)) or
               (                ("e" or "E") ++ signed_int )
             ).

>               )).
> -identifier = (identstart >> star(ident)).
> -whitespace = star(wspc).
> -junk       = star(dot).
> +identifier = (identstart ++ *(ident)).
> +whitespace = *(wspc).
> +junk       = *(dot).
> 
>  %----------------------------------------------------------------------------- %
>  %----------------------------------------------------------------------------- %
> Index: lex.regexp.m
> ===================================================================
> RCS file: /home/mercury1/repository/mercury/extras/lex/lex.regexp.m,v
> retrieving revision 1.1
> diff -u -r1.1 lex.regexp.m
> --- lex.regexp.m	2001/02/21 16:29:37	1.1
> +++ lex.regexp.m	2001/08/14 10:25:54
> @@ -1,15 +1,19 @@
>
%-----------------------------------------------------------------------
------ %
> -% lex.regexp.m
> -% Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
> -% Fri Aug 18 06:43:09 BST 2000
>  % vim: ts=4 sw=4 et tw=0 wm=0 ff=unix
> -%
> -% Converts basic regular expressions into non-deterministic finite
> -% automata (NFAs).
>  %
> +% lex.regexp.m
> +% Fri Aug 18 06:43:09 BST 2000
> +% Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
>  %   THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
>  %   BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
>  %   BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
> +% Thu Jul 26 07:45:47 UTC 2001
> +% Copyright (C) 2001 The Rationalizer Intelligent Software AG
> +%   The changes made by Rationalizer are contributed under the terms
> +%   of the GNU Lesser General Public License.
> +%
> +% Converts basic regular expressions into non-deterministic finite
> +% automata (NFAs).
>  %
>  %----------------------------------------------------------------------------- %
> 
> @@ -53,16 +57,16 @@
> 
>      % The primitive regexps.
> 
> -compile(X, null,       Y, [null(X, Y)]) --> [].
> +compile(X, eps,        Y, [null(X, Y)]) --> [].
> 
>  compile(X, atom(C),    Y, [trans(X, C, Y)]) --> [].
> 
> -compile(X, (RA >> RB), Y, TsA ++ TsB) -->
> +compile(X, conc(RA,RB), Y, TsA ++ TsB) -->
>      counter__allocate(Z),
>      compile(X, RA, Z, TsA),
>      compile(Z, RB, Y, TsB).
> 
> -compile(X, (RA \/ RB), Y, TsA ++ TsB) -->
> +compile(X, alt(RA,RB), Y, TsA ++ TsB) -->

Cosmetic: s/alt(RA,RB)/alt(RA, RB)/

>      compile(X, RA, Y, TsA),
>      compile(X, RB, Y, TsB).
> 
> Index: samples/demo.m
> ===================================================================
> RCS file: /home/mercury1/repository/mercury/extras/lex/samples/demo.m,v
> retrieving revision 1.1
> diff -u -r1.1 demo.m
> --- samples/demo.m	2001/02/21 16:29:39	1.1
> +++ samples/demo.m	2001/08/14 10:25:54
> @@ -1,13 +1,18 @@
>  %----------------------------------------------------------------------------- %
>  % demo.m
> -% Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
>  % Sun Aug 20 18:11:42 BST 2000
> -% vim: ts=4 sw=4 et tw=0 wm=0 ff=unix ft=mercury
> -%
> +% Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
>  %   THIS FILE IS HEREBY CONTRIBUTED TO THE MERCURY PROJECT TO
>  %   BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
>  %   BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
> +% Thu Jul 26 07:45:47 UTC 2001
> +% Copyright (C) 2001 The Rationalizer Intelligent Software AG
> +%   The changes made by Rationalizer are contributed under the terms
> +%   of the GNU General Public License - see the file COPYING in the
> +%   Mercury Distribution.
>  %
> +% vim: ts=4 sw=4 et tw=0 wm=0 ff=unix ft=mercury
> +%
>  %----------------------------------------------------------------------------- %
> 
>  :- module demo.
> @@ -23,7 +28,7 @@
> 
>  :- implementation.
> 
> -:- import_module char, string, exception, array, list.
> +:- import_module char, string, exception, array, list, std_util, int.
>  :- import_module lex.
> 
>  %----------------------------------------------------------------------------- %
> @@ -40,7 +45,7 @@
> 
>  "),
> 
> -    { Lexer  = lex__init(lexemes, lex__read_from_stdin) },
> +    { Lexer  = lex__init(lexemes, lex__read_from_stdin, ignore(space)) },
>      call((pred(IO0::di, IO::uo) is det :-
>              State0 = lex__start(Lexer, IO0),
>              tokenise_stdin(State0, State),
> @@ -66,48 +71,53 @@
>  %----------------------------------------------------------------------------- %
> 
>  :- type token
> -    --->    noun
> -    ;       comment
> -    ;       integer
> -    ;       real
> -    ;       verb
> -    ;       conj
> -    ;       prep
> +    --->    noun(string)
> +    ;       comment(string)
> +    ;       integer(int)
> +    ;       real(float)
> +    ;       verb(string)
> +    ;       conj(string)
> +    ;       prep(string)
>      ;       punc
> +    ;       space
>      .
> 
> -:- func lexemes = list(annotated_lexeme(token)).
> +:- func lexemes = list(lexeme(token)).
> 
>  lexemes = [
> -    lexeme(value(comment),
> -			(atom('%') >> junk)),
> -
> -    lexeme(value(integer),
> -			(signed_int)),
> -
> -    lexeme(value(real),
> -            (real)),
> -
> -    lexeme(value(noun), str("cat")),
> -    lexeme(value(noun), str("dog")),
> -    lexeme(value(noun), str("rat")),
> -    lexeme(value(noun), str("mat")),
> -
> -    lexeme(value(verb),
> -			(str("sat") \/ str("caught") \/ str("chased"))),
> +    ("%" ++ junk)        -  (func(Match) = comment(Match)) ,
> +    (signed_int)         -  (func(Match) = integer(string__det_to_int(Match))),
> +    (real)               -  (func(Match) = real(det_string_to_float(Match))) ,
> +    re("cat")            -  (func(Match) = noun(Match)) ,
> +    re("dog")            -  (func(Match) = noun(Match)) ,
> +    % Using `->' instead of `-' makes it unnecessary to call `re':
> +    "rat"                -> (func(Match) = noun(Match)) ,
> +    "mat"                -> (func(Match) = noun(Match)) ,
> +    % Here we use `or'
> +    ("sat" or "caught" or "chased") -
> +	                    (func(Match) = verb(Match)) ,
> +    ("and" or "then")    -
> +	                    (func(Match) = conj(Match)) ,
> +    % Now we use `\/', which is the same as `or'. We would like to
> +    % know from you which one looks nicer.
> +    ("the" \/ "it" \/ "them" \/ "to" \/ "on") -
> +	                    (func(Match) = prep(Match)) ,
> +    any("~!@#$%^&*()_+`-={}|[]\\:"";'<>?,./") -
> +                            return(punc) ,
> +
> +    whitespace           -  return(space)
> +].
> 

Cosmetic?  Might be better to just encourage people to use `->'
rather than `-'.  Note that getting the precedence right means
putting parentheses around each `RegExp -> TokenCreator' pair.

> -    lexeme(value(conj),
> -			(str("and") \/ str("then"))),
> +:- func det_string_to_float(string) = float.
> 
> -    lexeme(value(prep),
> -			(str("the") \/ str("it") \/ str("them") \/ str("to") \/ str("on"))),
> +:- import_module require.
> 
> -    lexeme(noval(punc),
> -			(any("~!@#$%^&*()_+`-={}|[]\\:"";'<>?,./"))),
> +det_string_to_float(String) = Float :-
> +	string__to_float(String,Float)
> +        ;
> +	error("Floating point number overflow").

This is not deterministic: you need an if-then-else rather than a
disjunction:

det_string_to_float(String) = Float :-
    ( if   string__to_float(String, Float0)
      then Float = Float0
      else error("...")
    ).

(error/1 in the require module is a predicate, not a function, so it
can't appear in expression position; the clause form above avoids that.)

> 
> -    lexeme(ignore,
> -			whitespace)
> -].
> -
>
%-----------------------------------------------------------------------
------ %
>
%-----------------------------------------------------------------------
------ %
> +
> +
--------------------------------------------------------------------------
mercury-reviews mailing list
post:  mercury-reviews at cs.mu.oz.au
administrative address: owner-mercury-reviews at cs.mu.oz.au
unsubscribe: Address: mercury-reviews-request at cs.mu.oz.au Message: unsubscribe
subscribe:   Address: mercury-reviews-request at cs.mu.oz.au Message: subscribe
--------------------------------------------------------------------------
