[m-rev.] lex and moose changed

Ralph Becket rbeck at microsoft.com
Thu Aug 2 19:40:32 AEST 2001


> From: Holger Krug [mailto:hkrug at rationalizer.com] 
> Sent: 02 August 2001 08:07
> 
> Great proposal. Thanks!  I'll try to answer for both sides and allow
> Ralph to kill me (but only within 3 hours of having read this email
> and only in person, so he'd have to come to Berlin within 3 hours) if
> I get something wrong about his approach.

Damn, I'll never get a flight now!

> Ralph's solution (code prefixed with line numbers):
> ---------------------------------------------------
> 
> % the token type does not contain the values, only the token names
> 01 :- type token --->
> 02         plus ; minus ; times ; divide ; int ; ident.
> 
> 03 :- func lexemes = list(annotated_lexeme(token)).
> 
> 04 lexemes = [ % lexeme(annotated token, regular expression)
> 05             lexeme(noval(plus) , atom('+') ),
> 06             lexeme(noval(minus) , atom('-') ),
> 07             lexeme(noval(times) , atom('*') ),
> 08             lexeme(noval(divide) , atom('/')),
> 09             lexeme(value(int) , signed_int ),
> 10             lexeme(value(ident) , identifier )
> 11           ]

This is clean, IMHO: each regexp is named and there is minimal
processing clutter around it.

> 12 :- pred my_read(parser_input(PTok), 
> 13                 lexer_state(Tok, Src), lexer_state(Tok, Src)).
> 14 :- mode my_read(out, di, uo) is det.
>
> 15 :- type parser_token --->
> 16             plus ; minus ; times ; divide ; int(int) ; ident(string).
> 
> % to be defined by the parser framework, e.g. moose:
> :- type parser_input(PTok) ---> ok(PTok)
>                            ;    eof
>                            ;    error(int).
> 
> 17 :- func convert(token, string) = parser_input(parser_token).

Filling in (and assuming string__to_int fails on integer overflow) we get:

17a convert(plus,   _)    = ok(plus).
17b convert(minus,  _)    = ok(minus).
17c convert(times,  _)    = ok(times).
17d convert(divide, _)    = ok(divide).
17e convert(int,    IStr) = ( if string__to_int(IStr, N) then ok(int(N))
                              else error(...) ).
17f convert(ident,  Id)   = ok(ident(Id)).

This is fine.  I can return error tokens to the parser this way.  If I
didn't need to do that I could get rid of the ok(_) wrappers and instead
do that in my_read//1 below.

I suppose the duplication could be a source of bugs.  It's certainly
less attractive than having just one token type.

> 18 my_read(T) -->
> 19        lex__read(Result),
> 20        {       Result = error(_), ...
> 21        ;       Result = eof, ...
> 22        ;       Result = ok(Token),         T = convert(Token, "")
> 23        ;       Result = ok(Token, String), T = convert(Token, String)
> 24        }.
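
For concreteness, the elided cases might be filled in along these
lines (just a sketch: I'm assuming lex__read's error/1 carries an int
offset that can be passed straight through to the parser's error/1):

my_read(T) -->
        lex__read(Result),
        {       Result = error(Offset),     T = error(Offset)
        ;       Result = eof,               T = eof
        ;       Result = ok(Token),         T = convert(Token, "")
        ;       Result = ok(Token, String), T = convert(Token, String)
        }.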

> My solution:
> ------------
> 
> % The token type is like you propose:
> 01 :- type token --->
> 02               plus ; minus ; times ; divide ; int(int) ; ident(string).
> 
> 03 lexemes = [ % lexeme( noval(token), regular expression )
>               % lexeme( t(func(string) = token), regular expression )
> 04            lexeme(noval(plus) , atom('+') ),
> 05            lexeme(noval(minus) , atom('-') ),
> 06            lexeme(noval(times) , atom('*') ),
> 07            lexeme(noval(divide) , atom('/')),
> 08            lexeme(t(func(Match) =
> 09                int(convert_signed_digits_to_int(Match))), signed_int ),
> 10            lexeme(t(func(Match) = ident(Match)), identifier )
> 11          ]

I feel there's a little bit too much going on here, but looking at it
things don't seem nearly as bad as I'd imagined.

I've got some minor suggestions for improving the syntax if this is
the preferred option.

Rather than the somewhat bulky lexeme(AnnotatedToken, RegExp) we
might want to move to a more traditional looking form, something
like

lexemes = [
        atom('+')         - return(plus),
        atom('-')         - return(minus),
        ...
        signed_int        - (func(IStr) = int(string__det_to_int(IStr))),
        identifier        - (func(Id)   = ident(Id))
].

Where we have return/2 as a utility function:

:- func return(T, string) = T.
return(Token, _) = Token.
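
(Note that return(plus) in the list above is a partial application:
Mercury curries the func(T, string) = T down to a closure of type
func(string) = token, so every list element pairs a regexp with a
token creator of one uniform type.)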

This would require extracting the string component for every matched
regexp, but that's surely small beer in performance terms.

Okay, I'm sold on the idea!

While we're at it, I suggest changing the regex operator names
{/\, \/, star, plus, opt} to {++, or, *, +, ?} respectively.
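
For example (a sketch only, assuming lex supplies atom/1 and a digit
char-class regexp), a signed integer pattern would then turn from

        signed_int = opt(atom('-')) /\ plus(digit)

into

        signed_int = ?(atom('-')) ++ +(digit)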

> % A problem here, I admit, is that in the current implementation there
> % is no way to signal an error at this place in the case of e.g. an
> % integer overflow. This has to be added to the implementation.
> % (An exception may be thrown inside the converter. If the converter
> % throws an exception, `lex__read' catches the exception and returns
> % the appropriate error message. All this is transparent to the user,
> % I only forgot to implement it.)

Yes, I'd go along with that.
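
Concretely, a converter that throws on overflow might look something
like this (a sketch only: overflow/1 is a made-up exception term, I'm
assuming string__to_int fails on overflow, and it needs
:- import_module exception, string):

        lexeme(t(func(Match) = Tok :-
                    ( if string__to_int(Match, N) then
                        Tok = int(N)
                      else
                        throw(overflow(Match))  % caught by lex__read
                    )),
               signed_int)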

- Ralph