[m-rev.] lex and moose changed
Holger Krug
hkrug at rationalizer.com
Thu Aug 2 17:07:01 AEST 2001
On Thu, Aug 02, 2001 at 10:58:17AM +1000, Peter Schachte wrote:
> Ralph and Holger:
>
> Maybe the best way to resolve this is for each of you to post a simpler
> lexer using your preferred scheme. Hopefully on seeing both, the preferable
> approach will become clear.
>
> As a suggested application, here's a token type I'd like to have produced by
> the lexer:
>
> :- type token --->
> plus ; minus ; times ; divide ; int(int) ; ident(string).
>
> Hopefully this is simple enough to be easy to code, and complex enough to
> illustrate the differences between the approaches.
Great proposal. Thanks! I'll try to answer for both sides and allow Ralph
to kill me (but only within 3 hours after having read this email, and only
in person, so he would have to come to Berlin within 3 hours) if I get
something wrong about his approach.
Ralph's solution (code prefixed with line numbers):
---------------------------------------------------
% The token type does not contain the values, only the token names.
01 :- type token --->
02         plus ; minus ; times ; divide ; int ; ident.

03 :- func lexemes = list(annotated_lexeme(token)).

04 lexemes = [    % lexeme(annotated token, regular expression)
05     lexeme(noval(plus)  , atom('+')  ),
06     lexeme(noval(minus) , atom('-')  ),
07     lexeme(noval(times) , atom('*')  ),
08     lexeme(noval(divide), atom('/')  ),
09     lexeme(value(int)   , signed_int ),
10     lexeme(value(ident) , identifier )
11 ].
% Using these lexemes we now create a lexer, to which a character source may
% be attached and from which tokens may be read. Because there is no
% difference between the two approaches here, I won't show that code. The
% difference lies only in the result, which is of type `lexer_result(token)'
% and is obtained by `lex__read':
:- type lex__lexer_result(Token)
---> ok(Token) % Noval token matched.
; ok(Token, string) % Value token matched.
; eof % End of input.
; error(int). % No matches for string at this offset.
:- pred lex__read(lexer_result(Tok),
lexer_state(Tok, Src), lexer_state(Tok, Src)).
:- mode lex__read(out, di, uo) is det.
% `lexer_result(token)' gives the token and, in the case of a value token,
% the string matched by the regular expression associated with the token
% value in the lexeme. To avoid having the parser deal with strings when
% they really represent integers or something else, Ralph's approach is to
% wrap `lex__read' in `my_read', which does the conversion:
12 :- pred my_read(parser_input(PTok),
13 lexer_state(Tok, Src), lexer_state(Tok, Src)).
14 :- mode my_read(out, di, uo) is det.
% Here the output type `parser_input(PTok)' is added by me. `PTok' is the
% kind of token the parser expects. `PTok' has to be defined separately,
% because it is not the same as `token'. E.g.:
% to be defined by the user:
15 :- type parser_token --->
16 plus ; minus ; times ; divide ; int(int) ; ident(string).
% to be defined by the parser framework, e.g. moose:
:- type parser_input(PTok) ---> ok(PTok)
; eof
; error(int).
17 :- func convert(token, string) = parser_input(parser_token).
18 my_read(T) -->
19 lex__read(Result),
20 { Result = error(_), ...
21 ; Result = eof, ...
22 ; Result = ok(Token), T = convert(Token, "")
23 ; Result = ok(Token, String), T = convert(Token, String)
24 }.
% `convert' does the conversion from lexer output to parser input, e.g.
% from strings to ints. The parser may now be fed with `my_read' as its
% token source.
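% To make the comparison concrete, here is a rough sketch of what `convert'
% might look like. This is my guess, not Ralph's actual code; in particular
% the handling of bad integers (returning `error(0)' because the offset is
% not known here) is an assumption. It uses the semidet `string__to_int'
% from the standard `string' module.

convert(plus,   _) = ok(plus).
convert(minus,  _) = ok(minus).
convert(times,  _) = ok(times).
convert(divide, _) = ok(divide).
convert(ident,  String) = ok(ident(String)).
convert(int,    String) = Result :-
    ( string__to_int(String, Int) ->
        Result = ok(int(Int))
    ;
        % Malformed or out-of-range integer; real code would report the
        % offset of the offending token here.
        Result = error(0)
    ).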
My solution:
------------
% The token type is like you propose:
01 :- type token --->
02 plus ; minus ; times ; divide ; int(int) ; ident(string).
% The lexemes are more complicated than in Ralph's approach. This may change
% later on, when we have `lex' not only as a library but as a real code
% generator, as Ralph suggests in his to-do list.
03 lexemes = [  % lexeme( noval(token), regular expression )
               % lexeme( t(func(string) = token), regular expression )
04     lexeme(noval(plus)  , atom('+')  ),
05     lexeme(noval(minus) , atom('-')  ),
06     lexeme(noval(times) , atom('*')  ),
07     lexeme(noval(divide), atom('/')  ),
08     lexeme(t(func(Match) =
09         int(convert_signed_digits_to_string(Match))), signed_int ),
10     lexeme(t(func(Match) = ident(Match)),             identifier )
11 ].
12 :- func convert_signed_digits_to_string(string) = int.
% As you see, the conversion is part of the lexeme definition, based on
% which the lexer is created. `convert_signed_digits_to_string' is
% `string__det_to_int' with overflow check.
% A problem here, I admit, is that in the current implementation there is
% no way to signal an error at this point, e.g. in the case of an integer
% overflow. This has to be added to the implementation. (An exception may
% be thrown inside the converter. If the converter throws an exception,
% `lex__read' catches it and returns the appropriate error message. All of
% this is transparent to the user; I only forgot to implement it.)
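% For illustration, a rough sketch of the converter (an assumption on my
% part, not code that exists yet): it uses `string__to_int' and `throw'
% from the standard `string' and `exception' modules, and the exception
% type is a placeholder.

:- type conversion_error
    --->    integer_conversion_error(string).

convert_signed_digits_to_string(String) = Int :-
    ( string__to_int(String, Int0) ->
        Int = Int0
    ;
        % Assuming string__to_int fails on out-of-range input; the thrown
        % exception would be caught by `lex__read' once that is implemented,
        % as described above.
        throw(integer_conversion_error(String))
    ).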
% The library function `lex__read' now does the same as `my_read' does in
% Ralph's case. Ralph has to declare two token types: one for the lexer
% (tokens without values) and another one, containing the necessary values
% after conversion, for the parser. In my case the token types for the
% lexer and the parser may be the same. That is, we are now ready to feed
% the parser with `lex__read' as its input source. The type for parser
% input is the same as the type for lexer output, viz. `lexer_result(token)'.
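% For illustration, here is a rough sketch of a driver in my scheme that
% drains the lexer into a token list. `read_all_tokens' and the
% `lexer_error' type are made up for this example, and I assume the result
% type in my scheme has only the `ok(Token)', `eof' and `error(Offset)'
% alternatives (`throw' comes from the `exception' module).

:- type lexer_error ---> lexer_error(int).

:- pred read_all_tokens(list(token),
        lexer_state(token, Src), lexer_state(token, Src)).
:- mode read_all_tokens(out, di, uo) is det.

read_all_tokens(Tokens) -->
    lex__read(Result),
    (
        { Result = ok(Token) },
        read_all_tokens(Tokens0),
        { Tokens = [Token | Tokens0] }
    ;
        { Result = eof },
        { Tokens = [] }
    ;
        { Result = error(Offset) },
        { throw(lexer_error(Offset)) }
    ).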
Result
------
24 lines of code to be written for Ralph's approach (+ implementation of
`convert' and the dots `...' in `my_read')
12 lines of code to be written for my approach (+ implementation of
`convert_signed_digits_to_string')
--
Holger Krug
hkrug at rationalizer.com