[m-dev.] For review: general purpose lexer module

Fergus Henderson fjh@cs.mu.OZ.AU
Sat Feb 10 15:45:23 AEDT 2001


On 09-Feb-2001, Ralph Becket <rbeck@microsoft.com> wrote:
> This is a mostly efficient lexer module I wrote that will handle
> arbitrary regular expressions and is easy to use.

At first glance that looks good.

> I propose that, if accepted, this should go in the standard library
> since it is of general utility.

Hmm, perhaps... I'd be inclined to put it in `extras' for now.
If it gets used a lot or if there is much demand for it then
we can move it into the standard library, as was done e.g. for
the `exceptions' module.  It's much harder to take things out
of the standard library than to put them in, so we should be
conservative about doing that...


When adding stuff to `extras', you should modify extras/README
and RELEASE_NOTES to mention the new stuff you've added.

> :- module test_lex.

> main -->
...
>     { Kludge =
>         ( pred(IO0::di, IO::uo) is det :-
>             State0 = lex__start(Lexer, IO0),
>             tokenise_stdin(State0, State),
>             IO     = lex__stop(State)
>         )
>     },
>     Kludge.

I suggest writing that as

	main -->
		...
		call((pred(IO0::di, IO::uo) is det :-
			State0 = lex__start(Lexer, IO0),
			tokenise_stdin(State0, State),
			IO     = lex__stop(State)
		)).

(In DCG syntax, `call' supplies the two implicit DCG arguments to the
closure, so there's no need for the intermediate `Kludge' variable.)

lex.m:
>     % Stop a running instance of a lexer and retrieve the input source.
>     %
>     % XXX Er, the mode isn't quite right.  The result won't be unique
>     % if the input source wasn't.  This is the price one pays for
>     % not having nested unique modes and polymorphic modes yet.
>     %
> :- func stop(lexer_state(_Tok, Src)) = Src.
> :- mode stop(di) = uo is det.

I don't understand the comment about uniqueness.

>     % Sometimes (e.g. when lexing the io__state) you wan't access to the
>     % input stream without interrupting the lexing process.  This pred.
>     % gives that sort of access.
>     %
> :- pred manipulate_source(pred(Src, Src),
>                 lexer_state(_Tok, Src), lexer_state(_Tok, Src)).
> :- mode manipulate_source(pred(di, uo) is det, di, uo) is det.

Why `_Tok' rather than `Tok'?

s/wan't/want/

>     % It's much more convenient (especially for integration with, e.g.
>     % parsers such as moose) to package up the lexer_instance, buf
>     % and Src in a single object.
>     %
>     % XXX Unfortunately, we don't yet have nested unique objects, so
>     % we have to fake it out with a judicious bit of C.  Forgive me...
>     % Note that this code uses non-public parts of the Mercury runtime
>     % (e.g. MR_GC_malloc) and is therefore on somewhat dodgy ground.
> 
> :- type lexer_state(_Tok, _Src)
>     --->    lexer_state(c_pointer).
> 
> :- pragma c_header_code("
>     struct c_lexer_state {
>         MR_Word lexer_instance;
>         MR_Word buf;
>         MR_Word src;
>     };
> ").

I don't see why you need to use C here.
Is there some reason why `unsafe_promise_unique' won't do the trick?
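For example, something along these lines (an untested sketch; I'm
guessing at the exact types of the three components):

	:- type lexer_state(Tok, Src)
		--->	lexer_state(lexer_instance(Tok, Src), buf, Src).

	args_lexer_state(Instance, Buf, Src, LexerState) :-
		LexerState0 = lexer_state(Instance, Buf, Src),
		unsafe_promise_unique(LexerState0, LexerState).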

> :- pragma c_code(
>     args_lexer_state(RL::in(lexer_instance), B::array_di, S::in, LS::out),
>     [will_not_call_mercury],
>     "
>     LS = (MR_Word) MR_GC_malloc(sizeof(struct c_lexer_state));
>     ((struct c_lexer_state *) LS)->lexer_instance = RL;
>     ((struct c_lexer_state *) LS)->buf           = B;
>     ((struct c_lexer_state *) LS)->src           = S;
> ").

That code won't work.

> :- module lex__buf.
...
>     % XXX I suspect (from looking at the source) that array.m
>     % allocates one word per char for a char array.  This, of
>     % course, is shockingly wasteful.  Should we have a
>     % dedicated char_array data type (and bool_array and ...)
>     % or should array.m do packing where possible?
>     %
> :- type buf
>     ==      array(char).

It's not possible for `array(T)' to pack in the case when `T' is
`char', at least not without making the representation of `array(T)'
depend on `T': polymorphic code is compiled once for all element
types, so each element slot has to be a full word.  A dedicated
`char_array' type would be the way to go if the space matters.
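Just sketching what such an interface might look like (the names and
modes here are my invention):

	:- type char_array.

		% char_array__new(Size, InitialValue) = Array.
	:- func char_array__new(int, char) = char_array.
	:- mode char_array__new(in, in) = uo is det.

	:- func char_array__lookup(char_array, int) = char.
	:- mode char_array__lookup(in, in) = out is det.

	:- func char_array__set(char_array, int, char) = char_array.
	:- mode char_array__set(di, in, in) = uo is det.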

> read_from_stdin(_Offset, Result) -->
>     io__read_char(IOResult),
>     { IOResult = ok(Char),              Result = ok(Char)
>     ; IOResult = eof,                   Result = eof
>     ; IOResult = error(E),              io__error_message(E, M), throw(M)
>     }.

I think it might be better to have just

      ; IOResult = error(_E),              throw(IOResult)

so that the thrown exception still carries the structured `io__error'
value rather than just its message string.


> :- pred compile_transitions_for_state(int, list(byte_transition), row,
>             transitions, transitions, transitions).
> :- mode compile_transitions_for_state(in, in, array_uo,
>             in(atom_transitions), in(atom_transitions),
> out(atom_transitions))
>                     is det.
> 
>     % XXX Wrong - need to sort on chars.
>     % Use parameterised list__sort.
> compile_transitions_for_state(_, IBTs, Row, [], Ts, Ts) :-
>     Row = array(list__sort(groovy_cmp, IBTs)).
> 
> compile_transitions_for_state(I, IBTs0, Row, [T | Ts0], Ts1, Ts) :-
>     ( if T = trans(I, C, Y) then
>         IBTs = [(Y << 8) \/ char__to_int(C) | IBTs0],
>         compile_transitions_for_state(I, IBTs, Row, Ts0, Ts1, Ts)
>       else
>         compile_transitions_for_state(I, IBTs0, Row, Ts0, [T | Ts1], Ts)
>     ).
> 
> 
> 
> :- func groovy_cmp(int, int) = comparison_result.
> 
> groovy_cmp(X, Y) = R :-
>     compare(R, X /\ 0xff, Y /\ 0xff).


I'm not quite sure what's happening in that code,
but it looks a bit dodgy.  Some more comments might help.
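For instance, if I'm reading it right, each byte_transition is being
packed as `(TargetState << 8) \/ char__to_int(Char)', and `groovy_cmp'
compares just the low eight bits, i.e. it sorts on the chars.  If so,
comments to that effect would help a lot, e.g.

		% A byte_transition packs a target state number and a
		% char into a single int, as (State << 8) \/ CharCode,
		% so comparing the low 8 bits sorts transitions by char.
	:- func groovy_cmp(int, int) = comparison_result.

	groovy_cmp(X, Y) = R :-
		compare(R, X /\ 0xff, Y /\ 0xff).

and in that case the `XXX Wrong - need to sort on chars' comment above
looks stale, since sorting on chars appears to be exactly what this
code does.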

> % lex.regexp.m
...
> % Converts basic regular expressions into finite non-deterministic
> % automata.

I suggest adding `(FNA)' at the end here, to make it clear
what you mean later on when you use the acronym FNA.

But back when I was doing second year, I'm pretty sure we used to call
those things `NFAs' (non-deterministic finite automata) rather than
`FNAs' (finite non-deterministic automata).  Is your use of the term
`FNA' following some textbook?  Do you know which term is more
commonly used?

>     % Turn a regexp into an FNA.
>     %
> :- func regexp_to_fna(regexp) = state_mc.

Acronyms should be uppercase: s/regexp_to_fna/regexp_to_FNA/

> % Some useful single-char regexps.
> 
> :- pragma memo(digit/0).
> :- pragma memo(lower/0).
> :- pragma memo(upper/0).
> :- pragma memo(wspc/0).
> :- pragma memo(dot/0).
> 
> digit      = any("0123456789").
> lower      = any("abcdefghijklmnopqrstuvwxyz").
> upper      = any("ABCDEFGHIJKLMNOPQRSTUVWXYZ").
> wspc       = any(" \t\n\r\f\v").
> dot        = anybut("\n").
> alpha      = (lower \/ upper).
> alphanum   = (alpha \/ digit).
> identstart = (alpha \/ atom('_')).
> ident      = (alphanum \/ atom('_')).
> nl         = atom('\n').
> tab        = atom('\t').
> spc        = atom(' ').

I think it would be helpful to document why you chose to use
`pragma memo' for some of these but not all of them.
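E.g. something like the following, though the rationale in the comment
is only my guess at what you had in mind:

		% The regexps built with any/1 and anybut/1 construct
		% a fresh structure on each call, so we memoise them;
		% the remaining ones are not worth memoising.
	:- pragma memo(digit/0).
	...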

Apart from those issues, this looks good, although I haven't
really examined the interface design in much depth.
But I'd be happy for you to add this to the extras now,
once the issues above are addressed.

-- 
Fergus Henderson <fjh@cs.mu.oz.au>  |  "I have always known that the pursuit
                                    |  of excellence is a lethal habit"
WWW: <http://www.cs.mu.oz.au/~fjh>  |     -- the last words of T. S. Garp.


