[m-dev.] For review: general purpose lexer module
Fergus Henderson
fjh at cs.mu.OZ.AU
Sat Feb 10 15:45:23 AEDT 2001
On 09-Feb-2001, Ralph Becket <rbeck at microsoft.com> wrote:
> This is a mostly efficient lexer module I wrote that will handle
> arbitrary regular expressions and is easy to use.
At first glance that looks good.
> I propose that, if accepted, this should go in the standard library
> since it is of general utility.
Hmm, perhaps... I'd be inclined to put it in `extras' for now.
If it gets used a lot or if there is much demand for it then
we can move it into the standard library, as was done e.g. for
the `exceptions' module. It's much harder to take things out
of the standard library than to put them in, so we should be
conservative about doing that...
When adding stuff to `extras', you should modify extras/README
and RELEASE_NOTES to mention the new stuff you've added.
> :- module test_lex.
> main -->
...
> { Kludge =
> ( pred(IO0::di, IO::uo) is det :-
> State0 = lex__start(Lexer, IO0),
> tokenise_stdin(State0, State),
> IO = lex__stop(State)
> )
> },
> Kludge.
I suggest writing that as
main -->
...
call((pred(IO0::di, IO::uo) is det :-
State0 = lex__start(Lexer, IO0),
tokenise_stdin(State0, State),
IO = lex__stop(State)
)).
lex.m:
> % Stop a running instance of a lexer and retrieve the input source.
> %
> % XXX Er, the mode isn't quite right. The result won't be unique
> % if the input source wasn't. This is the price one pays for
> % not having nested unique modes and polymorphic modes yet.
> %
> :- func stop(lexer_state(_Tok, Src)) = Src.
> :- mode stop(di) = uo is det.
I don't understand the comment about uniqueness.
> % Sometimes (e.g. when lexing the io__state) you wan't access to the
> % input stream without interrupting the lexing process. This pred.
> % gives that sort of access.
> %
> :- pred manipulate_source(pred(Src, Src),
> lexer_state(_Tok, Src), lexer_state(_Tok, Src)).
> :- mode manipulate_source(pred(di, uo) is det, di, uo) is det.
Why `_Tok' rather than `Tok'?
s/wan't/want/
> % It's much more convenient (especially for integration with, e.g.
> % parsers such as moose) to package up the lexer_instance, buf
> % and Src in a single object.
> %
> % XXX Unfortunately, we don't yet have nested unique objects, so
> % we have to fake it out with a judicious bit of C. Forgive me...
> % Note that this code uses non-public parts of the Mercury runtime
> % (e.g. MR_GC_malloc) and is therefore on somewhat dodgy ground.
>
> :- type lexer_state(_Tok, _Src)
> ---> lexer_state(c_pointer).
>
> :- pragma c_header_code("
> struct c_lexer_state {
> MR_Word lexer_instance;
> MR_Word buf;
> MR_Word src;
> };
> ").
I don't see why you need to use C here.
Is there some reason why `unsafe_promise_unique' won't do the trick?
> :- pragma c_code(
> args_lexer_state(RL::in(lexer_instance), B::array_di, S::in, LS::out),
> [will_not_call_mercury],
> "
> LS = (MR_Word) MR_GC_malloc(sizeof(struct c_lexer_state));
> ((struct c_lexer_state *) LS)->lexer_instance = RL;
> ((struct c_lexer_state *) LS)->buf = B;
> ((struct c_lexer_state *) LS)->src = S;
> ").
That code won't work
> :- module lex__buf.
...
> % XXX I suspect (from looking at the source) that array.m
> % allocates one word per char for a char array. This, of
> % course, is shockingly wasteful. Should we have a
> % dedicated char_array data type (and bool_array and ...)
> % or should array.m do packing where possible?
> %
> :- type buf
> == array(char).
It's not possible for `array(T)' to pack in the case when
> read_from_stdin(_Offset, Result) -->
> io__read_char(IOResult),
> { IOResult = ok(Char), Result = ok(Char)
> ; IOResult = eof, Result = eof
> ; IOResult = error(E), io__error_message(E, M), throw(M)
> }.
I think it might be better to have just
; IOResult = error(_E), throw(IOResult)
> :- pred compile_transitions_for_state(int, list(byte_transition), row,
> transitions, transitions, transitions).
> :- mode compile_transitions_for_state(in, in, array_uo,
> in(atom_transitions), in(atom_transitions),
> out(atom_transitions))
> is det.
>
> % XXX Wrong - need to sort on chars.
> % Use parameterised list__sort.
> compile_transitions_for_state(_, IBTs, Row, [], Ts, Ts) :-
> Row = array(list__sort(groovy_cmp, IBTs)).
>
> compile_transitions_for_state(I, IBTs0, Row, [T | Ts0], Ts1, Ts) :-
> ( if T = trans(I, C, Y) then
> IBTs = [(Y << 8) \/ char__to_int(C) | IBTs0],
> compile_transitions_for_state(I, IBTs, Row, Ts0, Ts1, Ts)
> else
> compile_transitions_for_state(I, IBTs0, Row, Ts0, [T | Ts1], Ts)
> ).
>
>
>
> :- func groovy_cmp(int, int) = comparison_result.
>
> groovy_cmp(X, Y) = R :-
> compare(R, X /\ 0xff, Y /\ 0xff).
I'm not quite sure what's happening in that code,
but it looks a bit dodgy. Some more comments might help.
> % lex.regexp.m
...
> % Converts basic regular expressions into finite non-deterministic
> % automata.
I suggesting adding `(FNA)' at the end here, to make it clear
what you mean later on when you use the acronym FNA.
But back when I was doing second year, I'm pretty sure we used to call
those things `NFAs' (non-deterministic finite automata) rather than
`FNAs' (finite non-deterministic automata). Is your use of the term
`FNA' following some textbook? Do you know which term is more
commonly used?
> % Turn a regexp into an FNA.
> %
> :- func regexp_to_fna(regexp) = state_mc.
Acronyms should be uppercase: s/regexp_to_fna/regexp_to_FNA/
> % Some useful single-char regexps.
>
> :- pragma memo(digit/0).
> :- pragma memo(lower/0).
> :- pragma memo(upper/0).
> :- pragma memo(wspc/0).
> :- pragma memo(dot/0).
>
> digit = any("0123456789").
> lower = any("abcdefghijklmnopqrstuvwxyz").
> upper = any("ABCDEFGHIJKLMNOPQRSTUVWXYZ").
> wspc = any(" \t\n\r\f\v").
> dot = anybut("\n").
> alpha = (lower \/ upper).
> alphanum = (alpha \/ digit).
> identstart = (alpha \/ atom('_')).
> ident = (alphanum \/ atom('_')).
> nl = atom('\n').
> tab = atom('\t').
> spc = atom(' ').
I think it would be helpful to document why you chose to use
`pragma memo' for some of these but not all of them.
Apart from those issues, this looks good, although I haven't
really examined the interface design in much depth.
But I'd be happy for you to add this to the extras now,
once the issues above are addressed.
--
Fergus Henderson <fjh at cs.mu.oz.au> | "I have always known that the pursuit
| of excellence is a lethal habit"
WWW: <http://www.cs.mu.oz.au/~fjh> | -- the last words of T. S. Garp.
--------------------------------------------------------------------------
mercury-developers mailing list
Post messages to: mercury-developers at cs.mu.oz.au
Administrative Queries: owner-mercury-developers at cs.mu.oz.au
Subscriptions: mercury-developers-request at cs.mu.oz.au
--------------------------------------------------------------------------
More information about the developers
mailing list