[m-dev.] For review: general purpose lexer module

Ralph Becket rbeck at microsoft.com
Tue Feb 13 02:25:34 AEDT 2001


[Diff to follow...]

> > I propose that, if accepted, this should go in the standard library
> > since it is of general utility.
> 
> Hmm, perhaps... I'd be inclined to put it in `extras' for now.
> If it gets used a lot or if there is much demand for it then
> we can move it into the standard library, as was done e.g. for
> the `exceptions' module.  It's much harder to take things out
> of the standard library than to put them in, so we should be
> conservative about doing that...

Good policy; I agree.

> When adding stuff to `extras', you should modify extras/README
> and RELEASE_NOTES to mention the new stuff you've added.

Will do.

> I suggest writing that as
> 
> 	main -->
> 		...
> 		call((pred(IO0::di, IO::uo) is det :-
> 			State0 = lex__start(Lexer, IO0),
> 			tokenise_stdin(State0, State),
> 			IO     = lex__stop(State)
> 		)).

Done.

> 
> lex.m:
> >     % Stop a running instance of a lexer and retrieve the input source.
> >     %
> >     % XXX Er, the mode isn't quite right.  The result won't be unique
> >     % if the input source wasn't.  This is the price one pays for
> >     % not having nested unique modes and polymorphic modes yet.
> >     %
> > :- func stop(lexer_state(_Tok, Src)) = Src.
> > :- mode stop(di) = uo is det.
> 
> I don't understand the comment about uniqueness.

The comment is wrong.  Looking at the code, you *have* to supply a
unique source object, so you'll definitely get a unique one back
from stop/1.

I'll go through the code and remove the various modes I added on the
mistaken assumption that non-unique values would be passing through.
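
For stop/1 itself the modes are fine; only the comment needs replacing.
Something like:

    % Stop a running instance of a lexer and retrieve the input source.
    %
    % The source supplied when the lexer was started must be unique,
    % so the source returned here is unique as well.
    %
:- func stop(lexer_state(_Tok, Src)) = Src.
:- mode stop(di) = uo is det.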


> > :- pred manipulate_source(pred(Src, Src),
> >                 lexer_state(_Tok, Src), lexer_state(_Tok, Src)).
> > :- mode manipulate_source(pred(di, uo) is det, di, uo) is det.
> 
> Why `_Tok' rather than `Tok'?
> 
> s/wan't/want/

Typos.  Fixed.

> > :- type lexer_state(_Tok, _Src)
> >     --->    lexer_state(c_pointer).
> > 
> > :- pragma c_header_code("
> >     struct c_lexer_state {
> >         MR_Word lexer_instance;
> >         MR_Word buf;
> >         MR_Word src;
> >     };
> > ").
> 
> I don't see why you need to use C here.
> Is there some reason why `unsafe_promise_unique' won't do the trick?

I wasn't aware of unsafe_promise_unique/2 at the time I wrote this code.
That's a much better solution which I'll use instead.
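
For the record, the pattern I'll switch to is roughly this (a sketch
only, assuming a plain three-field lexer_state/3 functor standing in
for the C struct):

stop(LexerState) = Src :-
        % Illustrative functor; the real field types may differ.
    LexerState = lexer_state(_Instance, _Buf, Src0),
        % The deconstruction loses Src0's uniqueness as far as the mode
        % checker is concerned; nothing else refers to the source once
        % the lexer_state is discarded, so promising it back is safe.
    unsafe_promise_unique(Src0, Src).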

> > :- pragma c_code(
> >     args_lexer_state(RL::in(lexer_instance), B::array_di, S::in, LS::out),
> >     [will_not_call_mercury],
> >     "
> >     LS = (MR_Word) MR_GC_malloc(sizeof(struct c_lexer_state));
> >     ((struct c_lexer_state *) LS)->lexer_instance = RL;
> >     ((struct c_lexer_state *) LS)->buf           = B;
> >     ((struct c_lexer_state *) LS)->src           = S;
> > ").
> 
> That code won't work

Well, the test program worked fine!  But this is all going to be ripped
out before I check anything in.

> >     % XXX I suspect (from looking at the source) that array.m
> >     % allocates one word per char for a char array.  This, of
> >     % course, is shockingly wasteful.  Should we have a
> >     % dedicated char_array data type (and bool_array and ...)
> >     % or should array.m do packing where possible?
> >     %
> > :- type buf
> >     ==      array(char).
> 
> It's not possible for `array(T)' to pack in the case when

I think it's time to write byte_array.m and have char_array as a
special case.  I'll do that.
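
Rough sketch of the interface I have in mind (nothing is written yet,
so all the names below are provisional), with char_array then being a
thin wrapper that converts to and from char at the boundary:

    % byte_array.m -- sketch only; names provisional.
    %
:- module byte_array.
:- interface.

:- type byte_array.

    % init(Size, Byte) = a new array of Size bytes, each set to Byte.
    % Byte must lie in 0..255.
    %
:- func init(int, int) = byte_array.
:- mode init(in, in) = uo is det.

:- func lookup(byte_array, int) = int.
:- mode lookup(in, in) = out is det.

    % Destructively update the byte at the given index.
    %
:- func set(byte_array, int, int) = byte_array.
:- mode set(di, in, in) = uo is det.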

> > read_from_stdin(_Offset, Result) -->
> >     io__read_char(IOResult),
> >     { IOResult = ok(Char),              Result = ok(Char)
> >     ; IOResult = eof,                   Result = eof
> >     ; IOResult = error(E),              io__error_message(E, M), throw(M)
> >     }.
> 
> I think it might be better to have just
> 
>       ; IOResult = error(_E),              throw(IOResult)

Yes - fixed.
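
So the clause now reads (modulo layout):

read_from_stdin(_Offset, Result) -->
    io__read_char(IOResult),
    {   IOResult = ok(Char),    Result = ok(Char)
    ;   IOResult = eof,         Result = eof
    ;   IOResult = error(_E),   throw(IOResult)
    }.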

> > compile_transitions_for_state(I, IBTs0, Row, [T | Ts0], Ts1, Ts) :-
> >     ( if T = trans(I, C, Y) then
> >         IBTs = [(Y << 8) \/ char__to_int(C) | IBTs0],
> >         compile_transitions_for_state(I, IBTs, Row, Ts0, Ts1, Ts)
> >       else
> >         compile_transitions_for_state(I, IBTs0, Row, Ts0, [T | Ts1], Ts)
> >     ).
> > 
> > 
> > 
> > :- func groovy_cmp(int, int) = comparison_result.
> > 
> > groovy_cmp(X, Y) = R :-
> >     compare(R, X /\ 0xff, Y /\ 0xff).
> 
> 
> I'm not quite sure what's happening in that code,
> but it looks a bit dodgy.  Some more comments might help.

Yes - here I'm doing some bit-packing.  The transition function for each
state is represented as an array of {Char, NextState} pairs sorted by
Char so we can use binary search to do fast lookups (*).

To save a few bytes here and there, the {Char, NextState} pairs are
encoded as ints where the top 24 bits record NextState and the low
8 bits record Char.  More commenting and some sensible packing/unpacking
functions (sketched below) will be going in there instead of in-line bit magic.

(*) Since most states seem to have very few transitions (transitions to
the error state are not recorded) it may well be more efficient in general
to just do a linear scan rather than a binary search.  The most common
counter-example to this is the `dot' regexp which matches anything except
newline.
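
Something along these lines is what I mean by sensible packing and
unpacking functions (again, names are provisional):

    % A transition is packed as (NextState << 8) \/ Char, so NextState
    % occupies the top 24 bits and Char (which must fit in 8 bits) the
    % bottom 8.  (Provisional names.)
    %
:- func pack_transition(int, char) = int.

pack_transition(NextState, Char) = (NextState << 8) \/ char__to_int(Char).

:- func packed_char(int) = int.

packed_char(Packed) = Packed /\ 0xff.

:- func packed_next_state(int) = int.

packed_next_state(Packed) = Packed >> 8.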

> > % Converts basic regular expressions into finite non-deterministic
> > % automata.
> 
> I suggest adding `(FNA)' at the end here, to make it clear
> what you mean later on when you use the acronym FNA.

Will do.

> But back when I was doing second year, I'm pretty sure we used to call
> those things `NFAs' (non-deterministic finite automata) rather than
> `FNAs' (finite non-deterministic automata).  Is your use of the term
> `FNA' following some textbook?  Do you know which term is more
> commonly used?

FNA and FDA were the acronyms used in the course I studied, but I think
NFA and DFA are more popular outside Cambridge.  I'll change over to
the latter.

> Acronyms should be uppercase: s/regexp_to_fna/regexp_to_FNA/

Hmm.  I have a function `fna_to_fda' which would need renaming to
something like `convert_FNA_to_FDA'.  Okay, I'll do that.

> > % Some useful single-char regexps.
> > 
> > :- pragma memo(digit/0).
> > :- pragma memo(lower/0).
> > :- pragma memo(upper/0).
> > :- pragma memo(wspc/0).
> > :- pragma memo(dot/0).
> > 
> > digit      = any("0123456789").
> > lower      = any("abcdefghijklmnopqrstuvwxyz").
> > upper      = any("ABCDEFGHIJKLMNOPQRSTUVWXYZ").
> > wspc       = any(" \t\n\r\f\v").
> > dot        = anybut("\n").
> > alpha      = (lower \/ upper).
> > alphanum   = (alpha \/ digit).
> > identstart = (alpha \/ atom('_')).
> > ident      = (alphanum \/ atom('_')).
> > nl         = atom('\n').
> > tab        = atom('\t').
> > spc        = atom(' ').
> 
> I think it would be helpful to document why you chose to use
> `pragma memo' for some of these but not all of them.

Will do.  The idea here is that these things are (a) common, (b) take
*some* time to compute, and (c) constants, hence it's worth telling the
compiler to memo them.  Are these pragmas redundant?  Will the compiler
fully evaluate constant functions?

> Apart from those issues, this looks good, although I haven't
> really examined the interface design in much depth.
> But I'd be happy for you to add this to the extras now,
> once the issues above are addressed.

Okay, I'll get the above fixed.

--
Ralph Becket      |      MSR Cambridge      |      rbeck at microsoft.com 
