[m-dev.] For review: general purpose lexer module
Ralph Becket
rbeck at microsoft.com
Tue Feb 13 02:25:34 AEDT 2001
[Diff to follow...]
> > I propose that, if accepted, this should go in the standard library
> > since it is of general utility.
>
> Hmm, perhaps... I'd be inclined to put it in `extras' for now.
> If it gets used a lot or if there is much demand for it then
> we can move it into the standard library, as was done e.g. for
> the `exceptions' module. It's much harder to take things out
> of the standard library than to put them in, so we should be
> conservative about doing that...
Good policy; I agree.
> When adding stuff to `extras', you should modify extras/README
> and RELEASE_NOTES to mention the new stuff you've added.
Will do.
> I suggest writing that as
>
>     main -->
>         ...
>         call((pred(IO0::di, IO::uo) is det :-
>             State0 = lex__start(Lexer, IO0),
>             tokenise_stdin(State0, State),
>             IO = lex__stop(State)
>         )).
Done.
>
> lex.m:
> > % Stop a running instance of a lexer and retrieve the input source.
> > %
> > % XXX Er, the mode isn't quite right. The result won't be unique
> > % if the input source wasn't. This is the price one pays for
> > % not having nested unique modes and polymorphic modes yet.
> > %
> > :- func stop(lexer_state(_Tok, Src)) = Src.
> > :- mode stop(di) = uo is det.
>
> I don't understand the comment about uniqueness.
The comment is wrong. Looking at the code, you *have* to supply a
unique source object, so you'll definitely get a unique one back
from stop/1.
I think I will go through the code and remove the various modes where
I mistakenly thought non-unique stuff would be going through.
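The corrected declaration for stop/1 would then read something like this
(the comment wording is just a sketch):

        % Stop a running instance of a lexer and retrieve the input source.
        % The result is guaranteed unique because a unique source had to
        % be supplied to start/2 in the first place.
        %
    :- func stop(lexer_state(_Tok, Src)) = Src.
    :- mode stop(di) = uo is det.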
> > :- pred manipulate_source(pred(Src, Src),
> >             lexer_state(_Tok, Src), lexer_state(_Tok, Src)).
> > :- mode manipulate_source(pred(di, uo) is det, di, uo) is det.
>
> Why `_Tok' rather than `Tok'?
>
> s/wan't/want/
Typos. Fixed.
> > :- type lexer_state(_Tok, _Src)
> >     --->    lexer_state(c_pointer).
> >
> > :- pragma c_header_code("
> >     struct c_lexer_state {
> >         MR_Word lexer_instance;
> >         MR_Word buf;
> >         MR_Word src;
> >     };
> > ").
>
> I don't see why you need to use C here.
> Is there some reason why `unsafe_promise_unique' won't do the trick?
I wasn't aware of unsafe_promise_unique/2 at the time I wrote this code.
That's a much better solution which I'll use instead.
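Roughly, the all-Mercury version would be along these lines (just a sketch:
the field types are illustrative and follow the current C struct, and the
details may well change in the rewrite):

    :- type lexer_state(Tok, Src)
        --->    lexer_state(
                    lexer_instance(Tok),    % the compiled lexer
                    buf,                    % the read-ahead buffer
                    Src                     % the input source
                ).

    args_lexer_state(Instance, Buf, Src, LexerState) :-
        LS = lexer_state(Instance, Buf, Src),
        unsafe_promise_unique(LS, LexerState).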
> > :- pragma c_code(
> >     args_lexer_state(RL::in(lexer_instance), B::array_di, S::in, LS::out),
> >     [will_not_call_mercury],
> > "
> >     LS = (MR_Word) MR_GC_malloc(sizeof(struct c_lexer_state));
> >     ((struct c_lexer_state *) LS)->lexer_instance = RL;
> >     ((struct c_lexer_state *) LS)->buf = B;
> >     ((struct c_lexer_state *) LS)->src = S;
> > ").
>
> That code won't work
Well, the test program worked fine! But this is all going to be ripped
out before I check anything in.
> > % XXX I suspect (from looking at the source) that array.m
> > % allocates one word per char for a char array. This, of
> > % course, is shockingly wasteful. Should we have a
> > % dedicated char_array data type (and bool_array and ...)
> > % or should array.m do packing where possible?
> > %
> > :- type buf
> >     == array(char).
>
> It's not possible for `array(T)' to pack in the case when
I think it is time to write byte_array.m and have char_array as a
special case. I'll do that.
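For the interface I have in mind something like the following sketch
(module and function names are not settled):

    :- module byte_array.
    :- interface.
    :- import_module int.

        % A byte_array stores one byte per element, in contrast with
        % array(int), which uses a full word per element.
        %
    :- type byte_array.

        % init(Size, Fill) = ByteArray
        % Allocate a byte_array of Size elements, each initialised to Fill.
        %
    :- func init(int, int) = byte_array.

        % lookup(ByteArray, Index) = Byte
        %
    :- func lookup(byte_array, int) = int.

        % set(ByteArray, Index, Byte) = NewByteArray
        %
    :- func set(byte_array, int, int) = byte_array.

char_array can then be a thin layer over byte_array, converting with
char__to_int at the boundary.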
> > read_from_stdin(_Offset, Result) -->
> >     io__read_char(IOResult),
> >     { IOResult = ok(Char),  Result = ok(Char)
> >     ; IOResult = eof,       Result = eof
> >     ; IOResult = error(E),  io__error_message(E, M), throw(M)
> >     }.
>
> I think it might be better to have just
>
> ; IOResult = error(_E), throw(IOResult)
Yes - fixed.
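The clause now reads roughly as follows:

    read_from_stdin(_Offset, Result) -->
        io__read_char(IOResult),
        { IOResult = ok(Char),  Result = ok(Char)
        ; IOResult = eof,       Result = eof
        ; IOResult = error(_E), throw(IOResult)
        }.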
> > compile_transitions_for_state(I, IBTs0, Row, [T | Ts0], Ts1, Ts) :-
> >     ( if T = trans(I, C, Y) then
> >         IBTs = [(Y << 8) \/ char__to_int(C) | IBTs0],
> >         compile_transitions_for_state(I, IBTs, Row, Ts0, Ts1, Ts)
> >       else
> >         compile_transitions_for_state(I, IBTs0, Row, Ts0, [T | Ts1], Ts)
> >     ).
> >
> >
> >
> > :- func groovy_cmp(int, int) = comparison_result.
> >
> > groovy_cmp(X, Y) = R :-
> >     compare(R, X /\ 0xff, Y /\ 0xff).
>
>
> I'm not quite sure what's happening in that code,
> but it looks a bit dodgy. Some more comments might help.
Yes - here I'm doing some bit-packing. The transition function for each
state is represented as an array of {Char, NextState} pairs sorted by
Char so we can use binary search to do fast lookups (*).
To save a few bytes here and there, the {Char, NextState} pairs are
encoded as ints where the top 24 bits record NextState and the low
8 bits record Char. More commenting and some sensible packing/unpacking
functions (roughly as sketched below) will be going in there instead of
the in-line bit magic.
(*) Since most states seem to have very few transitions (transitions to
the error state are not recorded) it may well be more efficient in general
to just do a linear scan rather than a binary search. The most common
counter-example to this is the `dot' regexp which matches anything except
newline.
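The packing/unpacking helpers will be something along these lines
(names are provisional):

        % A transition is encoded in a single int: the low 8 bits hold
        % the char, the remaining high bits hold the next state number.
        %
    :- func pack_transition(int, char) = int.
    :- func packed_char(int) = int.
    :- func packed_next_state(int) = int.

    pack_transition(NextState, Char) = (NextState << 8) \/ char__to_int(Char).

    packed_char(Packed) = Packed /\ 0xff.

    packed_next_state(Packed) = Packed >> 8.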
> > % Converts basic regular expressions into finite non-deterministic
> > % automata.
>
> I suggest adding `(FNA)' at the end here, to make it clear
> what you mean later on when you use the acronym FNA.
Will do.
> But back when I was doing second year, I'm pretty sure we used to call
> those things `NFAs' (non-deterministic finite automata) rather than
> `FNAs' (finite non-deterministic automata). Is your use of the term
> `FNA' following some textbook? Do you know which term is more
> commonly used?
FNA and FDA were the acronyms used in the course I studied, but I think
NFA and DFA are more popular outside Cambridge. I'll change over to
the latter.
> Acronyms should be uppercase: s/regexp_to_fna/regexp_to_FNA/
Hmm. I have a function `fna_to_fda' which would need renaming to
something like `convert_FNA_to_FDA'. Okay, I'll do that.
> > % Some useful single-char regexps.
> >
> > :- pragma memo(digit/0).
> > :- pragma memo(lower/0).
> > :- pragma memo(upper/0).
> > :- pragma memo(wspc/0).
> > :- pragma memo(dot/0).
> >
> > digit = any("0123456789").
> > lower = any("abcdefghijklmnopqrstuvwxyz").
> > upper = any("ABCDEFGHIJKLMNOPQRSTUVWXYZ").
> > wspc = any(" \t\n\r\f\v").
> > dot = anybut("\n").
> > alpha = (lower \/ upper).
> > alphanum = (alpha \/ digit).
> > identstart = (alpha \/ atom('_')).
> > ident = (alphanum \/ atom('_')).
> > nl = atom('\n').
> > tab = atom('\t').
> > spc = atom(' ').
>
> I think it would be helpful to document why you chose to use
> `pragma memo' for some of these but not all of them.
Will do. The idea here is that these things are (a) common and
(b) take *some* time to compute and (c) are constants, hence it's
worth telling the compiler to memo them. Are these redundant? Will
the compiler fully evaluate constant functions?
> Apart from those issues, this looks good, although I haven't
> really examined the interface design in much depth.
> But I'd be happy for you to add this to the extras now,
> once the issues above are addressed.
Okay, I'll get the above fixed.
--
Ralph Becket | MSR Cambridge | rbeck at microsoft.com
--------------------------------------------------------------------------
mercury-developers mailing list
Post messages to: mercury-developers at cs.mu.oz.au
Administrative Queries: owner-mercury-developers at cs.mu.oz.au
Subscriptions: mercury-developers-request at cs.mu.oz.au
--------------------------------------------------------------------------