[mercury-users] a question on efficient parsing of a file

Ralph Becket rafe at csse.unimelb.edu.au
Wed May 19 12:16:13 AEST 2010


Vladimir Gubarkov, Tuesday, 18 May 2010:
>    Thank you, Ralph for your kind reply.
> 
>    It seems there is at least one problem with this approach. The thing is
>    that a whole file is one huge string, as it doesn't contain '\n'
>    (floats are separated by " " space char). If you are interested in more
>    formal description of a problem you could refer
> 
>    [1]http://www.rsdn.ru/forum/decl/2848651.flat.aspx (russian)
> 
>    [2]http://translate.google.ru/translate?js=y&prev=_t&hl=ru&ie=UTF-8&lay
>    out=1&eotf=1&u=http://www.rsdn.ru/forum/decl/2848651.flat.aspx&sl=ru&tl
>    =en (google-translated)
> 
>    So I believe your solution would be close to reading whole file to
>    memory.

Ah, righto.

In that case, why not do something like this (warning: untested code):


:- pred read_non_whitespace_word(list(char)::out, io::di, io::uo) is det.

read_non_whitespace_word(Cs, !IO) :-
    consume_whitespace(!IO),    % Skip over initial whitespace.
    io.read_char(Result, !IO),
    (
        Result = eof,
        Cs = []
    ;
        Result = ok(C),
        ( if char.is_whitespace(C) then
            Cs = []
          else
            read_non_whitespace_word(Cs0, !IO),
            Cs = [C | Cs0]
        )
    ;
        Result = error(Error),
        throw(Error)
    ).

:- pred consume_whitespace(io::di, io::uo) is det.

consume_whitespace(!IO) :-
    io.read_char(Result, !IO),
    (
        Result = eof
    ;
        Result = ok(C),
        ( if char.is_whitespace(C) then
            consume_whitespace(!IO)
          else
            io.putback_char(C, !IO)
        )
    ;
        Result = error(Error),
        throw(Error)
    ).


Now you can use read_non_whitespace_word and a DCG parser on each
word as it is scanned.

>    Also, I don't like parsing_utils for several reasons.
> 
>    First - it's not flexible. Say, I want to parse not floats, but hex
>    nubers or smth. else.

It is trivial to implement your own parsers for these things using
parsing_utils; that was the motivation for the library.

For example,


:- pred hex_literal(src::in, int::out, ps::in, ps::out) is semidet.

hex_literal(Src, N, !PS) :-
    hex_literal_2(Src, 0, N, !PS).

:- pred hex_literal_2(src::in, int::in, int::out, ps::in, ps::out) is semidet.

hex_literal_2(Src, N0, N, !PS) :-
    ( if
        next_char(Src, C, !PS),
        char.is_hex_digit(C, D)
      then
        hex_literal_2(Src, 16 * N0 + D, N, !PS)
      else
        N = N0
    ).


If you're less interested in efficiency, you could write it slightly more
succinctly like this:

hex_literal(Src, N, !PS) :-
    one_or_more(hex_digit, Src, Ds, !PS),
    N = list.foldr(func(N0, D) = 16 * N0 + D, Ds, 0).

hex_digit(Src, D, !PS) :-
    next_char(Src, C, !PS),
    char.is_hex_digit(C, D).


parsing_utils does require you to provide data as a string.  In most
applications this is quite reasonable: read in the entire file as a
single string then parse it.  In situations where this is not sensible,
you have to do more work, as I did at the top of this reply.


>    Second - its stability stated as low..

That should probably be changed now!  We're pretty happy with the
functionality an interface in parsing_utils.

>    And third - it operates over string and current position, and my
>    experiments seem to show that this way is not as fast as operating on
>    list of chars.

I'm very surprised to hear that.  parsing_utils reads chars directly
from the string source, which is just a bounds test and a memory
reference (see string.index).  List creation and manipulation is
typically much slower.

>    (Btw, interesting that Visual Prolog is very fast when operating on
>    strings, namely their predicates front, frontChar, etc.. as if
>    internally they represent string as list of chars(?). Sad, but Mercury
>    analog string.first_char (and others) creating a copy of Rest in favour
>    of GC, thus resulting in poor performance and large mem usage. But even
>    using approach parse_state(position, string) performs way slowly then
>    dcg over list of chars or Visual Prolog strings)

That's a very good reason not to use string.first_char for parsing!

HTH,
-- Ralph
--------------------------------------------------------------------------
mercury-users mailing list
Post messages to:       mercury-users at csse.unimelb.edu.au
Administrative Queries: owner-mercury-users at csse.unimelb.edu.au
Subscriptions:          mercury-users-request at csse.unimelb.edu.au
--------------------------------------------------------------------------



More information about the users mailing list