[mercury-users] a question on efficient parsing of a file

Vladimir Gubarkov xonixx at gmail.com
Tue May 18 20:40:06 AEST 2010


Thank you, Ralph, for your kind reply.

It seems there is at least one problem with this approach. The whole file
is one huge line: it contains no '\n' (the floats are separated by a single
space character). If you are interested in a more formal description of the
problem, you can refer to:

http://www.rsdn.ru/forum/decl/2848651.flat.aspx (Russian)
http://translate.google.ru/translate?js=y&prev=_t&hl=ru&ie=UTF-8&layout=1&eotf=1&u=http://www.rsdn.ru/forum/decl/2848651.flat.aspx&sl=ru&tl=en
 (Google-translated)

So I believe your solution would come close to reading the whole file into
memory, since io.read_line_as_string would return the entire file as a
single line.
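
To make the single-line concern concrete, here is a minimal sketch (not a
tuned implementation) of summing the floats without holding the whole file
in memory. It reads standard input one character at a time with
io.read_char and converts each whitespace-separated token with
string.to_float. The module name sum_floats, the outcome type and the
helper names sum_tokens and add_token are made up for illustration, and it
assumes '.' as the decimal separator. Whether reading char by char is fast
enough is a separate question.

```mercury
:- module sum_floats.
:- interface.
:- import_module io.

:- pred main(io::di, io::uo) is det.

:- implementation.
:- import_module char, float, list, string.

    % Hypothetical result type for this sketch.
:- type outcome
    --->    total(float)
    ;       failed(string).

main(!IO) :-
    sum_tokens([], 0.0, Outcome, !IO),
    (
        Outcome = total(Sum),
        io.write_float(Sum, !IO),
        io.nl(!IO)
    ;
        Outcome = failed(Msg),
        io.write_string("error: " ++ Msg ++ "\n", !IO)
    ).

    % RevChars holds the characters of the current token in reverse order,
    % so only one token is ever kept in memory.
:- pred sum_tokens(list(char)::in, float::in, outcome::out,
    io::di, io::uo) is det.

sum_tokens(RevChars, Sum0, Outcome, !IO) :-
    io.read_char(Result, !IO),
    (
        Result = ok(Char),
        ( if char.is_whitespace(Char) then
            % A separator ends the current token (if there is one).
            ( if add_token(RevChars, Sum0, Sum1) then
                sum_tokens([], Sum1, Outcome, !IO)
            else
                Outcome = failed("bad number: " ++
                    string.from_rev_char_list(RevChars))
            )
        else
            sum_tokens([Char | RevChars], Sum0, Outcome, !IO)
        )
    ;
        Result = eof,
        % The last token may not be followed by a separator.
        ( if add_token(RevChars, Sum0, Sum1) then
            Outcome = total(Sum1)
        else
            Outcome = failed("bad number: " ++
                string.from_rev_char_list(RevChars))
        )
    ;
        Result = error(Error),
        Outcome = failed(io.error_message(Error))
    ).

    % Adds the token to the running sum; an empty token (for example
    % between two consecutive spaces) leaves the sum unchanged.
:- pred add_token(list(char)::in, float::in, float::out) is semidet.

add_token(RevChars, Sum0, Sum) :-
    (
        RevChars = [],
        Sum = Sum0
    ;
        RevChars = [_ | _],
        string.to_float(string.from_rev_char_list(RevChars), X),
        Sum = Sum0 + X
    ).
```

This would be run with the file redirected to standard input, e.g.
./sum_floats < numbers_large.txt.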

Also, I don't like parsing_utils for several reasons.

First, it's not flexible. Say I want to parse not floats but hex numbers,
or something else.
Second, its stability is stated as low.
And third, it operates on a string and a current position, and my
experiments seem to show that this is not as fast as operating on a list of
chars.
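
For illustration, this is roughly what I mean by a DCG over a list of
chars, here for hex numbers. It is only a sketch: the module name hex_dcg
and the predicates hex_number, hex_digits, hex_digit and hex_value are
names invented for this example, and the sample input string in main is
arbitrary.

```mercury
:- module hex_dcg.
:- interface.
:- import_module io.

:- pred main(io::di, io::uo) is det.

:- implementation.
:- import_module char, int, list, string.

main(!IO) :-
    Chars = string.to_char_list("1aF rest"),
    ( if hex_number(N, Chars, Rest) then
        io.write_int(N, !IO),
        io.write_string(" then: ", !IO),
        io.write_string(string.from_char_list(Rest), !IO),
        io.nl(!IO)
    else
        io.write_string("no hex number\n", !IO)
    ).

    % One or more hex digits, converted to an int.
:- pred hex_number(int::out, list(char)::in, list(char)::out) is semidet.

hex_number(N) -->
    hex_digit(D),
    hex_digits(D, N).

    % Consumes as many further hex digits as possible, accumulating the
    % value in Acc.
:- pred hex_digits(int::in, int::out, list(char)::in, list(char)::out)
    is det.

hex_digits(Acc, N) -->
    ( if hex_digit(D) then
        hex_digits(Acc * 16 + D, N)
    else
        { N = Acc }
    ).

:- pred hex_digit(int::out, list(char)::in, list(char)::out) is semidet.

hex_digit(D) -->
    [C],
    { hex_value(C, D) }.

    % Maps '0'-'9', 'a'-'f' and 'A'-'F' to 0..15; fails otherwise.
:- pred hex_value(char::in, int::out) is semidet.

hex_value(C, D) :-
    ( if char.is_digit(C) then
        D = char.to_int(C) - char.to_int('0')
    else
        L = char.to_lower(C),
        char.to_int(L) >= char.to_int('a'),
        char.to_int(L) =< char.to_int('f'),
        D = char.to_int(L) - char.to_int('a') + 10
    ).
```

Swapping hex_digit for another character-level rule is all it takes to
parse a different kind of token, which is the flexibility I am after.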

(By the way, it is interesting that Visual Prolog is very fast when
operating on strings, namely with its predicates front, frontChar, etc., as
if strings were represented internally as lists of chars(?). Sadly, the
Mercury analogue string.first_char (and others) creates a copy of Rest and
relies on the GC, which results in poor performance and large memory usage.
But even the parse_state(position, string) approach performs much more
slowly than a DCG over a list of chars or Visual Prolog strings.)


On Tue, May 18, 2010 at 3:52 AM, Ralph Becket <rafe at csse.unimelb.edu.au> wrote:

> Hi Vladimir,
>
> I'd recommend doing something like this: reading your file one line at a
> time with io.read_line_as_string then using the parsing_utils library to
> extract the floats from each line:
>
>    io.read_line_as_string(Result, !IO),
>    (
>        Result = ok(String),
>        some [!PS] (
>            parsing_utils.new_src_and_ps(String, Src, !:PS),
>            ( if
>                parsing_utils.zero_or_more(parsing_utils.float_literal,
>                    Src, Xs, !PS),
>                parsing_utils.eof(Src, _, !PS)
>              then
>                ... do something with Xs (a list of floats) ...
>              else
>                ... report a syntax error ...
>            )
>        )
>    ;
>        Result = eof
>    ;
>        Result = error(ErrorCode),
>        ... report the IO error ...
>    )
>
> Hope this helps!
> -- Ralph
>
> Vladimir Gubarkov, Monday, 17 May 2010:
> >
> >    Hi,
> >
> >    Imagine I have a long enough (to fit in memory) text file with regular
> >    data, say, a lot of float numbers, divided by space.
> >
> >    Now I want to parse those to find the sum of all numbers. In Prolog
> >    (namely, SWI) I used the 'phrase_from_file' predicate, which allows
> >    parsing a file with a DCG lazily (no need to read the whole file into
> >    memory). In case it's of interest, I used the following code:
> >
> >    :- set_prolog_flag(float_format,'%.15g').
> >    integer(I) -->
> >            digit(D0),
> >            digits(D),
> >            { number_chars(I, [D0|D])
> >            }.
> >    digits([D|T]) -->
> >            digit(D), !,
> >            digits(T).
> >    digits([]) -->
> >            [].
> >    digit(D) -->
> >            [D],
> >            { code_type(D, digit)
> >            }.
> >    float(F) -->
> >        ( "-", {Sign = -1}
> >        ; "", {Sign = 1}
> >        ), !,
> >        integer(N),
> >        ",",
> >        integer(D),
> >        {F is Sign * (N + D / 10^(ceiling(log10(D))))
> >        }.
> >    sum(S, Total) -->
> >        float(F1), !,
> >        " ",
> >        { S1 is S + F1},
> >        sum(S1, Total).
> >    sum(Total, Total) -->
> >        [].
> >    go1 :-
> >        phrase_from_file(sum(0, S), 'numbers_large.txt',
> >                         [buffer_size(16384)]),
> >        writeln(S).
> >
> >    Now, as an exercise in Mercury, I would like to write the Mercury
> >    analogue. If I understand correctly, there is no direct analogue of
> >    'phrase_from_file' in Mercury, am I right?
> >
> >    So, I decided to fake this by constructing some type like:
> >
> >    :- type parse_state ---> state(buffer_size, buffer, io.state).
> >    :- type buffer_size == int.
> >    :- type buffer == list(char).
> >
> >    and pass this around to predicates like
> >
> >    :- pred some_dcg_pred(some_term::out, parse_state::in,
> >    parse_state::out) is semidet.
> >
> >    My thought was that I would take chars from 'buffer', and if it's
> >    empty -> read 'buffer_size' chars from io.state.
> >
> >    But! It seems that the io library provides no support for buffered
> >    reading. And without that, I guess, it will be rather slow (reading 1
> >    char at a time). Interestingly, I looked inside the source of the io
> >    module, and it uses buffered reading internally, but those predicates
> >    are not exported in its interface.
> >
> >    Dear sirs, what would you recommend for writing an efficient (as well
> >    as elegant) analogue of the Prolog code?
> >
> >    Sincerely yours,
> >
> >    Vladimir.