[mercury-users] a question on efficient parsing of a file
Ralph Becket
rafe at csse.unimelb.edu.au
Wed May 19 12:16:13 AEST 2010
Vladimir Gubarkov, Tuesday, 18 May 2010:
> Thank you, Ralph for your kind reply.
>
> It seems there is at least one problem with this approach. The thing is
> that a whole file is one huge string, as it doesn't contain '\n'
> (floats are separated by " " space char). If you are interested in more
> formal description of a problem you could refer
>
> [1]http://www.rsdn.ru/forum/decl/2848651.flat.aspx (russian)
>
> [2]http://translate.google.ru/translate?js=y&prev=_t&hl=ru&ie=UTF-8&lay
> out=1&eotf=1&u=http://www.rsdn.ru/forum/decl/2848651.flat.aspx&sl=ru&tl
> =en (google-translated)
>
> So I believe your solution would be close to reading whole file to
> memory.
Ah, righto.
In that case, why not do something like this (warning: untested code):
:- pred read_non_whitespace_word(list(char)::out, io::di, io::uo) is det.
read_non_whitespace_word(Cs, !IO) :-
consume_whitespace(!IO), % Skip over initial whitespace.
io.read_char(Result, !IO),
(
Result = eof,
Cs = []
;
Result = ok(C),
( if char.is_whitespace(C) then
Cs = []
else
read_non_whitespace_word(Cs0, !IO),
Cs = [C | Cs0]
)
;
Result = error(Error),
throw(Error)
).
:- pred consume_whitespace(io::di, io::uo) is det.
consume_whitespace(!IO) :-
io.read_char(Result, !IO),
(
Result = eof
;
Result = ok(C),
( if char.is_whitespace(C) then
consume_whitespace(!IO)
else
io.putback_char(C, !IO)
)
;
Result = error(Error),
throw(Error)
).
Now you can use read_non_whitespace_word and a DCG parser on each
word as it is scanned.
> Also, I don't like parsing_utils for several reasons.
>
> First - it's not flexible. Say, I want to parse not floats, but hex
> nubers or smth. else.
It is trivial to implement your own parsers for these things using
parsing_utils; that was the motivation for the library.
For example,
:- pred hex_literal(src::in, int::out, ps::in, ps::out) is semidet.
hex_literal(Src, N, !PS) :-
hex_literal_2(Src, 0, N, !PS).
:- pred hex_literal_2(src::in, int::in, int::out, ps::in, ps::out) is semidet.
hex_literal_2(Src, N0, N, !PS) :-
( if
next_char(Src, C, !PS),
char.is_hex_digit(C, D)
then
hex_literal_2(Src, 16 * N0 + D, N, !PS)
else
N = N0
).
If you're less interested in efficiency, you could write it slightly more
succinctly like this:
hex_literal(Src, N, !PS) :-
one_or_more(hex_digit, Src, Ds, !PS),
N = list.foldr(func(N0, D) = 16 * N0 + D, Ds, 0).
hex_digit(Src, D, !PS) :-
next_char(Src, C, !PS),
char.is_hex_digit(C, D).
parsing_utils does require you to provide data as a string. In most
applications this is quite reasonable: read in the entire file as a
single string then parse it. In situations where this is not sensible,
you have to do more work, as I did at the top of this reply.
> Second - its stability stated as low..
That should probably be changed now! We're pretty happy with the
functionality an interface in parsing_utils.
> And third - it operates over string and current position, and my
> experiments seem to show that this way is not as fast as operating on
> list of chars.
I'm very surprised to hear that. parsing_utils reads chars directly
from the string source, which is just a bounds test and a memory
reference (see string.index). List creation and manipulation is
typically much slower.
> (Btw, interesting that Visual Prolog is very fast when operating on
> strings, namely their predicates front, frontChar, etc.. as if
> internally they represent string as list of chars(?). Sad, but Mercury
> analog string.first_char (and others) creating a copy of Rest in favour
> of GC, thus resulting in poor performance and large mem usage. But even
> using approach parse_state(position, string) performs way slowly then
> dcg over list of chars or Visual Prolog strings)
That's a very good reason not to use string.first_char for parsing!
HTH,
-- Ralph
--------------------------------------------------------------------------
mercury-users mailing list
Post messages to: mercury-users at csse.unimelb.edu.au
Administrative Queries: owner-mercury-users at csse.unimelb.edu.au
Subscriptions: mercury-users-request at csse.unimelb.edu.au
--------------------------------------------------------------------------
More information about the users
mailing list