[m-rev.] for review: read_whole_file_as_{string, lines}, split_into_lines

Zoltan Somogyi zoltan.somogyi at runbox.com
Thu Apr 15 02:42:09 AEST 2021


2021-04-14 12:12 GMT+10:00 "Peter Wang" <novalazy at gmail.com>:
>> +read_whole_file_as_lines(FileName, Result, !IO) :-
>> +    io.open_input(FileName, OpenResult, !IO),
>> +    (
>> +        OpenResult = ok(FileStream),
>> +        io.read_file_as_string(FileStream, ReadResult, !IO),
> 
> Ideally, this predicate would not create one big string first.
> It would require a variation on io.read_line_as_string that does not
> include the terminating newline in its result.

I followed all your suggestions except this one, since I think that this
requires more discussion.

I can see several ways we could implement read_named_file_as_lines.

Approach 1: the one in this diff, i.e. read_named_file_as_string, followed
by split_into_lines. The downside, as you point out, is the need
for N+1 memory allocations for N lines, totalling a bit more than double
the size of the file (the extra is due to the need to round up the
allocations for the lines to Boehm alloc size). We also allocate 2N cons
cells, only N of which end up in the result.

Approach 2: as you say, implement a variant of read_line_as_string
that does not include the newline, and then call it repeatedly.
The downsides of this approach are the need to pay the cost of
crossing between Mercury and the target language once per line
(e.g. to create and then test a read_line_as_string_result),
and possibly having to use (and then not being able to *re*use)
a malloc'd block as a line buffer, which requires extra copying.

Approach 3: do what approach 1 does, but in foreign code.
The foreign code could allocate space for the whole file using malloc,
not Boehm, which would then allow it to free this block when done.
It could then call Boehm to allocate space for each line as it finds
the newline chars, and add the result to the end of a list, without
having to reverse any list.
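
A rough sketch of what approach 3 might look like in C, under some
assumptions: line_cell, alloc_line and read_file_as_lines are hypothetical
names, plain malloc stands in for the Boehm allocations, and a C linked
list stands in for the Mercury list. It only illustrates the shape of the
idea (one malloc'd block for the file, one allocation per line, append at
the tail), not how the library would actually do it.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical list cell; real code would build a Mercury list(string). */
typedef struct line_cell {
    char *line;
    struct line_cell *next;
} line_cell;

/* Stand-in for a Boehm allocation; an assumption made for this sketch. */
static void *alloc_line(size_t n) { return malloc(n); }

/*
** Split buf[0..len) at newlines, appending each line at the tail of the
** list, so no reversal is needed. A final line without a terminating
** newline is still returned; a trailing newline adds no empty line.
*/
static line_cell *split_buffer_into_lines(const char *buf, size_t len)
{
    line_cell *head = NULL;
    line_cell **tail = &head;
    size_t start = 0;

    assert(buf != NULL);
    while (start < len) {
        const char *nl = memchr(buf + start, '\n', len - start);
        size_t end = (nl != NULL) ? (size_t) (nl - buf) : len;
        size_t n = end - start;
        line_cell *cell = alloc_line(sizeof *cell);
        char *copy = alloc_line(n + 1);

        if (cell == NULL || copy == NULL) {
            abort();
        }
        memcpy(copy, buf + start, n);
        copy[n] = '\0';
        cell->line = copy;
        cell->next = NULL;
        *tail = cell;
        tail = &cell->next;
        start = end + 1;
    }
    return head;
}

/* Returns 0 on success, -1 on failure. Handles ordinary files only. */
static int read_file_as_lines(const char *path, line_cell **out)
{
    struct stat st;
    FILE *fp;
    char *buf;
    size_t len, got;

    if (stat(path, &st) != 0 || !S_ISREG(st.st_mode)) {
        return -1;
    }
    len = (size_t) st.st_size;
    fp = fopen(path, "rb");
    if (fp == NULL) {
        return -1;
    }
    buf = malloc(len > 0 ? len : 1);    /* temporary whole-file block */
    if (buf == NULL) {
        fclose(fp);
        return -1;
    }
    got = fread(buf, 1, len, fp);
    fclose(fp);
    *out = split_buffer_into_lines(buf, got);
    free(buf);                          /* freed as soon as we are done */
    return 0;
}
```

The point of the temporary block being malloc'd is the free(buf) at the
end: the whole-file copy never becomes garbage for the collector to find.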

Approach 4 is the same as approach 3, but instead of calling
stat on the file, mallocing a block of the required size and reading
the file into it, it would just mmap the file. This would be faster,
but maybe less portable. (With either approach 3 or 4, we will have to
document the fact that read_named_file_as_X can handle only
ordinary files, not special files such as ttys or named pipes,
since stat cannot give a meaningful size for those.)
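
For comparison, a minimal POSIX sketch of the mmap variant. The names
for_each_line_mmap, line_fn, line_stats and tally_line are hypothetical,
and a callback stands in for "allocate the line and append it to the
list". It assumes a platform with mmap, which is the portability concern
mentioned above.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical callback type: receives one line, without its newline. */
typedef void (*line_fn)(const char *line, size_t len, void *ctx);

/* Example callback state: counts lines and their total length. */
typedef struct {
    size_t lines;
    size_t bytes;
} line_stats;

static void tally_line(const char *line, size_t len, void *ctx)
{
    line_stats *stats = ctx;

    (void) line;
    stats->lines++;
    stats->bytes += len;
}

/* Returns 0 on success, -1 on failure. Handles ordinary files only. */
static int for_each_line_mmap(const char *path, line_fn fn, void *ctx)
{
    struct stat st;
    size_t len, start = 0;
    char *map;
    int fd;

    assert(fn != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0) {
        return -1;
    }
    if (fstat(fd, &st) != 0 || !S_ISREG(st.st_mode)) {
        close(fd);
        return -1;
    }
    len = (size_t) st.st_size;
    if (len == 0) {
        /* mmap rejects zero-length mappings; an empty file has no lines. */
        close(fd);
        return 0;
    }
    map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);          /* the mapping stays valid after close */
    if (map == MAP_FAILED) {
        return -1;
    }
    while (start < len) {
        const char *nl = memchr(map + start, '\n', len - start);
        size_t end = (nl != NULL) ? (size_t) (nl - map) : len;

        fn(map + start, end - start, ctx);
        start = end + 1;
    }
    munmap(map, len);
    return 0;
}
```

Compared with the malloc-and-read version, this skips the copy of the
file contents into a user-space buffer; the only copying left is whatever
the callback does for each line.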

I believe approach 4 is likely to yield the fastest result, and thus
should be the one implemented. Would you argue for approach 2?
And does anyone have an even better approach?

By the way, a version of read_line_as_string that does not add
the final newline is likely to be useful in its own right, even if
we decide against approach 2. Possible names: read_line_as_string_nonl,
and read_line_as_chomped_string (as in perl's chomp function).
Any other name proposals, or disagreements with the desirability
of adding such a predicate?
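
To make the intended behaviour concrete, here is a small C sketch of a
"chomped" line read. The name read_line_chomped and its interface are
hypothetical, malloc stands in for whatever allocator the real predicate
would use, and this is not how io.read_line_as_string is implemented.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
** Read one line from fp and return it without its terminating newline.
** Returns NULL at end of file (and on allocation failure; a real version
** would need to tell these apart, and to report read errors).
** The caller owns the returned string.
*/
static char *read_line_chomped(FILE *fp)
{
    size_t cap = 64;
    size_t len = 0;
    char *buf;
    int c;

    assert(fp != NULL);
    c = getc(fp);
    if (c == EOF) {
        return NULL;
    }
    buf = malloc(cap);
    if (buf == NULL) {
        return NULL;
    }
    while (c != EOF && c != '\n') {
        if (len + 1 >= cap) {
            /* Keep room for this char plus the final NUL. */
            char *bigger = realloc(buf, cap * 2);

            if (bigger == NULL) {
                free(buf);
                return NULL;
            }
            buf = bigger;
            cap *= 2;
        }
        buf[len++] = (char) c;
        c = getc(fp);
    }
    buf[len] = '\0';    /* the newline, if any, is consumed but not stored */
    return buf;
}
```

An empty line comes back as an empty string, and a last line with no
newline comes back unchanged, so the result alone does not say whether
the final line was newline-terminated.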

Zoltan.

