[m-rev.] for review: read_whole_file_as_{string, lines}, split_into_lines

Peter Wang novalazy at gmail.com
Thu Apr 15 12:56:42 AEST 2021


On Thu, 15 Apr 2021 02:42:09 +1000 "Zoltan Somogyi" <zoltan.somogyi at runbox.com> wrote:
> 
> I can see several ways we could implement read_named_file_as_lines.
> 
> Approach 1: the one in this diff, i.e. read_named_file_as_string, followed
> by split_into_lines. The downside, as you point out, is the need
> for N+1 memory allocations for N lines, totalling a bit more than double
> the size of the file (the extra is due to the need to round up the
> allocations for the lines to Boehm alloc size). We also allocate 2N cons
> cells, only N of which end up in the result.
> 
> Approach 2: as you say, implement a variant of read_line_as_string
> that does not include the newline, and then call it repeatedly.
> The downsides of this approach are the need to pay the cost of
> crossing between Mercury and the target language once per line
> (e.g. to create and then test a read_line_as_string_result),
> and possibly having to use (and then not being able to *re*use)
> a malloc'd block as a line buffer, which requires extra copying.
> 
> Approach 3 could be doing approach 1 but using foreign code.
> The foreign code could allocate space for the whole file using malloc,
> not Boehm, which would then allow it to free this block when done.
> It could then call Boehm to allocate space for each line as it finds
> the newline chars, and add the result to the end of a list, without
> having to reverse any list.
> 
> Approach 4 is the same as approach 3, but instead of calling
> stat on the file, mallocing a block of the required size and reading
> the file into it, it would just mmap the file. This would be faster,
> but maybe less portable. (With either approach, we will have to
> document the fact that read_named_file_as_X can handle only
> ordinary files, not special files such as ttys or named pipes,
> since stat cannot give a meaningful size for those.)
> 
> I believe approach 4 is likely to yield the fastest result, and thus
> should be the one implemented. Would you argue for approach 2?
> And does anyone have an even better approach?

I would prefer that we not encourage hidden limitations in programs.
You could, however, implement the mmap approach and fall back to the
current implementation when the file cannot be mapped.
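
A minimal sketch of the mmap-with-fallback idea in C, assuming a POSIX
system. The names file_contents, slurp_fd, load_file and release_file are
hypothetical and not part of the Mercury runtime, and the fallback here is
a plain read loop standing in for the current implementation:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical holder for file data; is_mapped says how to release it. */
typedef struct {
    char   *data;
    size_t  len;
    int     is_mapped;
} file_contents;

/* Fallback: read until EOF into a growing malloc'd buffer.
   Works for pipes and ttys, where stat gives no meaningful size. */
static int slurp_fd(int fd, file_contents *out)
{
    size_t cap = 4096, len = 0;
    char *buf = malloc(cap);
    if (buf == NULL) return -1;
    for (;;) {
        if (len == cap) {
            char *bigger = realloc(buf, cap * 2);
            if (bigger == NULL) { free(buf); return -1; }
            buf = bigger;
            cap *= 2;
        }
        ssize_t n = read(fd, buf + len, cap - len);
        if (n < 0) { free(buf); return -1; }
        if (n == 0) break;              /* end of stream */
        len += (size_t) n;
    }
    out->data = buf;
    out->len = len;
    out->is_mapped = 0;
    return 0;
}

/* Try mmap for non-empty regular files; otherwise use the read loop. */
int load_file(const char *path, file_contents *out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    /* A zero-length mapping is an error, hence the st_size > 0 test. */
    if (fstat(fd, &st) == 0 && S_ISREG(st.st_mode) && st.st_size > 0) {
        void *p = mmap(NULL, (size_t) st.st_size, PROT_READ,
                       MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            out->data = p;
            out->len = (size_t) st.st_size;
            out->is_mapped = 1;
            close(fd);                  /* the mapping outlives the fd */
            return 0;
        }
    }
    int rc = slurp_fd(fd, out);
    close(fd);
    return rc;
}

void release_file(file_contents *fc)
{
    assert(fc != NULL);
    if (fc->is_mapped) munmap(fc->data, fc->len);
    else free(fc->data);
}
```

The caller sees the same interface either way, so the restriction to
ordinary files would not need to be documented.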

For a portable but possibly better fallback, we could implement the inner
loop of read_named_file_as_lines so that it reuses one temporary buffer
across the entire stream. This would likely be entirely foreign code.
Unfortunately, little code would be shared with read_line_as_string,
since that predicate must not read past a newline.
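
Roughly, such an inner loop might look like the following C sketch. The
names for_each_line, line_fn, line_stats and count_line are hypothetical.
The callback stands in for whatever would copy each line into a
garbage-collected Mercury string and append it to the result list. The
stream is read in fixed-size chunks, and a single scratch buffer is reused
for every line that straddles a chunk boundary:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Called once per line; the line is NOT NUL-terminated and has no '\n'. */
typedef void (*line_fn)(const char *line, size_t len, void *ctx);

/* Append n bytes to the reusable scratch buffer, growing it if needed. */
static int append(char **buf, size_t *len, size_t *cap,
                  const char *src, size_t n)
{
    if (n == 0) return 0;
    if (*len + n > *cap) {
        size_t new_cap = *cap ? *cap : 128;
        while (new_cap < *len + n) new_cap *= 2;
        char *p = realloc(*buf, new_cap);
        if (p == NULL) return -1;
        *buf = p;
        *cap = new_cap;
    }
    memcpy(*buf + *len, src, n);
    *len += n;
    return 0;
}

int for_each_line(FILE *fp, line_fn emit, void *ctx)
{
    assert(fp != NULL && emit != NULL);
    char chunk[4096];
    char *line = NULL;          /* scratch buffer, reused across lines */
    size_t len = 0, cap = 0;
    size_t n;

    while ((n = fread(chunk, 1, sizeof chunk, fp)) > 0) {
        size_t start = 0;
        for (size_t i = 0; i < n; i++) {
            if (chunk[i] != '\n') continue;
            if (len == 0) {
                /* Line lies wholly inside this chunk: no extra copy. */
                emit(chunk + start, i - start, ctx);
            } else {
                if (append(&line, &len, &cap, chunk + start, i - start) < 0)
                    goto fail;
                emit(line, len, ctx);
                len = 0;        /* keep the allocation for the next line */
            }
            start = i + 1;
        }
        /* Carry the unfinished tail of this chunk over to the next read. */
        if (append(&line, &len, &cap, chunk + start, n - start) < 0)
            goto fail;
    }
    if (ferror(fp)) goto fail;
    if (len > 0) emit(line, len, ctx);  /* last line without a newline */
    free(line);
    return 0;

fail:
    free(line);
    return -1;
}

/* Example consumer: counts lines and their total length. */
typedef struct { size_t lines; size_t bytes; } line_stats;

void count_line(const char *line, size_t len, void *ctx)
{
    (void) line;
    line_stats *st = ctx;
    st->lines++;
    st->bytes += len;
}
```

Because the loop reads ahead in whole chunks, it cannot double as
read_line_as_string, which is the lack of sharing mentioned above.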

> By the way, a version of read_line_as_string that does not add
> the final newline is likely to be useful in its own right, even if
> we decide against approach 2. Possible names: read_line_as_string_nonl,
> and read_line_as_chomped_string (as in perl's chomp function).
> Any other name proposals, or disagreements with the desirability
> of adding such a predicate?

_nonl is okay, or perhaps an overloaded predicate that takes the
line-ending handling as an argument? It could then also have an option to
drop CRLF line terminators if found. There is a danger of trying to do
too much, though.
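
To illustrate the line-ending argument, here is a small C sketch of the
trimming step only. The enum line_ending_mode and the function
chomped_length are made-up names for illustration, not a proposed
interface:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical options for how a trailing line terminator is handled. */
typedef enum {
    KEEP_NEWLINE,       /* return the line unchanged */
    STRIP_LF,           /* drop one trailing "\n" */
    STRIP_LF_OR_CRLF    /* drop a trailing "\n" or "\r\n" */
} line_ending_mode;

/* Returns the length of the line once the terminator has been dropped
   according to mode. The line itself is not modified. */
size_t chomped_length(const char *line, size_t len, line_ending_mode mode)
{
    assert(line != NULL || len == 0);
    if (mode == KEEP_NEWLINE || len == 0 || line[len - 1] != '\n')
        return len;
    len--;                              /* drop the "\n" */
    if (mode == STRIP_LF_OR_CRLF && len > 0 && line[len - 1] == '\r')
        len--;                          /* drop the "\r" of a CRLF pair */
    return len;
}
```

A bare "\r" with no "\n" after it is left alone in every mode, which is
one way of keeping the option from doing too much.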

Peter

