[m-rev.] for review: read_whole_file_as_{string, lines}, split_into_lines

Julien Fischer jfischer at opturion.com
Thu Apr 15 14:05:31 AEST 2021


On Thu, 15 Apr 2021, Zoltan Somogyi wrote:

> Approach 2: as you say, implement a variant of read_line_as_string
> that does not include the newline, and then call it repeatedly.
> The downsides of this approach are the need to pay the cost of
> crossing between Mercury and the target language once per line
> (e.g. to create and then test a read_line_as_string_result),
> and possibly having to use (and then not being able to *re*use)
> a malloc'd block as a line buffer, which requires extra copying.

You can minimise the costs of creating the results by using unboxed
versions of the relevant operations.

> Approach 3 could be doing approach 1 but using foreign code.
> The foreign code could allocate space for the whole file using malloc,
> not Boehm, which would then allow it to free this block when done.
> It could then call Boehm to allocate space for each line as it finds
> the newline chars, and add the result to the end of a list, without
> having to reverse any list.
>
> Approach 4 is the same as approach 3, but instead of calling
> stat on the file, mallocing a block of the required size and reading
> the file into it, it would just mmap the file. This would be faster,
> but maybe less portable. (With either approach, we will have to
> document the fact that read_named_file_as_X can handle only
> ordinary files, not special files such as ttys or named pipes,
> since stat cannot give a meaningful size for those.)
>
> I believe approach 4 is likely to yield the fastest result, and thus
> should be the one implemented. Would you argue for approach 2?
> And does anyone have an even better approach?

For (4), you are going to have to deal with at least four cases:

     a. POSIX-like systems for the C backends.
     b. Windows for the C backends.
     c. Java.
     d. C#.

It is possible to do memory-mapped I/O with all of the above, although
I'm not sure off the top of my head how significantly the details
differ between them.

I agree with Peter that you should use the mmap approach (where
possible) and provide a fallback implementation.
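
To make case (a) concrete, here is a rough sketch of "mmap where possible,
otherwise fall back to malloc + read", followed by a single forward pass
that splits the buffer into lines. It is only an illustration under some
assumptions: the names file_buf, load_file, release_file, split_lines and
free_lines are made up for this example and are not part of the Mercury
library or runtime, and the lines are allocated with plain malloc where the
real foreign code would presumably ask Boehm for the memory instead.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical holder for a file's contents (not a Mercury runtime type). */
typedef struct {
    char   *data;
    size_t  len;
    int     is_mapped;   /* 1 if data came from mmap, 0 if from malloc */
} file_buf;

/* Load a regular file; returns 0 on success, -1 on failure. */
static int load_file(const char *path, file_buf *fb)
{
    struct stat st;
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    /* Only ordinary files have a meaningful size; reject ttys, pipes etc. */
    if (fstat(fd, &st) != 0 || !S_ISREG(st.st_mode)) { close(fd); return -1; }
    fb->data = NULL;
    fb->len = (size_t) st.st_size;
    fb->is_mapped = 0;
    if (fb->len == 0) { close(fd); return 0; }   /* cannot mmap zero bytes */

    void *p = mmap(NULL, fb->len, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p != MAP_FAILED) {
        fb->data = p;
        fb->is_mapped = 1;
        close(fd);               /* the mapping stays valid after close */
        return 0;
    }

    /* Fallback: read the whole file into a malloc'd block. */
    fb->data = malloc(fb->len);
    if (fb->data == NULL) { close(fd); return -1; }
    size_t got = 0;
    while (got < fb->len) {
        ssize_t n = read(fd, fb->data + got, fb->len - got);
        if (n <= 0) break;       /* error or unexpected end of file */
        got += (size_t) n;
    }
    close(fd);
    if (got != fb->len) { free(fb->data); fb->data = NULL; return -1; }
    return 0;
}

/* Unmap or free the block once the lines have been copied out. */
static void release_file(file_buf *fb)
{
    if (fb->data == NULL) return;
    if (fb->is_mapped) munmap(fb->data, fb->len); else free(fb->data);
    fb->data = NULL;
}

static void free_lines(char **lines, size_t count)
{
    for (size_t i = 0; i < count; i++) free(lines[i]);
    free(lines);
}

/*
 * Copy each line, without its newline, into its own NUL-terminated string.
 * Lines are produced in file order, so nothing needs to be reversed.
 * Returns NULL on allocation failure.
 */
static char **split_lines(const char *data, size_t len, size_t *count)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) if (data[i] == '\n') n++;
    if (len > 0 && data[len - 1] != '\n') n++;   /* unterminated last line */

    char **lines = malloc((n ? n : 1) * sizeof *lines);
    if (lines == NULL) return NULL;

    size_t start = 0, k = 0;
    while (start < len) {
        const char *nl = memchr(data + start, '\n', len - start);
        size_t end = nl ? (size_t) (nl - data) : len;
        lines[k] = malloc(end - start + 1);
        if (lines[k] == NULL) { free_lines(lines, k); return NULL; }
        memcpy(lines[k], data + start, end - start);
        lines[k][end - start] = '\0';
        k++;
        start = end + 1;
    }
    *count = k;
    return lines;
}
```

A caller would do load_file, then split_lines, then release_file, so the
whole-file block goes away as soon as the lines exist. Cases (b) to (d)
would need their own equivalents (CreateFileMapping/MapViewOfFile on
Windows, FileChannel.map in Java, MemoryMappedFile in C#), which is where
the fallback implementation earns its keep.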


> By the way, a version of read_line_as_string that does not add
> the final newline is likely to be useful in its own right, even if
> we decide against approach 2. Possible names: read_line_as_string_nonl,
> and read_line_as_chomped_string (as in perl's chomp function).

Or indeed, as in Mercury's chomp function ;-)

Julien.

