[m-rev.] for review: read_whole_file_as_{string, lines}, split_into_lines
Julien Fischer
jfischer at opturion.com
Thu Apr 15 14:05:31 AEST 2021
On Thu, 15 Apr 2021, Zoltan Somogyi wrote:
> Approach 2: as you say, implement a variant of read_line_as_string
> that does not include the newline, and then call it repeatedly.
> The downsides of this approach are the need to pay the cost of
> crossing between Mercury and the target language once per line
> (e.g. to create and then test a read_line_as_string_result),
> and possibly having to use (and then not being able to *re*use)
> a malloc'd block as a line buffer, which requires extra copying.
You can minimise the costs of creating the results by using unboxed
versions of the relevant operations.
> Approach 3 could be doing approach 1 but using foreign code.
> The foreign code could allocate space for the whole file using malloc,
> not Boehm, which would then allow it to free this block when done.
> It could then call Boehm to allocate space for each line as it finds
> the newline chars, and add the result to the end of a list, without
> having to reverse any list.
>
> Approach 4 is the same as approach 3, but instead of calling
> stat on the file, mallocing a block of the required size and reading
> the file into it, it would just mmap the file. This would be faster,
> but maybe less portable. (With either approach, we will have to
> document the fact that read_named_file_as_X can handle only
> ordinary files, not special files such as ttys or named pipes,
> since stat cannot give a meaningful size for those.)
>
> I believe approach 4 is likely to yield the fastest result, and thus
> should be the one implemented. Would you argue for approach 2?
> And does anyone have an even better approach?
For (4), you are going to have to deal with at least four cases:
a. POSIX-like systems for the C backends.
b. Windows for the C backends.
c. Java
d. C#
It is possible to do memory-mapped I/O with all of the above, although
I'm not sure off the top of my head how significantly the details
differ between them.
I agree with Peter that you should use the mmap approach (where
possible) and provide a fallback implementation.
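
To make case (a) concrete, here is a rough sketch in C of approach 4 with
an approach-3-style fallback, for POSIX-like systems only. It is an
illustration, not code from the Mercury library: the names line_array,
split_into_lines, read_file_as_lines and free_lines are hypothetical, and
the lines are allocated with malloc where the real foreign code would
presumably call Boehm and build a Mercury list. It rejects anything that
is not an ordinary file, matching the restriction discussed above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical result type: an array of NUL-terminated lines. */
typedef struct {
    char   **lines;
    size_t   count;
} line_array;

/* Split buf[0..len) at '\n', copying each line without its newline. */
static int split_into_lines(const char *buf, size_t len, line_array *out)
{
    size_t cap = 16, n = 0, start = 0;
    char **lines = malloc(cap * sizeof *lines);
    if (lines == NULL) return -1;

    while (start < len) {
        const char *nl = memchr(buf + start, '\n', len - start);
        size_t end = (nl != NULL) ? (size_t)(nl - buf) : len;
        if (n == cap) {
            char **tmp = realloc(lines, (cap *= 2) * sizeof *lines);
            if (tmp == NULL) goto fail;
            lines = tmp;
        }
        lines[n] = malloc(end - start + 1);
        if (lines[n] == NULL) goto fail;
        memcpy(lines[n], buf + start, end - start);
        lines[n][end - start] = '\0';
        n++;
        start = end + 1;            /* skip past the newline */
    }
    out->lines = lines;
    out->count = n;
    return 0;
fail:
    while (n > 0) free(lines[--n]);
    free(lines);
    return -1;
}

/* Read an ordinary file: try mmap first, fall back to malloc + read. */
int read_file_as_lines(const char *path, line_array *out)
{
    assert(path != NULL && out != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) != 0 || !S_ISREG(st.st_mode)) {
        close(fd);                  /* ttys, pipes etc. are not handled */
        return -1;
    }
    size_t len = (size_t)st.st_size;
    int rc = -1;

    if (len == 0) {
        /* mmap rejects zero-length mappings; an empty file has no lines. */
        out->lines = NULL;
        out->count = 0;
        rc = 0;
    } else {
        void *map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
        if (map != MAP_FAILED) {
            rc = split_into_lines(map, len, out);
            munmap(map, len);
        } else {
            /* Fallback: read the whole file into a malloc'd block. */
            char *buf = malloc(len);
            if (buf != NULL) {
                size_t got = 0;
                ssize_t r;
                while (got < len && (r = read(fd, buf + got, len - got)) > 0)
                    got += (size_t)r;
                if (got == len) rc = split_into_lines(buf, len, out);
                free(buf);
            }
        }
    }
    close(fd);
    return rc;
}

/* Release everything allocated by read_file_as_lines. */
void free_lines(line_array *a)
{
    for (size_t i = 0; i < a->count; i++) free(a->lines[i]);
    free(a->lines);
    a->lines = NULL;
    a->count = 0;
}
```

A caller would pass a path and a line_array to read_file_as_lines, check
for a zero return, and call free_lines when done. Cases (b) to (d) would
each need their own equivalent of the mmap call; the fallback branch is
the part that should carry over more or less unchanged.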
> By the way, a version of read_line_as_string that does not add
> the final newline is likely to be useful in its own right, even if
> we decide against approach 2. Possible names: read_line_as_string_nonl,
> and read_line_as_chomped_string (as in perl's chomp function).
Or indeed, as in Mercury's chomp function ;-)
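
For anyone unfamiliar with the operation, the following C sketch shows
the behaviour meant by "chomp" here: remove a single trailing newline if
there is one, and otherwise leave the string alone. The function name
chomp_in_place is hypothetical, and this is not how Mercury's own chomp
is implemented; it only illustrates the semantics a chomped variant of
read_line_as_string would need.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: strip at most one trailing '\n' from s, in place,
 * and return the resulting length.
 */
size_t chomp_in_place(char *s)
{
    assert(s != NULL);
    size_t n = strlen(s);
    if (n > 0 && s[n - 1] == '\n') {
        s[--n] = '\0';
    }
    return n;
}
```

Only one newline is removed, so a line that ends in two newlines keeps
the first of them.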
Julien.