[m-rev.] for review: read_whole_file_as_{string, lines}, split_into_lines
Peter Wang
novalazy at gmail.com
Wed Apr 14 12:12:58 AEST 2021
On Wed, 14 Apr 2021 01:26:10 +1000 "Zoltan Somogyi" <zoltan.somogyi at runbox.com> wrote:
> For review by anyone. I am particularly interested in feedback
> on the comments about tail recursion in string.m.
>
> Zoltan.
> Simplify reading files as lines.
>
> library/io.m:
> Add two new predicates.
>
> The new predicate read_whole_line_as_string reads in a named file,
> and returns its contents as a single giant string. This is effectively
> a combination of open_input, read_file_as_string, and close_input.
s/whole_line/whole_file
>
> The new predicate read_whole_line_as_lines reads in a named file,
> and returns its contents as a list of strings, each containing
> the contents one line (minus the newline char at the end, if any).
> This is effectively a combination of open_input, read_file_as_string,
> split_into_lines (see below), and close_input.
s/whole_line/whole_file
>
> library/string.m:
> Add a new function, split_into_lines, which breaks up a string into
> its constituent lines, returning each line minus its newline.
>
> This differs from the old split_at_char('\n', ...) in that
> expects the input string to end with a newline, and does not return
> an empty string after a final newline. (If there *are* characters
> after the final newline, it does return *them* as the final line.)
>
> NEWS:
> Announce the new predicates and function.
>
> compiler/source_file_map.m:
> Use the new functionality to simplify the code that reads in
> Mercury.modules files.
>
> tests/hard_coded/string_split.{m,exp}:
> Add tests for split_into_lines.
> diff --git a/compiler/source_file_map.m b/compiler/source_file_map.m
> index fed8588f1..7e092fdff 100644
> --- a/compiler/source_file_map.m
> +++ b/compiler/source_file_map.m
> + Lines = [HeadLine | TailLines],
> + ( if string.sub_string_search(HeadLine, "\t", TabIndex) then
> + string.length(HeadLine, LineLength),
> + string.between(HeadLine, 0, TabIndex, ModuleNameStr),
> + string.between(HeadLine, TabIndex + 1, LineLength, FileName),
Use string.unsafe_between.
> + ModuleName = string_to_sym_name(ModuleNameStr),
> + % XXX A module cannot be contained in two files, which means that
> + % ModuleName should be a unique key in the forward map.
> + % However, with nested modules, a single file may contain
> + % more than one module, so FileName may *not* be a unique key
> + % in the backward map.
> + % XXX However, if Mercury.modules contains more than one line
> + % with the same filename, then this code has a bug, because
> + % the call sequence
> + %
> + % bimap.set("module_a", "filename", !SourceFileMap)
> + % bimap.set("module_a.sub1", "filename", !SourceFileMap)
> + % bimap.set("module_a.sub2", "filename", !SourceFileMap)
> + %
> + % will leave only one key that maps to the value "filename",
> + % which will be the last one added ("module_a.sub2" in this case).
> + %
> + % XXX We should call bimap.det_insert here to abort in such
> + % situations, but I (zs) am not sure that output generated by
> + % write_source_file_map is guaranteed to be a bijection.
> + bimap.set(ModuleName, FileName, !SourceFileMap),
> + parse_source_file_map(TailLines, ErrorMsg, !SourceFileMap)
> else
> - read_until_char(EndChar, [Char | Chars0], Result, !IO)
> - )
> - ;
> - CharRes = eof,
> - (
> - Chars0 = [],
> - Result = eof
> - ;
> - Chars0 = [_ | _],
> - Result = ok(Chars0)
> + ErrorMsg = "the line <" ++ HeadLine ++ "> contains no tab char"
> )
Maybe just say "line N is missing a tab character"? I'm wary of dumping
unparsed garbage to the terminal. It's unlikely to come up unless
someone specifically crafts a Mercury.modules file (and, say, commits it
to a git repository), and we probably have many other places that could
be exploited, so leave it if you prefer.
> diff --git a/library/io.m b/library/io.m
> index 37ed9244e..023868296 100644
> --- a/library/io.m
> +++ b/library/io.m
> @@ -1299,6 +1299,32 @@
> % Whole file input predicates.
> %
>
> + % Open and read the named file, and if successful, return its contents
> + % as a string. If either the opening or the reading fails, return
> + % an error message describing the failure.
> + %
> + % WARNING: the returned string is NOT guaranteed to be valid UTF-8
> + % or UTF-16.
> + %
> +:- pred read_whole_file_as_string(string::in, io.res(string)::out,
> + io::di, io::uo) is det.
"whole" doesn't distinguish it from read_file_as_string.
Perhaps "read_named_file_as_string"?
> +
> + % Open and read the named file, and if successful, return its contents
> + % as a list of lines. If either the opening or the reading fails, return
> + % an error message describing the failure.
> + %
> + % This predicate views files as consisting of a sequence of lines,
> + % with each line consisting of a possibly empty sequence of non-newline
> + % characters, followed either by a newline character, or by the
> + % end of the file. The string returned for each line will not contain
> + % the newline character.
> + %
> + % WARNING: the returned string is NOT guaranteed to be valid UTF-8
> + % or UTF-16.
> + %
> +:- pred read_whole_file_as_lines(string::in, io.res(list(string))::out,
> + io::di, io::uo) is det.
> +
Same here.
> @@ -9132,6 +9158,45 @@ flush_binary_output(binary_output_stream(Stream), !IO) :-
> % Whole file input predicates.
> %
>
> +read_whole_file_as_string(FileName, Result, !IO) :-
> + io.open_input(FileName, OpenResult, !IO),
> + (
> + OpenResult = ok(FileStream),
> + io.read_file_as_string(FileStream, ReadResult, !IO),
> + (
> + ReadResult = ok(String),
> + Result = ok(String)
> + ;
> + ReadResult = error(_PartialString, Error),
> + Result = error(Error)
> + ),
> + io.close_input(FileStream, !IO)
> + ;
> + OpenResult = error(Error),
> + Result = error(Error)
> + ).
> +
> +read_whole_file_as_lines(FileName, Result, !IO) :-
> + io.open_input(FileName, OpenResult, !IO),
> + (
> + OpenResult = ok(FileStream),
> + io.read_file_as_string(FileStream, ReadResult, !IO),
Ideally, this predicate would not create one big string first.
It would require a variation on io.read_line_as_string that does not
include the terminating newline in its result.
> + (
> + ReadResult = ok(String),
> + Lines = split_into_lines(String),
> + Result = ok(Lines)
> + ;
> + ReadResult = error(_PartialString, Error),
> + Result = error(Error)
> + ),
> + io.close_input(FileStream, !IO)
> + ;
> + OpenResult = error(Error),
> + Result = error(Error)
> + ).
> +
> +%---------------------%
> +
> read_file(Result, !IO) :-
> input_stream(Stream, !IO),
> read_file(Stream, Result, !IO).
> diff --git a/library/string.m b/library/string.m
> index c36b5a7af..d08c5ef53 100644
> --- a/library/string.m
> +++ b/library/string.m
> @@ -956,6 +956,14 @@
> %
> :- func split_at_string(string, string) = list(string).
>
> + % split_into_lines(String) break String into a sequence of lines,
breaks
> + % with each line consisting of a possibly empty sequence of non-newline
> + % characters, followed either by a newline character, or by the end
> + % of the string. The string returned for a line will not contain
> + % the newline character.
> + %
> +:- func split_into_lines(string) = list(string).
> +
> %---------------------------------------------------------------------------%
> %
> % Dealing with prefixes and suffixes.
> +split_at_string_loop(Separator, SeparatorLen, Str, CurPos, Segments) :-
> + ( if unsafe_sub_string_search_start(Str, Separator, CurPos, SepPos) then
> + HeadSegment = unsafe_between(Str, CurPos, SepPos),
> + % This call is not tail recursive. In the vast majority of cases,
> + % where Str is not very long, this is ok. Where Str *is* very long,
> + % performance may be bad, *unless* we compile this file with
> + % --optimize-constructor-last-call.
> + split_at_string_loop(Separator, SeparatorLen,
> + Str, SepPos + SeparatorLen, TailSegments),
> + Segments = [HeadSegment | TailSegments]
We compile with --optimize-constructor-last-call by default.
> else
> - unsafe_between(Total, StartAt, length(Total), Last),
> - Out = [Last]
> + unsafe_between(Str, CurPos, string.length(Str), LastSegment),
> + Segments = [LastSegment]
> + ).
> +
> +%---------------------%
> +
> +split_into_lines(Str) = Lines :-
> + split_into_lines_loop(Str, 0, cord.init, LinesCord),
> + Lines = cord.list(LinesCord).
I would just use list(T). list.reverse is simpler and more efficient
than cord.to_list, and there's no risk of confusion here IMO.
> +
> +:- pred split_into_lines_loop(string::in, int::in,
> + cord(string)::in, cord(string)::out) is det.
> +
> +split_into_lines_loop(Str, CurPos, !LinesCord) :-
> + ( if unsafe_sub_string_search_start(Str, "\n", CurPos, SepPos) then
> + Line = unsafe_between(Str, CurPos, SepPos),
> + !:LinesCord = cord.snoc(!.LinesCord, Line),
> + % Unlike split_at_string, split_into_lines can absolutely be expected
> + % to be invoked on huge strings fairly frequentl, which is why
> + % this is a tail call. The price of this optimization is the call
> + % to cord.list in our parent.
> + split_into_lines_loop(Str, SepPos + 1, !LinesCord)
frequently
But I suggest:
split_into_lines is used by io.read_whole_file_as_lines
so may be invoked on potentially huge strings,
which is why we ensure the following is a tail call.
> + else
> + unsafe_between(Str, CurPos, string.length(Str), LastLine),
> + ( if LastLine = "" then
> + true
Could check for CurPos = string.length(Str) instead.
> + else
> + !:LinesCord = cord.snoc(!.LinesCord, LastLine)
> + )
> ).
>
> %---------------------------------------------------------------------------%
Peter
More information about the reviews
mailing list