[m-rev.] for review: read_whole_file_as_{string, lines}, split_into_lines

Peter Wang novalazy at gmail.com
Wed Apr 14 12:12:58 AEST 2021
Previous message: [m-rev.] for review: read_whole_file_as_{string, lines}, split_into_lines
Next message: [m-rev.] for review: read_whole_file_as_{string, lines}, split_into_lines
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Wed, 14 Apr 2021 01:26:10 +1000 "Zoltan Somogyi" <zoltan.somogyi at runbox.com> wrote:
> For review by anyone. I am particularly interested in feedback
> on the comments about tail recursion in string.m.
> 
> Zoltan.

> Simplify reading files as lines.
> 
> library/io.m:
>     Add two new predicates.
> 
>     The new predicate read_whole_line_as_string reads in a named file,
>     and returns its contents as a single giant string. This is effectively
>     a combination of open_input, read_file_as_string, and close_input.

s/whole_line/whole_file

> 
>     The new predicate read_whole_line_as_lines reads in a named file,
>     and returns its contents as a list of strings, each containing
>     the contents one line (minus the newline char at the end, if any).
>     This is effectively a combination of open_input, read_file_as_string,
>     split_into_lines (see below), and close_input.

s/whole_line/whole_file

> 
> library/string.m:
>     Add a new function, split_into_lines, which breaks up a string into
>     its constituent lines, returning each line minus its newline.
> 
>     This differs from the old split_at_char('\n', ...) in that
>     expects the input string to end with a newline, and does not return
>     an empty string after a final newline. (If there *are* characters
>     after the final newline, it does return *them* as the final line.)
> 
> NEWS:
>     Announce the new predicates and function.
> 
> compiler/source_file_map.m:
>     Use the new functionality to simplify the code that reads in
>     Mercury.modules files.
> 
> tests/hard_coded/string_split.{m,exp}:
>     Add tests for split_into_lines.

> diff --git a/compiler/source_file_map.m b/compiler/source_file_map.m
> index fed8588f1..7e092fdff 100644
> --- a/compiler/source_file_map.m
> +++ b/compiler/source_file_map.m
> +        Lines = [HeadLine | TailLines],
> +        ( if string.sub_string_search(HeadLine, "\t", TabIndex) then
> +            string.length(HeadLine, LineLength),
> +            string.between(HeadLine, 0, TabIndex, ModuleNameStr),
> +            string.between(HeadLine, TabIndex + 1, LineLength, FileName),

Use string.unsafe_between.

> +            ModuleName = string_to_sym_name(ModuleNameStr),
> +            % XXX A module cannot be contained in two files, which means that
> +            % ModuleName should be a unique key in the forward map.
> +            % However, with nested modules, a single file may contain
> +            % more than one module, so FileName may *not* be a unique key
> +            % in the backward map.
> +            % XXX However, if Mercury.modules contains more than one line
> +            % with the same filename, then this code has a bug, because
> +            % the call sequence
> +            %
> +            %   bimap.set("module_a", "filename", !SourceFileMap)
> +            %   bimap.set("module_a.sub1", "filename", !SourceFileMap)
> +            %   bimap.set("module_a.sub2", "filename", !SourceFileMap)
> +            %
> +            % will leave only one key that maps to the value "filename",
> +            % which will be the last one added ("module_a.sub2" in this case).
> +            %
> +            % XXX We should call bimap.det_insert here to abort in such
> +            % situations, but I (zs) am not sure that output generated by
> +            % write_source_file_map is guaranteed to be a bijection.
> +            bimap.set(ModuleName, FileName, !SourceFileMap),
> +            parse_source_file_map(TailLines, ErrorMsg, !SourceFileMap)
>          else
> -            read_until_char(EndChar, [Char | Chars0], Result, !IO)
> -        )
> -    ;
> -        CharRes = eof,
> -        (
> -            Chars0 = [],
> -            Result = eof
> -        ;
> -            Chars0 = [_ | _],
> -            Result = ok(Chars0)
> +            ErrorMsg = "the line <" ++ HeadLine ++ "> contains no tab char"
>          )

Maybe just say "line N is missing a tab character"? I'm wary of dumping
unparsed garbage to the terminal. It's unlikely to come up unless
someone specifically crafts a Mercury.modules file (and, say, commits it
to a git repository), and we probably have many other places that could
be exploited, so leave it if you prefer.

> diff --git a/library/io.m b/library/io.m
> index 37ed9244e..023868296 100644
> --- a/library/io.m
> +++ b/library/io.m
> @@ -1299,6 +1299,32 @@
>  % Whole file input predicates.
>  %
>  
> +    % Open and read the named file, and if successful, return its contents
> +    % as a string. If either the opening or the reading fails, return
> +    % an error message describing the failure.
> +    %
> +    % WARNING: the returned string is NOT guaranteed to be valid UTF-8
> +    % or UTF-16.
> +    %
> +:- pred read_whole_file_as_string(string::in, io.res(string)::out,
> +    io::di, io::uo) is det.

"whole" doesn't distinguish it from read_file_as_string.
Perhaps "read_named_file_as_string"?

> +
> +    % Open and read the named file, and if successful, return its contents
> +    % as a list of lines. If either the opening or the reading fails, return
> +    % an error message describing the failure.
> +    %
> +    % This predicate views files as consisting of a sequence of lines,
> +    % with each line consisting of a possibly empty sequence of non-newline
> +    % characters, followed either by a newline character, or by the
> +    % end of the file. The string returned for each line will not contain
> +    % the newline character.
> +    %
> +    % WARNING: the returned string is NOT guaranteed to be valid UTF-8
> +    % or UTF-16.
> +    %
> +:- pred read_whole_file_as_lines(string::in, io.res(list(string))::out,
> +    io::di, io::uo) is det.
> +

Same here.

> @@ -9132,6 +9158,45 @@ flush_binary_output(binary_output_stream(Stream), !IO) :-
>  % Whole file input predicates.
>  %
>  
> +read_whole_file_as_string(FileName, Result, !IO) :-
> +    io.open_input(FileName, OpenResult, !IO),
> +    (
> +        OpenResult = ok(FileStream),
> +        io.read_file_as_string(FileStream, ReadResult, !IO),
> +        (
> +            ReadResult = ok(String),
> +            Result = ok(String)
> +        ;
> +            ReadResult = error(_PartialString, Error),
> +            Result = error(Error)
> +        ),
> +        io.close_input(FileStream, !IO)
> +    ;
> +        OpenResult = error(Error),
> +        Result = error(Error)
> +    ).
> +
> +read_whole_file_as_lines(FileName, Result, !IO) :-
> +    io.open_input(FileName, OpenResult, !IO),
> +    (
> +        OpenResult = ok(FileStream),
> +        io.read_file_as_string(FileStream, ReadResult, !IO),

Ideally, this predicate would not create one big string first.
It would require a variation on io.read_line_as_string that does not
include the terminating newline in its result.

> +        (
> +            ReadResult = ok(String),
> +            Lines = split_into_lines(String),
> +            Result = ok(Lines)
> +        ;
> +            ReadResult = error(_PartialString, Error),
> +            Result = error(Error)
> +        ),
> +        io.close_input(FileStream, !IO)
> +    ;
> +        OpenResult = error(Error),
> +        Result = error(Error)
> +    ).
> +
> +%---------------------%
> +
>  read_file(Result, !IO) :-
>      input_stream(Stream, !IO),
>      read_file(Stream, Result, !IO).

> diff --git a/library/string.m b/library/string.m
> index c36b5a7af..d08c5ef53 100644
> --- a/library/string.m
> +++ b/library/string.m
> @@ -956,6 +956,14 @@
>      %
>  :- func split_at_string(string, string) = list(string).
>  
> +    % split_into_lines(String) break String into a sequence of lines,

breaks

> +    % with each line consisting of a possibly empty sequence of non-newline
> +    % characters, followed either by a newline character, or by the end
> +    % of the string. The string returned for a line will not contain
> +    % the newline character.
> +    %
> +:- func split_into_lines(string) = list(string).
> +
>  %---------------------------------------------------------------------------%
>  %
>  % Dealing with prefixes and suffixes.

> +split_at_string_loop(Separator, SeparatorLen, Str, CurPos, Segments) :-
> +    ( if unsafe_sub_string_search_start(Str, Separator, CurPos, SepPos) then
> +        HeadSegment = unsafe_between(Str, CurPos, SepPos),
> +        % This call is not tail recursive. In the vast majority of cases,
> +        % where Str is not very long, this is ok. Where Str *is* very long,
> +        % performance may be bad, *unless* we compile this file with
> +        % --optimize-constructor-last-call.
> +        split_at_string_loop(Separator, SeparatorLen,
> +            Str, SepPos + SeparatorLen, TailSegments),
> +        Segments = [HeadSegment | TailSegments]

We compile with --optimize-constructor-last-call by default.

>      else
> -        unsafe_between(Total, StartAt, length(Total), Last),
> -        Out = [Last]
> +        unsafe_between(Str, CurPos, string.length(Str), LastSegment),
> +        Segments = [LastSegment]
> +    ).
> +
> +%---------------------%
> +
> +split_into_lines(Str) = Lines :-
> +    split_into_lines_loop(Str, 0, cord.init, LinesCord),
> +    Lines = cord.list(LinesCord).

I would just use list(T). list.reverse is simpler and more efficient
than cord.to_list, and there's no risk of confusion here IMO.

> +
> +:- pred split_into_lines_loop(string::in, int::in,
> +    cord(string)::in, cord(string)::out) is det.
> +
> +split_into_lines_loop(Str, CurPos, !LinesCord) :-
> +    ( if unsafe_sub_string_search_start(Str, "\n", CurPos, SepPos) then
> +        Line = unsafe_between(Str, CurPos, SepPos),
> +        !:LinesCord = cord.snoc(!.LinesCord, Line),
> +        % Unlike split_at_string, split_into_lines can absolutely be expected
> +        % to be invoked on huge strings fairly frequentl, which is why
> +        % this is a tail call. The price of this optimization is the call
> +        % to cord.list in our parent.
> +        split_into_lines_loop(Str, SepPos + 1, !LinesCord)

frequently

But I suggest:

    split_into_lines is used by io.read_whole_file_as_lines
    so may be invoked on potentially huge strings,
    which is why we ensure the following is a tail call.

> +    else
> +        unsafe_between(Str, CurPos, string.length(Str), LastLine),
> +        ( if LastLine = "" then
> +            true

Could check for CurPos = string.length(Str) instead.

> +        else
> +            !:LinesCord = cord.snoc(!.LinesCord, LastLine)
> +        )
>      ).
>  
>  %---------------------------------------------------------------------------%

Peter
Previous message: [m-rev.] for review: read_whole_file_as_{string, lines}, split_into_lines
Next message: [m-rev.] for review: read_whole_file_as_{string, lines}, split_into_lines
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the reviews mailing list