[m-rev.] for review: escape all control characters in io.write, deconstruct.functor

Julien Fischer jfischer at opturion.com
Fri Jun 15 18:02:56 AEST 2018


For review by anyone.

I'll update the NEWS file separately.

----------------------------------------------------

Escape all control characters in io.write, deconstruct.functor etc.

The above predicates currently escape only the C0 control characters and
Delete.  This change modifies them to escape every character in the Unicode
category `Other, control' (Cc), which adds the C1 control characters
(0x80 - 0x9f), using backslash escapes where they exist and octal escapes
otherwise.
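
To make the rule concrete, here is a minimal standalone sketch in C.  It is
not the code in this patch: escape_char is a hypothetical helper, and its
table covers only the special escapes listed in
mercury_escape_special_char in the diff below.  It writes the escape
without the surrounding quotes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
** Hypothetical helper (not part of the Mercury runtime).
** If codepoint c needs an escape, write it to buf (without the surrounding
** quotes) and return its length; otherwise set buf to "" and return 0.
*/
static int
escape_char(unsigned long c, char *buf, size_t size)
{
    /* Special escapes, in the same order as the list in term_io.m. */
    static const char specials[] = "\a\b\f\n\r\t\v\\'";
    static const char letters[]  = "abfnrtv\\'";
    const char *p;

    /* The longest output is "\ooo\" plus the terminating NUL. */
    assert(buf != NULL && size >= 6);

    /* Guard c != 0: strchr would otherwise match the terminating NUL. */
    p = (c != 0 && c < 0x80) ? strchr(specials, (int) c) : NULL;
    if (p != NULL) {
        return snprintf(buf, size, "\\%c", letters[p - specials]);
    }

    /* Unicode category Cc: C0 controls, Delete and the C1 controls. */
    if (c <= 0x1f || (0x7f <= c && c <= 0x9f)) {
        return snprintf(buf, size, "\\%03lo\\", c);
    }

    buf[0] = '\0';
    return 0;
}
```

With this sketch, 0x80 and 0x9f yield \200\ and \237\, while '~' (0x7e) and
no-break space (0xa0) are left unescaped; these are the boundaries the
extended tests exercise.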

library/term_io.m:
     Do not treat C1 control characters as Mercury source characters.

     Re-order the list of Mercury punctuation characters by codepoint
     order; it is difficult to check for completeness otherwise.

     Put the list of special character escapes in order.

runtime/mercury_ml_expand_body.h:
library/rtti_implementation.m:
    Update the implementations of functor/4 to escape all control
    characters when returning the functor of a character.

library/deconstruct.m:
     Specify that functor/4 should escape all control characters in
     the value returned for characters and strings.  (XXX TODO: it
     currently doesn't implement the new behaviour for strings; I'll
     add that separately.)

library/io.m:
library/stream.string_writer.m:
     Similar to the above but for io.write etc.

tests/hard_coded/write.{m,exp}:
tests/hard_coded/deconstruct_arg.{m,exp,exp2}:
    Extend these tests to cover the block of C1 control characters
    and the boundaries around it.

Julien.

diff --git a/library/deconstruct.m b/library/deconstruct.m
index b957cf6..0bfb506 100644
--- a/library/deconstruct.m
+++ b/library/deconstruct.m
@@ -74,11 +74,15 @@
      %     handled as if it had standard equality.
      %   - for integers, the string is a base 10 number;
      %     positive integers have no sign.
-    %   - for finite floats, the string is a floating point, base 10 number;
-    %     positive floating point numbers have no sign.
-    %   - for infinite floats, the string "infinity" or "-infinity";
-    %   - for strings, the string, inside double quotation marks
-    %   - for characters, the character inside single quotation marks
+    %   - for finite floats, the string is a base 10 floating point number;
+    %     positive floating point numbers have no sign;
+    %     for infinite floats, the string "infinity" or "-infinity".
+    %   - for strings, the string, inside double quotation marks using
+    %     backslash escapes if necessary and backslash or octal escapes for
+    %     all characters for which char.is_control/1 is true.
+    %   - for characters, the character inside single quotation marks using
+    %     a backslash escape if necessary and a backslash or octal escape
+    %     for all characters for which char.is_control/1 is true.
      %   - for predicates, the string <<predicate>>, and for functions,
      %     the string <<function>>, except with include_details_cc,
      %     in which case it will be the predicate or function name.
diff --git a/library/io.m b/library/io.m
index 40baad0..5c3e6a7 100644
--- a/library/io.m
+++ b/library/io.m
@@ -430,14 +430,16 @@
      % be valid Mercury syntax whenever possible.
      %
      % Strings and characters are always printed out in quotes, using backslash
-    % escapes if necessary. For higher-order types, or for types defined using
-    % the foreign language interface (pragma foreign_type), the text output
-    % will only describe the type that is being printed, not the value, and the
-    % result may not be parsable by `read'. For the types containing
-    % existential quantifiers, the type `type_desc' and closure types, the
-    % result may not be parsable by `read', either. But in all other cases
-    % the format used is standard Mercury syntax, and if you append a period
-    % and newline (".\n"), then the results can be read in again using `read'.
+    % escapes if necessary and backslash or octal escapes for all characters
+    % for which char.is_control/1 is true. For higher-order types, or for types
+    % defined using the foreign language interface (pragma foreign_type), the
+    % text output will only describe the type that is being printed, not the
+    % value, and the result may not be parsable by `read'.  For the types
+    % containing existential quantifiers, the type `type_desc' and closure
+    % types, the result may not be parsable by `read', either. But in all other
+    % cases the format used is standard Mercury syntax, and if you append a
+    % period and newline (".\n"), then the results can be read in again using
+    % `read'.
      %
      % write/5 is the same as write/4 except that it allows the caller
      % to specify how non-canonical types should be handled. write_cc/3
diff --git a/library/rtti_implementation.m b/library/rtti_implementation.m
index 9ed3b82..45967aa 100644
--- a/library/rtti_implementation.m
+++ b/library/rtti_implementation.m
@@ -2819,11 +2819,9 @@ deconstruct_2(Term, TypeInfo, TypeCtorInfo, TypeCtorRep, NonCanon,
          ( if quote_special_escape_char(Char, EscapedChar) then
              Functor = EscapedChar
          else if
-            Int = char.to_int(Char),
-            ( 0x0 =< Int, Int =< 0x1f
-            ; Int = 0x7f
-            )
+            char.is_control(Char)
          then
+            char.to_int(Char, Int),
              string.int_to_base_string(Int, 8, OctalString0),
              string.pad_left(OctalString0, '0', 3, OctalString),
              Functor  = "'\\" ++ OctalString ++ "\\'"
diff --git a/library/stream.string_writer.m b/library/stream.string_writer.m
index c276831..d2df2c7 100644
--- a/library/stream.string_writer.m
+++ b/library/stream.string_writer.m
@@ -124,14 +124,16 @@
      % valid Mercury syntax whenever possible.
      %
      % Strings and characters are always printed out in quotes, using backslash
-    % escapes if necessary.  For higher-order types, or for types defined using
-    % the foreign language interface (pragma foreign_type), the text output
-    % will only describe the type that is being printed, not the value, and the
-    % result may not be parsable by `read'.  For the types containing
-    % existential quantifiers, the type `type_desc' and closure types, the
-    % result may not be parsable by `read', either.  But in all other cases the
-    % format used is standard Mercury syntax, and if you append a period and
-    % newline (".\n"), then the results can be read in again using `read'.
+    % escapes if necessary and backslash or octal escapes for all characters
+    % for which char.is_control/1 is true. For higher-order types, or for types
+    % defined using the foreign language interface (pragma foreign_type), the
+    % text output will only describe the type that is being printed, not the
+    % value, and the result may not be parsable by `read'.  For the types
+    % containing existential quantifiers, the type `type_desc' and closure
+    % types, the result may not be parsable by `read', either.  But in all
+    % other cases the format used is standard Mercury syntax, and if you append
+    % a period and newline (".\n"), then the results can be read in again using
+    % `read'.
      %
      % write/5 is the same as write/4 except that it allows the caller to
      % specify how non-canonical types should be handled.  write_cc/4 is the
diff --git a/library/term_io.m b/library/term_io.m
index eeaef88..2f4e627 100644
--- a/library/term_io.m
+++ b/library/term_io.m
@@ -785,7 +785,7 @@ string_is_escaped_char(Char::out, String::in) :-
  is_mercury_source_char(Char) :-
      ( char.is_alnum(Char)
      ; is_mercury_punctuation_char(Char)
-    ; char.to_int(Char) >= 0x80
+    ; char.to_int(Char) >= 0xA0  % 0x7f - 0x9f are control characters.
      ).

  %---------------------------------------------------------------------------%
@@ -942,39 +942,43 @@ mercury_escape_char(Char) = EscapeCode :-
      % Note: the code here is similar to code in runtime/mercury_trace_base.c;
      % any changes here may require similar changes there.

+% Codepoints: 0x20 -> 0x2f.
  is_mercury_punctuation_char(' ').
  is_mercury_punctuation_char('!').
-is_mercury_punctuation_char('@').
+is_mercury_punctuation_char('"').
  is_mercury_punctuation_char('#').
  is_mercury_punctuation_char('$').
  is_mercury_punctuation_char('%').
-is_mercury_punctuation_char('^').
  is_mercury_punctuation_char('&').
-is_mercury_punctuation_char('*').
+is_mercury_punctuation_char('''').
  is_mercury_punctuation_char('(').
  is_mercury_punctuation_char(')').
-is_mercury_punctuation_char('-').
-is_mercury_punctuation_char('_').
+is_mercury_punctuation_char('*').
  is_mercury_punctuation_char('+').
-is_mercury_punctuation_char('=').
-is_mercury_punctuation_char('`').
-is_mercury_punctuation_char('~').
-is_mercury_punctuation_char('{').
-is_mercury_punctuation_char('}').
-is_mercury_punctuation_char('[').
-is_mercury_punctuation_char(']').
-is_mercury_punctuation_char(';').
+is_mercury_punctuation_char(',').
+is_mercury_punctuation_char('-').
+is_mercury_punctuation_char('.').
+is_mercury_punctuation_char('/').
+% Codepoints: 0x3a -> 0x40.
  is_mercury_punctuation_char(':').
-is_mercury_punctuation_char('''').
-is_mercury_punctuation_char('"').
+is_mercury_punctuation_char(';').
  is_mercury_punctuation_char('<').
+is_mercury_punctuation_char('=').
  is_mercury_punctuation_char('>').
-is_mercury_punctuation_char('.').
-is_mercury_punctuation_char(',').
-is_mercury_punctuation_char('/').
  is_mercury_punctuation_char('?').
+is_mercury_punctuation_char('@').
+% Codepoints: 0x5b -> 0x60.
+is_mercury_punctuation_char('[').
  is_mercury_punctuation_char('\\').
+is_mercury_punctuation_char(']').
+is_mercury_punctuation_char('^').
+is_mercury_punctuation_char('_').
+is_mercury_punctuation_char('`').
+% Codepoints: 0x7b -> 0x7e.
+is_mercury_punctuation_char('{').
  is_mercury_punctuation_char('|').
+is_mercury_punctuation_char('}').
+is_mercury_punctuation_char('~').

  %---------------------------------------------------------------------------%

@@ -1012,10 +1016,10 @@ encode_escaped_char(Char::out, Str::in) :-

  mercury_escape_special_char('\a', 'a').
  mercury_escape_special_char('\b', 'b').
-mercury_escape_special_char('\r', 'r').
  mercury_escape_special_char('\f', 'f').
-mercury_escape_special_char('\t', 't').
  mercury_escape_special_char('\n', 'n').
+mercury_escape_special_char('\r', 'r').
+mercury_escape_special_char('\t', 't').
  mercury_escape_special_char('\v', 'v').
  mercury_escape_special_char('\\', '\\').
  mercury_escape_special_char('''', '''').
diff --git a/runtime/mercury_ml_expand_body.h b/runtime/mercury_ml_expand_body.h
index 0db8252..8ec6089 100644
--- a/runtime/mercury_ml_expand_body.h
+++ b/runtime/mercury_ml_expand_body.h
@@ -893,10 +893,15 @@ EXPAND_FUNCTION_NAME(MR_TypeInfo type_info, MR_Word *data_word_ptr,
                      case '\n': str_ptr = "'\\n'";  break;
                      case '\v': str_ptr = "'\\v'";  break;
                      default:
-                        // Print C0 control characters and Delete in
-                        // octal.
-                        if (data_word <= 0x1f || data_word == 0x7f) {
-                            sprintf(buf, "\'\\%03o\\\'", data_word);
+                        // Print remaining control characters using octal
+                        // escapes.
+                        if (
+                            (0x00 <= data_word && data_word <= 0x1f) ||
+                            (0x7f <= data_word && data_word <= 0x9f)
+                        ) {
+                            sprintf(buf,
+                                "\'\\%03" MR_INTEGER_LENGTH_MODIFIER "o\\\'",
+                                data_word);
                          } else if (MR_is_ascii(data_word)) {
                              sprintf(buf, "\'%c\'", (char) data_word);
                          } else if (MR_is_surrogate(data_word)) {
diff --git a/tests/hard_coded/deconstruct_arg.exp b/tests/hard_coded/deconstruct_arg.exp
index 49f7a28..7f4ef4f 100644
--- a/tests/hard_coded/deconstruct_arg.exp
+++ b/tests/hard_coded/deconstruct_arg.exp
@@ -264,6 +264,20 @@ deconstruct deconstruct: functor '\'' arity 0
  deconstruct limited deconstruct 3 of '\''
  functor '\'' arity 0 []

+deconstruct functor: '~'/0
+deconstruct argument 0 of '~' doesn't exist
+deconstruct argument 1 of '~' doesn't exist
+deconstruct argument 2 of '~' doesn't exist
+deconstruct argument 'moo' doesn't exist
+deconstruct argument 'mooo!' doesn't exist
+deconstruct argument 'packed1' doesn't exist
+deconstruct argument 'packed2' doesn't exist
+deconstruct argument 'packed3' doesn't exist
+deconstruct deconstruct: functor '~' arity 0
+[]
+deconstruct limited deconstruct 3 of '~'
+functor '~' arity 0 []
+
  deconstruct functor: '\001\'/0
  deconstruct argument 0 of '\001\' doesn't exist
  deconstruct argument 1 of '\001\' doesn't exist
@@ -306,6 +320,48 @@ deconstruct deconstruct: functor '\177\' arity 0
  deconstruct limited deconstruct 3 of '\177\'
  functor '\177\' arity 0 []

+deconstruct functor: '\200\'/0
+deconstruct argument 0 of '\200\' doesn't exist
+deconstruct argument 1 of '\200\' doesn't exist
+deconstruct argument 2 of '\200\' doesn't exist
+deconstruct argument 'moo' doesn't exist
+deconstruct argument 'mooo!' doesn't exist
+deconstruct argument 'packed1' doesn't exist
+deconstruct argument 'packed2' doesn't exist
+deconstruct argument 'packed3' doesn't exist
+deconstruct deconstruct: functor '\200\' arity 0
+[]
+deconstruct limited deconstruct 3 of '\200\'
+functor '\200\' arity 0 []
+
+deconstruct functor: '\237\'/0
+deconstruct argument 0 of '\237\' doesn't exist
+deconstruct argument 1 of '\237\' doesn't exist
+deconstruct argument 2 of '\237\' doesn't exist
+deconstruct argument 'moo' doesn't exist
+deconstruct argument 'mooo!' doesn't exist
+deconstruct argument 'packed1' doesn't exist
+deconstruct argument 'packed2' doesn't exist
+deconstruct argument 'packed3' doesn't exist
+deconstruct deconstruct: functor '\237\' arity 0
+[]
+deconstruct limited deconstruct 3 of '\237\'
+functor '\237\' arity 0 []
+
+deconstruct functor: ' '/0
+deconstruct argument 0 of ' ' doesn't exist
+deconstruct argument 1 of ' ' doesn't exist
+deconstruct argument 2 of ' ' doesn't exist
+deconstruct argument 'moo' doesn't exist
+deconstruct argument 'mooo!' doesn't exist
+deconstruct argument 'packed1' doesn't exist
+deconstruct argument 'packed2' doesn't exist
+deconstruct argument 'packed3' doesn't exist
+deconstruct deconstruct: functor ' ' arity 0
+[]
+deconstruct limited deconstruct 3 of ' '
+functor ' ' arity 0 []
+
  deconstruct functor: 'Ω'/0
  deconstruct argument 0 of 'Ω' doesn't exist
  deconstruct argument 1 of 'Ω' doesn't exist
@@ -544,7 +600,7 @@ deconstruct deconstruct: functor newline arity 0
  deconstruct limited deconstruct 3 of '<<predicate>>'
  functor newline arity 0 []

-deconstruct functor: lambda_deconstruct_arg_m_176/1
+deconstruct functor: lambda_deconstruct_arg_m_182/1
  deconstruct argument 0 of '<<predicate>>' is [1, 2]
  deconstruct argument 1 of '<<predicate>>' doesn't exist
  deconstruct argument 2 of '<<predicate>>' doesn't exist
@@ -553,10 +609,10 @@ deconstruct argument 'mooo!' doesn't exist
  deconstruct argument 'packed1' doesn't exist
  deconstruct argument 'packed2' doesn't exist
  deconstruct argument 'packed3' doesn't exist
-deconstruct deconstruct: functor lambda_deconstruct_arg_m_176 arity 1
+deconstruct deconstruct: functor lambda_deconstruct_arg_m_182 arity 1
  [[1, 2]]
  deconstruct limited deconstruct 3 of '<<predicate>>'
-functor lambda_deconstruct_arg_m_176 arity 1 [[1, 2]]
+functor lambda_deconstruct_arg_m_182 arity 1 [[1, 2]]

  deconstruct functor: p/3
  deconstruct argument 0 of '<<predicate>>' is 1
diff --git a/tests/hard_coded/deconstruct_arg.exp2 b/tests/hard_coded/deconstruct_arg.exp2
index 349ed1c..bc508fa 100644
--- a/tests/hard_coded/deconstruct_arg.exp2
+++ b/tests/hard_coded/deconstruct_arg.exp2
@@ -264,6 +264,20 @@ deconstruct deconstruct: functor '\'' arity 0
  deconstruct limited deconstruct 3 of '\''
  functor '\'' arity 0 []

+deconstruct functor: '~'/0
+deconstruct argument 0 of '~' doesn't exist
+deconstruct argument 1 of '~' doesn't exist
+deconstruct argument 2 of '~' doesn't exist
+deconstruct argument 'moo' doesn't exist
+deconstruct argument 'mooo!' doesn't exist
+deconstruct argument 'packed1' doesn't exist
+deconstruct argument 'packed2' doesn't exist
+deconstruct argument 'packed3' doesn't exist
+deconstruct deconstruct: functor '~' arity 0
+[]
+deconstruct limited deconstruct 3 of '~'
+functor '~' arity 0 []
+
  deconstruct functor: '\001\'/0
  deconstruct argument 0 of '\001\' doesn't exist
  deconstruct argument 1 of '\001\' doesn't exist
@@ -306,6 +320,48 @@ deconstruct deconstruct: functor '\177\' arity 0
  deconstruct limited deconstruct 3 of '\177\'
  functor '\177\' arity 0 []

+deconstruct functor: '\200\'/0
+deconstruct argument 0 of '\200\' doesn't exist
+deconstruct argument 1 of '\200\' doesn't exist
+deconstruct argument 2 of '\200\' doesn't exist
+deconstruct argument 'moo' doesn't exist
+deconstruct argument 'mooo!' doesn't exist
+deconstruct argument 'packed1' doesn't exist
+deconstruct argument 'packed2' doesn't exist
+deconstruct argument 'packed3' doesn't exist
+deconstruct deconstruct: functor '\200\' arity 0
+[]
+deconstruct limited deconstruct 3 of '\200\'
+functor '\200\' arity 0 []
+
+deconstruct functor: '\237\'/0
+deconstruct argument 0 of '\237\' doesn't exist
+deconstruct argument 1 of '\237\' doesn't exist
+deconstruct argument 2 of '\237\' doesn't exist
+deconstruct argument 'moo' doesn't exist
+deconstruct argument 'mooo!' doesn't exist
+deconstruct argument 'packed1' doesn't exist
+deconstruct argument 'packed2' doesn't exist
+deconstruct argument 'packed3' doesn't exist
+deconstruct deconstruct: functor '\237\' arity 0
+[]
+deconstruct limited deconstruct 3 of '\237\'
+functor '\237\' arity 0 []
+
+deconstruct functor: ' '/0
+deconstruct argument 0 of ' ' doesn't exist
+deconstruct argument 1 of ' ' doesn't exist
+deconstruct argument 2 of ' ' doesn't exist
+deconstruct argument 'moo' doesn't exist
+deconstruct argument 'mooo!' doesn't exist
+deconstruct argument 'packed1' doesn't exist
+deconstruct argument 'packed2' doesn't exist
+deconstruct argument 'packed3' doesn't exist
+deconstruct deconstruct: functor ' ' arity 0
+[]
+deconstruct limited deconstruct 3 of ' '
+functor ' ' arity 0 []
+
  deconstruct functor: 'Ω'/0
  deconstruct argument 0 of 'Ω' doesn't exist
  deconstruct argument 1 of 'Ω' doesn't exist
diff --git a/tests/hard_coded/deconstruct_arg.m b/tests/hard_coded/deconstruct_arg.m
index 88dddd4..4e7f89f 100644
--- a/tests/hard_coded/deconstruct_arg.m
+++ b/tests/hard_coded/deconstruct_arg.m
@@ -130,11 +130,17 @@ main(!IO) :-
      test_all('\v', !IO),
      test_all('\\', !IO),
      test_all('\'', !IO),
+    test_all('~', !IO),

      % test C0 control characters
-    test_all('\1\', !IO),
-    test_all('\37\', !IO),
+    test_all('\001\', !IO),
+    test_all('\037\', !IO),
      test_all('\177\', !IO),
+    % test C1 control characters
+    test_all('\200\', !IO),
+    test_all('\237\', !IO),
+    % No-break space (next codepoint after C1 control characters)
+    test_all('\240\', !IO),

      % test a character that requires more than one byte in its
      % UTF-8 encoding.
diff --git a/tests/hard_coded/write.exp b/tests/hard_coded/write.exp
index f9e16ef..91faf21 100644
--- a/tests/hard_coded/write.exp
+++ b/tests/hard_coded/write.exp
@@ -29,8 +29,11 @@ TESTING BUILTINS
  "Foo%sFoo"
  "\""
  "\a\b\f\t\n\r\v\"\\"
+"\001\\037\\177\\200\\237\ "
  'a'
+'A'
  '&'
+'\001\'
  '\a'
  '\b'
  '\f'
@@ -38,9 +41,17 @@ TESTING BUILTINS
  '\n'
  '\r'
  '\v'
+'\037\'
+' '
  '\''
  '\\'
  '\"'
+'~'
+'\177\'
+'\200\'
+'\237\'
+' '
+0.0
  3.14159
  1.128324983e-21
  2.23954899e+23
diff --git a/tests/hard_coded/write.m b/tests/hard_coded/write.m
index 700f5ee..168e50c 100644
--- a/tests/hard_coded/write.m
+++ b/tests/hard_coded/write.m
@@ -12,6 +12,7 @@
  :- implementation.

  :- import_module array.
+:- import_module char.
  :- import_module float.
  :- import_module int.
  :- import_module list.
@@ -127,10 +128,14 @@ test_builtins(!IO) :-
      io.write_line("Foo%sFoo", !IO),
      io.write_line("""", !IO),    % interesting - prints """ of course
      io.write_line("\a\b\f\t\n\r\v\"\\", !IO),
+    io.write_line("\001\\037\\177\\200\\237\\240\", !IO),

      % Test characters.
      io.write_line('a', !IO),
+    io.write_line('A', !IO),
      io.write_line('&', !IO),
+
+    io.write_line('\001\', !IO), % Second C0 control.
      io.write_line('\a', !IO),
      io.write_line('\b', !IO),
      io.write_line('\f', !IO),
@@ -138,11 +143,21 @@ test_builtins(!IO) :-
      io.write_line('\n', !IO),
      io.write_line('\r', !IO),
      io.write_line('\v', !IO),
+    io.write_line('\037\', !IO), % Last C0 control.
+    io.write_line(' ', !IO),
+
      io.write_line('\'', !IO),
      io.write_line(('\\') : character, !IO),
      io.write_line('\"', !IO),

+    io.write_line('~', !IO),
+    io.write_line('\177\', !IO), % Delete.
+    io.write_line('\200\', !IO), % First C1 control.
+    io.write_line('\237\', !IO), % Last C1 control.
+    io.write_line('\240\', !IO), % No-break space.
+
      % Test floats.
+    io.write_line(0.0, !IO),
      io.write_line(3.14159, !IO),
      io.write_line(11.28324983E-22, !IO),
      io.write_line(22.3954899E22, !IO),

