[m-rev.] for review: Make MR_utf8_get return error for more ill-formed sequences.

Peter Wang novalazy at gmail.com
Thu Sep 12 16:54:22 AEST 2019


runtime/mercury_string.c:
    Make MR_utf8_get and MR_utf8_get_mb return an error (-2) for code
    unit sequences that encode a surrogate code point or a code point
    greater than U+10FFFF, in addition to overlong forms.

runtime/mercury_string.h:
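    Document that MR_utf8_next and MR_utf8_prev do not decode, so they
    may skip over bytes in ill-formed sequences.

    Describe the -2 result of MR_utf8_get as indicating an ill-formed
    code unit sequence.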
---
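
For reviewers, here is a minimal standalone sketch of the checks described
above. It is not the code in runtime/mercury_string.c: decode_utf8 and
is_surrogate are hypothetical names used only for illustration, and
is_surrogate is an assumed stand-in for MR_is_surrogate, whose definition is
not part of this diff. The return convention (0 at the NUL terminator, -2 for
an ill-formed sequence) follows the comment in mercury_string.h.

```c
#include <assert.h>
#include <stddef.h>

// Assumed stand-in for MR_is_surrogate: true for U+D800..U+DFFF.
static int is_surrogate(long c)
{
    return c >= 0xD800 && c <= 0xDFFF;
}

// Decode the code point starting at s[0] and store its length in bytes
// in *width. Returns 0 at the NUL terminator and -2 if the sequence is
// ill-formed (in which case *width is left unchanged).
static long decode_utf8(const unsigned char *s, int *width)
{
    long c;
    long minc;  // smallest code point that needs this many bytes
    int trail;  // number of trail bytes expected
    int i;

    assert(s != NULL && width != NULL);

    if (s[0] < 0x80) {
        *width = 1;
        return s[0];
    } else if ((s[0] & 0xE0) == 0xC0) {
        trail = 1; c = s[0] & 0x1F; minc = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {
        trail = 2; c = s[0] & 0x0F; minc = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {
        trail = 3; c = s[0] & 0x07; minc = 0x10000;
    } else {
        return -2;  // stray trail byte or invalid lead byte
    }

    for (i = 1; i <= trail; i++) {
        // A NUL here also fails the test, so we never read past the end.
        if ((s[i] & 0xC0) != 0x80) {
            return -2;
        }
        c = (c << 6) | (s[i] & 0x3F);
    }

    // Check for overlong forms or code point out of range.
    if (c < minc || c > 0x10FFFF) {
        return -2;
    }

    // Check for surrogate code points.
    if (is_surrogate(c)) {
        return -2;
    }

    *width = trail + 1;
    return c;
}
```

With this sketch, the bytes ED A0 80 (which would encode U+D800) and
F4 90 80 80 (which would encode U+110000) are both rejected, while
F4 8F BF BF still decodes to U+10FFFF.
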
 runtime/mercury_string.c | 12 +++++++-----
 runtime/mercury_string.h |  9 ++++++++-
 2 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/runtime/mercury_string.c b/runtime/mercury_string.c
index e17108143..b1753980b 100644
--- a/runtime/mercury_string.c
+++ b/runtime/mercury_string.c
@@ -1,7 +1,7 @@
 // vim: ts=4 sw=4 expandtab ft=c
 
 // Copyright (C) 2000-2002, 2006, 2011-2012 The University of Melbourne.
-// Copyright (C) 2015-2016, 2018 The Mercury team.
+// Copyright (C) 2015-2016, 2018-2019 The Mercury team.
 // This file is distributed under the terms specified in COPYING.LIB.
 
 // mercury_string.c - string handling
@@ -411,11 +411,13 @@ MR_utf8_get_mb(const MR_String s_, MR_Integer pos, int *width)
             break;
     }
 
-    // Check for overlong forms, which could be used to bypass security
-    // validations. We could also check code points aren't above U+10FFFF
-    // or in the surrogate ranges, but we don't.
+    // Check for overlong forms or code point out of range.
+    if (c < minc || c > 0x10FFFF) {
+        return -2;
+    }
 
-    if (c < minc) {
+    // Check for surrogate code points.
+    if (MR_is_surrogate(c)) {
         return -2;
     }
 
diff --git a/runtime/mercury_string.h b/runtime/mercury_string.h
index 3512284f0..d3cb00262 100644
--- a/runtime/mercury_string.h
+++ b/runtime/mercury_string.h
@@ -440,21 +440,28 @@ extern MR_bool MR_escape_string_quote(MR_String *ptr,
 #define MR_utf8_is_lead_byte(c)     (((unsigned) (c) - 0xC0) < 0x3E)
 #define MR_utf8_is_trail_byte(c)    (((unsigned) (c) & 0xC0) == 0x80)
 
+// XXX ILSEQ The following functions should be rethought to make dealing
+// with ill-formed code unit sequences easier.
+
 // Advance `*pos' to the beginning of the next code point in `s'.
 // If `*pos' is already at the end of the string, return MR_FALSE
 // without modifying `*pos'.
+// This function simply searches for a single or lead byte without decoding
+// so may skip over bytes in ill-formed sequences.
 
 extern MR_bool          MR_utf8_next(const MR_String s_, MR_Integer *pos);
 
 // Rewind `*pos' to the beginning of the previous code point in `s'.
 // If `*pos' is already at the beginning of the string, return MR_FALSE
 // without modifying `*pos'.
+// This function simply searches for a single or lead byte without decoding
+// so may skip over bytes in ill-formed sequences.
 
 extern MR_bool          MR_utf8_prev(const MR_String s_, MR_Integer *pos);
 
 // Decode and return the code point beginning at `pos' in `s'.
 // Return 0 if at the end of the string (i.e. the NUL terminator).
-// If an illegal code sequence exists at that offset, return -2.
+// Return -2 if the code unit sequence beginning at that offset is ill-formed.
 //
 // The _mb version requires s[pos] to be the lead byte of a multibyte code
 // point.
-- 
2.23.0
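
To illustrate the new comment on MR_utf8_next, here is a small sketch of a
scan that stops at a single or lead byte without decoding. It is written from
the comment alone, so treat it as an assumption about the behaviour rather
than a copy of the runtime code; next_pos is a hypothetical name. The lead
byte test is the MR_utf8_is_lead_byte macro from the diff context, and the
single byte test (value below 0x80) is assumed.

```c
#include <assert.h>
#include <stddef.h>

// Assumed single byte test: ASCII range.
static int is_single_byte(unsigned c)
{
    return c < 0x80;
}

// Same expression as the MR_utf8_is_lead_byte macro in the diff context.
static int is_lead_byte(unsigned c)
{
    return (c - 0xC0) < 0x3E;
}

// Hypothetical analogue of MR_utf8_next: advance *pos to the next single
// or lead byte (or to the NUL terminator). Returns 0 without modifying
// *pos if it is already at the end of the string.
static int next_pos(const char *s_, long *pos)
{
    const unsigned char *s = (const unsigned char *) s_;

    assert(s != NULL && pos != NULL);

    if (s[*pos] == '\0') {
        return 0;
    }
    for (;;) {
        unsigned c;

        (*pos)++;
        c = s[*pos];
        if (c == '\0' || is_single_byte(c) || is_lead_byte(c)) {
            break;
        }
    }
    return 1;
}
```

Given the bytes 41 80 80 42, a call starting at offset 0 lands on offset 3:
the two stray trail bytes are skipped silently instead of being reported as
ill-formed, which is the behaviour the comment warns about.
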


