[m-rev.] for review: Make MR_utf8_get return error for more ill-formed sequences.
Peter Wang
novalazy at gmail.com
Thu Sep 12 16:54:22 AEST 2019
runtime/mercury_string.c:
Make MR_utf8_get and MR_utf8_get_mb return an error (-2) for code
unit sequences that encode a surrogate code point (U+D800..U+DFFF)
or a code point greater than U+10FFFF.
runtime/mercury_string.h:
Adjust some comments.
---
runtime/mercury_string.c | 12 +++++++-----
runtime/mercury_string.h | 9 ++++++++-
2 files changed, 15 insertions(+), 6 deletions(-)
diff --git a/runtime/mercury_string.c b/runtime/mercury_string.c
index e17108143..b1753980b 100644
--- a/runtime/mercury_string.c
+++ b/runtime/mercury_string.c
@@ -1,7 +1,7 @@
// vim: ts=4 sw=4 expandtab ft=c
// Copyright (C) 2000-2002, 2006, 2011-2012 The University of Melbourne.
-// Copyright (C) 2015-2016, 2018 The Mercury team.
+// Copyright (C) 2015-2016, 2018-2019 The Mercury team.
// This file is distributed under the terms specified in COPYING.LIB.
// mercury_string.c - string handling
@@ -411,11 +411,13 @@ MR_utf8_get_mb(const MR_String s_, MR_Integer pos, int *width)
break;
}
- // Check for overlong forms, which could be used to bypass security
- // validations. We could also check code points aren't above U+10FFFF
- // or in the surrogate ranges, but we don't.
+ // Check for overlong forms or code point out of range.
+ if (c < minc || c > 0x10FFFF) {
+ return -2;
+ }
- if (c < minc) {
+ // Check for surrogate code points.
+ if (MR_is_surrogate(c)) {
return -2;
}
diff --git a/runtime/mercury_string.h b/runtime/mercury_string.h
index 3512284f0..d3cb00262 100644
--- a/runtime/mercury_string.h
+++ b/runtime/mercury_string.h
@@ -440,21 +440,28 @@ extern MR_bool MR_escape_string_quote(MR_String *ptr,
#define MR_utf8_is_lead_byte(c) (((unsigned) (c) - 0xC0) < 0x3E)
#define MR_utf8_is_trail_byte(c) (((unsigned) (c) & 0xC0) == 0x80)
+// XXX ILSEQ The following functions should be rethought to make dealing
+// with ill-formed code unit sequences easier.
+
// Advance `*pos' to the beginning of the next code point in `s'.
// If `*pos' is already at the end of the string, return MR_FALSE
// without modifying `*pos'.
+// This function simply searches for a single or lead byte without decoding
+// so may skip over bytes in ill-formed sequences.
extern MR_bool MR_utf8_next(const MR_String s_, MR_Integer *pos);
// Rewind `*pos' to the beginning of the previous code point in `s'.
// If `*pos' is already at the beginning of the string, return MR_FALSE
// without modifying `*pos'.
+// This function simply searches for a single or lead byte without decoding
+// so may skip over bytes in ill-formed sequences.
extern MR_bool MR_utf8_prev(const MR_String s_, MR_Integer *pos);
// Decode and return the code point beginning at `pos' in `s'.
// Return 0 if at the end of the string (i.e. the NUL terminator).
-// If an illegal code sequence exists at that offset, return -2.
+// Return -2 if the code unit sequence beginning at that offset is ill-formed.
//
// The _mb version requires s[pos] to be the lead byte of a multibyte code
// point.
--
2.23.0
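
For readers who want to try the new checks outside the runtime, here is a
minimal standalone sketch of a decoder that applies the same three tests as
the patch (overlong form, code point above U+10FFFF, surrogate). It is not
the code in mercury_string.c: the names sketch_is_surrogate and
sketch_utf8_get, the use of plain C types instead of MR_String and
MR_Integer, and the choice to set *width to 0 on error or at the terminator
are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical stand-in for MR_is_surrogate: true for U+D800..U+DFFF.
int sketch_is_surrogate(long c)
{
    return c >= 0xD800 && c <= 0xDFFF;
}

// Hypothetical decoder, loosely modelled on the documented behaviour of
// MR_utf8_get: decode the code point starting at s[pos] in the
// NUL-terminated string s. Returns 0 at the terminator and -2 if the code
// unit sequence is ill-formed. *width receives the number of bytes
// consumed (0 on error or at the terminator; an assumption of this sketch).
long sketch_utf8_get(const char *s, size_t pos, int *width)
{
    const unsigned char *p;
    long c;
    long minc;
    int trail;
    int i;

    assert(s != NULL && width != NULL);
    p = (const unsigned char *) s + pos;
    c = p[0];
    *width = 0;

    if (c < 0x80) {
        // Single byte (ASCII), or the NUL terminator.
        *width = (c == 0) ? 0 : 1;
        return c;
    } else if (c >= 0xC0 && c <= 0xDF) {
        trail = 1; minc = 0x80; c &= 0x1F;
    } else if (c >= 0xE0 && c <= 0xEF) {
        trail = 2; minc = 0x800; c &= 0x0F;
    } else if (c >= 0xF0 && c <= 0xF7) {
        trail = 3; minc = 0x10000; c &= 0x07;
    } else {
        // Stray trail byte or a byte that can never start a sequence.
        return -2;
    }

    for (i = 1; i <= trail; i++) {
        // A NUL is not a trail byte, so this also stops at the terminator.
        if ((p[i] & 0xC0) != 0x80) {
            return -2;
        }
        c = (c << 6) | (p[i] & 0x3F);
    }

    // Check for overlong forms or code point out of range.
    if (c < minc || c > 0x10FFFF) {
        return -2;
    }
    // Check for surrogate code points.
    if (sketch_is_surrogate(c)) {
        return -2;
    }

    *width = trail + 1;
    return c;
}
```

With this sketch, the byte sequence ED A0 80 (which would encode U+D800) and
F4 90 80 80 (which would encode U+110000) are both rejected, while
F4 8F BF BF still decodes to U+10FFFF.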