[m-dev.] Unicode support in Mercury
matt.giuca at gmail.com
Mon May 9 17:42:53 AEST 2011
> In my experience, `length' is mainly used for two things: checking that
> an index is within range, and getting the end point when iterating over
> a string. In both cases it doesn't matter what `length' returns, but it
> must be compatible with the indexing predicates.
Right, but as I said above, the indexing predicates should deal with
characters too. So if an int is less than the length, it is valid to
pass it to index.
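For comparison, in a language whose indexing works in code points, every integer below the length is a valid index. A quick Python sketch (Python 3 strings are sequences of code points; this is purely illustrative, not Mercury API):

```python
s = "日本語"

# Every index below the length is valid; each one yields a character.
assert len(s) == 3
assert [s[i] for i in range(len(s))] == ['日', '本', '語']
```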
> I have little use for knowing how many code points are in a string.
> In the compiler, `count_codepoints' is not used. In the standard
> library, apart from implementing `by_codepoint' predicates, it is used
> for aligning strings badly (making the incorrect assumption that all
> code points take the same width when printed).
Well, it isn't surprising that count_codepoints isn't used much, as it
has only just been added to the standard library. What are the uses for
string.length (which counts code units)? I can't imagine many ways in
which that would be useful, since it doesn't tell you anything about
the conceptual list of characters, only about the underlying
representation.
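The difference between the two counts is easy to demonstrate. In Python terms (again illustrative only; the encodings stand in for Mercury's UTF-8 and UTF-16 backends):

```python
s = "日本語"        # three code points
astral = "a😀"      # '😀' lies outside the BMP

# Code unit counts depend on the encoding; the code point count does not.
utf8_units = len(s.encode('utf-8'))                  # 3 bytes per character here
utf16_units = len(astral.encode('utf-16-le')) // 2   # '😀' needs a surrogate pair

assert len(s) == 3 and utf8_units == 9
assert len(astral) == 2 and utf16_units == 3
```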
> > index_det("日本語", 1) = '日', index_det("日本語", 2) = '日',
> These are invalid, not so different to index_det("日本語", -1) or
> index_det("日本語", 100).
OK, well that indicates a problem with string.index: for a contiguous
block of integers, it is illegal to access some of them. I need a
detailed understanding of the underlying string representation in
order to make use of that predicate, whereas string.codepoint_offset
is more generally useful and returns the same results across
implementations.
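The behaviour described for string.codepoint_offset, mapping a code point index to a code unit offset, can be sketched as follows. This is an illustrative Python model of the idea, assuming a UTF-8 representation, not Mercury's implementation:

```python
def codepoint_offset(s, n):
    """Code unit (byte) offset of the n-th code point in the UTF-8
    encoding of s. O(n), since UTF-8 is variable-width."""
    return len(s[:n].encode('utf-8'))

assert codepoint_offset("日本語", 0) == 0
assert codepoint_offset("日本語", 1) == 3   # '日' occupies 3 bytes
assert codepoint_offset("abc日", 3) == 3    # ASCII is 1 byte each
```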
If you do keep those semantics, the above point is missing from the
documentation. It states that string.index "Fails if `Index' is out of
range (negative, or greater than or equal to the length of `String')."
It needs to also say ", or does not represent the first code unit of a
code point".
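An offset that falls inside a multi-byte character really is meaningless under a UTF-8 representation. For instance (Python used only to show the encoding):

```python
b = "日本語".encode('utf-8')   # 9 code units; '日' is E6 97 A5

# Offset 1 lands on a continuation byte, mid-way through '日':
# decoding from there cannot succeed.
try:
    b[1:].decode('utf-8')
    hit_error = False
except UnicodeDecodeError:
    hit_error = True
assert hit_error
```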
> You have to use string.codepoint_offset. Consider it a tax on
> inefficient code.
Ah, I see. I hadn't found that before. I would suggest renaming or
aliasing it to string.index_by_codepoint for consistency with the
other by_codepoint predicates, and adding _det predicate and function
versions for consistency with index.
> Yes, performance is what it comes down to. But IMHO working with code
> unit offsets is not much harder than working with code point offsets.
OK, well performance is rather worse when dealing with codepoints in
UTF-8 or (correct) UTF-16 implementations, so I suppose that justifies
these defaults. Other languages (e.g., Python's UCS-4 build) get around it by
having UTF-32 strings, and then the codepoint access functions are
efficient. I'm guessing you wouldn't be happy with that though.
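With a fixed-width (UTF-32-style) representation, every code point occupies one code unit, so code point access is a constant-time array lookup. A sketch of that trade-off, assuming a plain array of code points:

```python
import array

s = "日本語x"
# Fixed-width representation: one 32-bit unit per code point.
u32 = array.array('I', (ord(c) for c in s))

assert chr(u32[0]) == '日'   # O(1) access at any code point index
assert chr(u32[3]) == 'x'
# The cost: at least 4 bytes per character, even for pure ASCII.
assert len(u32) * u32.itemsize >= 4 * len(s)
```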
> I find it helps to think more abstractly of the offset arguments.
> The set of valid offsets for a string is not all of [0, length(Str)-1],
> but some subset. 0 is the still the start of the string. length(Str)
> is still the end. The next offset after N is _not_ N + 1, but
> `index_next' gives you the value anyway. The actual value is
> You can still do useful arithmetic with the offsets, like subtracting
> offsets to get the length of the substring. If you know you are dealing
> with an ASCII code point (extremely common), you can still add 1.
I think that last point is misleading. ASCII is extremely common in
practice, but strings which will only ever contain ASCII are extremely
rare. Most string variables in most programs might contain non-ASCII
characters but usually don't, which invites bugs such as adding 1 to
an index: that works in almost all cases but breaks as soon as
non-ASCII characters are involved. I think that is a very good example
of why strings with non-contiguous valid indices are a bad idea: they
invite programmers to use N + 1, and it will tend to work, until it
doesn't.
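The failure mode is easy to reproduce. A sketch in Python, scanning UTF-8 bytes with the naive `N + 1` step (hypothetical code, mimicking the buggy lexer pattern described below):

```python
b = "naïve".encode('utf-8')   # 6 bytes: 'ï' occupies two

# Naive scan: assumes every character is one code unit wide.
chars = []
i = 0
while i < len(b):
    chars.append(chr(b[i]))   # treats each byte as a character
    i += 1                    # the N + 1 step: fine for ASCII only

# Pure ASCII would round-trip; 'ï' comes out as two wrong characters.
assert ''.join(chars) == "naÃ¯ve"
assert ''.join(chars) != "naïve"
```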
On Mon, May 9, 2011 at 5:12 PM, Mark Brown <mark at csse.unimelb.edu.au> wrote:
> They may simply wish to know where the bytes start in memory - that's
> relevant to a lexer even though it doesn't care about the byte sequence,
> since it wants to peek at or read the next character quickly.
Surely, in a truly abstract interface over strings, the lexer wouldn't
care where bytes start in memory. In a high-level language, I don't
want to deal with the binary representation of strings; if I did, I
wouldn't be using a Unicode-aware string library.
I feel like languages have two choices: either provide an 8-bit-clean
string type (e.g., C, Lua, Go, PHP, Ruby), or provide an abstract
Unicode string type where the user doesn't need to be aware of the
representation (e.g., Java, Python). I feel like Mercury is now fairly
muddy in this choice: it claims to provide an abstract Unicode string
type, but still forces the user to deal with the bytes, so it's no
better than the languages in the first category.
> My experience with this is that the change broke the G12 lexer, which
> was always incrementing the index by 1 (an operation that is now
> meaningless, as you say) instead of by the number of bytes in the
> character. This results in an illegal character abort for unicode
> input. While the conceptually-nice use of code-point indexes would
> have meant the lexer would continue to work correctly without
> modifications, its performance would suck even for ascii. That
> would be a costlier problem to fix than the abort.
Right, so performance is again the issue. It wouldn't be a problem if
the by-codepoint index predicate were O(1), which it would be if the
underlying representation were UTF-32; then the change wouldn't have
broken any code.
Anyway, I see that the performance problems won't be resolved so you
may as well keep the operations defaulting to the code-unit ones.
Thanks for clarifying those points.
mercury-developers mailing list
Post messages to: mercury-developers at csse.unimelb.edu.au
Administrative Queries: owner-mercury-developers at csse.unimelb.edu.au
Subscriptions: mercury-developers-request at csse.unimelb.edu.au
More information about the mercury-developers mailing list