[m-rev.] for review: Reduce memory allocation in string.to_upper, string.to_lower.

Peter Wang novalazy at gmail.com
Tue Jun 21 11:23:50 AEST 2016


On Mon, 20 Jun 2016 10:14:54 +0200 (CEST), "Zoltan Somogyi" <zoltan.somogyi at runbox.com> wrote:
> 
> 
> On Mon, 20 Jun 2016 14:13:13 +1000, Peter Wang <novalazy at gmail.com> wrote:
> > +:- pragma foreign_proc("C",
> > +    to_upper(StrIn::in, StrOut::uo),
> > +    [will_not_call_mercury, promise_pure, thread_safe, will_not_modify_trail,
> > +        does_not_affect_liveness, no_sharing],
> > +"
> > +    MR_Integer  i;
> > +
> > +    MR_make_aligned_string_copy_msg(StrOut, StrIn, MR_ALLOC_ID);
> > +
> > +    for (i = 0; StrOut[i] != '\\0'; i++) {
> > +        if (StrOut[i] >= 'a' && StrOut[i] <= 'z') {
> > +            StrOut[i] = StrOut[i] - 'a' + 'A';
> > +        }
> > +    }
> > +").
> 
> The test here (and in to_lower) will do the wrong thing on machines that use EBCDIC.
> While C# and Java won't use it, C might.

Hi Zoltan,

I think you are referring to the fact that letters are not in contiguous
ranges in EBCDIC.  We assume (when targeting C) that Mercury strings
are UTF-8 encoded, so the problem here should be that we are comparing
a UTF-8 code unit with a C character literal that may be compiled to
something other than its ASCII/Unicode value.

One solution would be to hard code the ASCII values (behind an enum)
instead of using C character literals.  The same solution would apply
to other predicates like is_all_alpha, etc.
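
A minimal sketch of the enum idea in plain C, outside any foreign_proc.
The enum constant names and the function name are hypothetical, not
existing Mercury runtime identifiers.  The point is only that the
comparisons use numeric ASCII values, so they do not depend on the
compiler's execution character set:

```c
#include <stddef.h>

/*
** Hypothetical names.  The ASCII code unit values are spelled out
** numerically instead of being written as 'a', 'z' and 'A', so they are
** the same no matter which execution character set the C compiler uses.
*/
enum {
    ASCII_LOWER_A = 0x61,
    ASCII_LOWER_Z = 0x7A,
    ASCII_UPPER_A = 0x41
};

/*
** Convert the ASCII lower case letters in a NUL-terminated UTF-8 string
** to upper case, in place.  All other code units are left untouched,
** including the bytes of multi-byte UTF-8 sequences (which are >= 0x80).
*/
static void
utf8_ascii_to_upper(char *s)
{
    size_t  i;

    for (i = 0; s[i] != 0; i++) {
        unsigned char c = (unsigned char) s[i];

        if (c >= ASCII_LOWER_A && c <= ASCII_LOWER_Z) {
            s[i] = (char) (c - ASCII_LOWER_A + ASCII_UPPER_A);
        }
    }
}
```

The loop is otherwise the same as in the patch; only the constants change.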

Incompatible character encodings also affect the use of C string
literals to hold Mercury strings.  We could write out strings as
octal/hexadecimal escape sequences of their UTF-8 code units.
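
As a rough illustration of the escape-sequence approach, here is a
hypothetical helper (not taken from the Mercury compiler or runtime)
that writes UTF-8 code units as a C string literal made up only of
three-digit octal escapes.  I use octal in the sketch because a C hex
escape has no length limit, so a following character that happens to be
a hex digit would be absorbed into it:

```c
#include <stddef.h>
#include <stdio.h>

/*
** Hypothetical helper.  Write the UTF-8 code units bytes[0 .. len-1]
** into buf as a C string literal consisting only of three-digit octal
** escapes.  Returns the number of characters written (not counting the
** terminating NUL), or -1 if buf is too small.
*/
static int
write_octal_literal(char *buf, size_t buf_size,
    const unsigned char *bytes, size_t len)
{
    size_t  pos = 0;
    size_t  i;

    /* Two quotes, four characters per code unit, and the NUL. */
    if (buf_size < 4 * len + 3) {
        return -1;
    }

    buf[pos++] = '"';
    for (i = 0; i < len; i++) {
        /* Each code unit becomes a backslash and three octal digits. */
        pos += (size_t) sprintf(buf + pos, "\\%03o", (unsigned) bytes[i]);
    }
    buf[pos++] = '"';
    buf[pos] = '\0';

    return (int) pos;
}
```

With this, the two code units 0x48 0x69 ("Hi" in ASCII) come out as the
literal "\110\151", whose value is the same whatever character set the C
compiler uses for ordinary literals.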

We also need to be careful anywhere that Mercury strings are passed
to/from system functions, similar to how we convert file paths to
UTF-16 on Windows.

We may need to convert encodings in FILE streams.

There may be compiler help available.
I noticed #pragma convert and #pragma convlit on some IBM pages.

We can document the assumption that character literals represent ASCII
values in to_upper, etc., but I think that is all we should do for now.

Peter

