[m-rev.] for review: Reduce memory allocation in string.to_upper, string.to_lower.
Peter Wang
novalazy at gmail.com
Tue Jun 21 11:23:50 AEST 2016
On Mon, 20 Jun 2016 10:14:54 +0200 (CEST), "Zoltan Somogyi" <zoltan.somogyi at runbox.com> wrote:
>
>
> On Mon, 20 Jun 2016 14:13:13 +1000, Peter Wang <novalazy at gmail.com> wrote:
> > +:- pragma foreign_proc("C",
> > + to_upper(StrIn::in, StrOut::uo),
> > + [will_not_call_mercury, promise_pure, thread_safe, will_not_modify_trail,
> > + does_not_affect_liveness, no_sharing],
> > +"
> > + MR_Integer i;
> > +
> > + MR_make_aligned_string_copy_msg(StrOut, StrIn, MR_ALLOC_ID);
> > +
> > + for (i = 0; StrOut[i] != '\\0'; i++) {
> > + if (StrOut[i] >= 'a' && StrOut[i] <= 'z') {
> > + StrOut[i] = StrOut[i] - 'a' + 'A';
> > + }
> > + }
> > +").
>
> The test here (and in to_lower) will do the wrong thing on machines that use EBCDIC.
> While C# and Java won't use it, C might.
Hi Zoltan,
I think you are referring to the fact that letters are not in contiguous
ranges in EBCDIC. We assume (when targeting C) that Mercury strings
are UTF-8 encoded, so the problem here should be that we are comparing
a UTF-8 code unit with a C character literal that may be compiled to
something other than its ASCII/Unicode value.
One solution would be to hard code the ASCII values (behind an enum)
instead of using C character literals. The same solution would apply
to other predicates like is_all_alpha, etc.
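
A minimal sketch of the enum idea, written as standalone C rather than as
the foreign_proc body. The enum constants and the function name
utf8_ascii_to_upper are hypothetical and do not exist in the Mercury
runtime. The loop only ever compares code units against numeric ASCII
values, so it does not depend on how the C compiler encodes 'a' or 'z':

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names: ASCII values spelled out numerically so that the
 * comparisons do not depend on the compiler's execution character set. */
enum {
    ASCII_UPPER_A = 0x41,
    ASCII_LOWER_A = 0x61,
    ASCII_LOWER_Z = 0x7A
};

/* Convert ASCII lower-case letters in a NUL-terminated UTF-8 buffer to
 * upper case, in place. Code units >= 0x80 (parts of multi-byte
 * sequences) are left untouched. */
static void
utf8_ascii_to_upper(char *s)
{
    size_t i;

    for (i = 0; s[i] != 0; i++) {
        unsigned char c = (unsigned char) s[i];
        if (c >= ASCII_LOWER_A && c <= ASCII_LOWER_Z) {
            s[i] = (char) (c - ASCII_LOWER_A + ASCII_UPPER_A);
        }
    }
}
```

The terminator test uses the integer 0 instead of a character literal,
which is the same value in any C character set.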
Incompatible character encodings also affect the use of C string
literals to hold Mercury strings. We could write out strings as
octal/hexadecimal escape sequences of their UTF-8 code units.
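
As an illustration only, here is a sketch of what emitting such a literal
could look like. The helper name escape_utf8_as_octal and its interface
are assumptions made for this example, not what the Mercury compiler
does. Every code unit is written as a three-digit octal escape, so the
resulting literal has the same byte values whatever character set the C
compiler uses for ordinary characters:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: write the UTF-8 code units of `utf8` into `out`
 * as a double-quoted C string literal made only of three-digit octal
 * escapes. Returns the length written (excluding the NUL terminator),
 * or -1 if `out` is too small. */
static int
escape_utf8_as_octal(const char *utf8, char *out, size_t out_size)
{
    size_t pos = 0;
    size_t i;

    /* Need room for at least the two quotes and the terminator. */
    if (out_size < 3) {
        return -1;
    }
    out[pos++] = '"';
    for (i = 0; utf8[i] != 0; i++) {
        unsigned char c = (unsigned char) utf8[i];

        /* 4 chars for this escape, plus closing quote and terminator. */
        if (pos + 6 > out_size) {
            return -1;
        }
        out[pos++] = '\\';
        /* '0'..'7' are contiguous in every C character set. */
        out[pos++] = (char) ('0' + ((c >> 6) & 7));
        out[pos++] = (char) ('0' + ((c >> 3) & 7));
        out[pos++] = (char) ('0' + (c & 7));
    }
    out[pos++] = '"';
    out[pos] = '\0';
    return (int) pos;
}
```

Octal is used here rather than hexadecimal because an octal escape stops
after three digits, whereas a hexadecimal escape consumes every hex digit
that follows it.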
We also need to be careful anywhere that Mercury strings are passed
to/from system functions, similar to how we convert file paths to
UTF-16 on Windows.
We may need to convert encodings in FILE streams.
There may be compiler help available.
I noticed #pragma convert and #pragma convlit on some IBM pages.
We can document the assumption that character literals represent ASCII
values in to_upper, etc., but I think that is all we should do for now.
Peter