[mercury-users] Advice with UNICODE string type
Richard A. O'Keefe
ok at hermes.otago.ac.nz
Thu Jul 15 12:42:52 AEST 1999
> I need a little advice about the best way to represent a UNICODE
> (wide) string type in Mercury. The constraints are as follows:
> - The wstring type should be fairly easy to construct from
>   Mercury strings (though a function would be fine)
> - The wstring type should transfer across the Mercury-C
>   interface simply. It would be nice if C would see it as a true
>   UNICODE string, though I'd be happy with a little massaging on
>   the C side too.
How exactly do you propose to represent Unicode strings in C?
That's perhaps the most important question, as there is currently
(C89) *no* standard way to do that. I don't know what's happened
in C9x this year, but there wasn't anything in last year's draft
that suited either.
Yes, there _is_ wchar_t, but what _that_ means is up to the compiler
vendor. wchar_t could be
    8 bits (very common in older ANSI C compilers)
    32 bits (as it is on Solaris, for example)
but is not terribly likely to be
    16 bits (which Unicode wants).
In fact the Unicode book itself goes to the trouble of defining
    typedef unsigned long  UCS4;
    typedef unsigned short UCS2;
    typedef unsigned short UTF16;
    typedef unsigned char  UTF8;
where UCS4 is what you'd want for ISO 10646 and UCS2 is what you'd
want for the Unicode subset.
> This whole requirement comes about because we aggressively
> localise all our products and all our new technologies have pure
> UNICODE interfaces.
What do your Unicode-in-C libraries look like?
> I started naively with a list of ints, but I'm not too happy with this.
Whyever not? That's what Prolog always used for strings, and it worked
brilliantly. Indexing into a Unicode string (thanks to floating
diacriticals and zero width characters) is pretty useless; almost anything
you might want to *do* with a Unicode string in Mercury would be most
easily done using DCG rules, which by design fit lists perfectly.
> I suspect that if I wanted to do this properly (i.e. localise
> the Mercury code too), and Mercury was not likely to get real
> wide character/string types in the near future, I would use
> UTF-8 on the Mercury side and write UTF/UNICODE translation into
> wrappers on the C side.
If all you want to do is have Mercury accept a string from C and then
hand it back, that's fine. But what processing, if any, do you want
the Mercury code to do with these strings?
And of course there are the obvious questions about how many distinct
strings you need to deal with and what volume of text is involved.