[mercury-users] Advice with UNICODE string type

Richard A. O'Keefe ok at hermes.otago.ac.nz
Thu Jul 15 12:42:52 AEST 1999

	I need a little advice about the best way to represent a UNICODE
	(wide) string type in Mercury.  The constraints are as follows:
	- The wstring type should be fairly easy to construct from
	Mercury strings (though a function would be fine)
	- The wstring type should transfer across the Mercury-C
	interface simply.  It would be nice if C would see it as a true
	UNICODE string, though I'd be happy with a little massaging on
	the C side too.

How exactly do you propose to represent Unicode strings in C?
That's perhaps the most important question, as there is currently
(C89) *no* standard way to do that.  I don't know what's happened
in C9x this year, but there wasn't anything in last year's draft
that suited either.

Yes, there _is_ wchar_t, but what _that_ means is up to the compiler
vendor.  wchar_t could be
    8 bits (very common in older ANSI C compilers)
   32 bits (as it is on Solaris, for example)
but is not terribly likely to be
   16 bits (which Unicode wants).
In fact the Unicode book itself goes to the trouble of defining

typedef unsigned long  UCS4;
typedef unsigned short UCS2;
typedef unsigned short UTF16;
typedef unsigned char  UTF8;

where UCS4 is what you'd want for ISO 10646 and UCS2 is what you'd
want for the Unicode subset.
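
Since the width of wchar_t is left to the compiler vendor, it can be worth checking it on each target before relying on it for Unicode. The following is only a minimal sketch; the function names wchar_bits and wchar_is_16_bits are made up for illustration and do not come from any library.

```c
#include <limits.h>
#include <stddef.h>

/* Width of wchar_t in bits on the current compiler and platform. */
int wchar_bits(void)
{
    return (int)(sizeof(wchar_t) * CHAR_BIT);
}

/* Nonzero if wchar_t happens to be exactly 16 bits wide here. */
int wchar_is_16_bits(void)
{
    return wchar_bits() == 16;
}
```

If wchar_is_16_bits() returns zero on a platform you care about, explicit typedefs like the ones above are probably the safer choice.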

	This whole requirement comes about because we aggressively
	localise all our products and all our new technologies have pure
	UNICODE interfaces.

What do your Unicode-in-C libraries look like?

	I started naively with a list of ints, but I'm not too happy with this.

Whyever not?  That's what Prolog always used for strings, and it worked
brilliantly.  Indexing into a Unicode string (thanks to floating
diacriticals and zero width characters) is pretty useless; almost anything
you might want to *do* with a Unicode string in Mercury would be most
easily done using DCG rules, which by design fit lists perfectly.
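
To illustrate why indexing is of limited use, here is a small C sketch in which the same visible character occupies either one or two code points. It assumes, purely for illustration, that only U+0300..U+036F (the Combining Diacritical Marks block) counts as combining; real text would need the full Unicode character database. The names is_combining and count_base_chars are hypothetical.

```c
#include <stddef.h>

typedef unsigned long UCS4;   /* same typedef as quoted from the Unicode book */

/* Assumption: treat only U+0300..U+036F as combining marks. */
static int is_combining(UCS4 c)
{
    return c >= 0x0300 && c <= 0x036F;
}

/* Count the code points that are not combining marks, i.e. those
 * that start a new visible character under the assumption above. */
size_t count_base_chars(const UCS4 *s, size_t len)
{
    size_t i, n = 0;

    for (i = 0; i < len; i++) {
        if (!is_combining(s[i])) {
            n++;
        }
    }
    return n;
}
```

An e with an acute accent can be stored as the single code point U+00E9 or as U+0065 followed by U+0301. Both count as one base character here, even though the arrays have different lengths, so position 1 means different things in the two strings.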

	I suspect that if I wanted to do this properly (i.e. localise
	the Mercury code too), and Mercury was not likely to get real
	wide character/string types in the near future, I would use
	UTF-8 on the Mercury side and write UTF/UNICODE translation into
	wrappers on the C side.
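
A C-side wrapper of the kind described might decode UTF-8 into an array of code points roughly as follows. This is a sketch under stated assumptions, not anyone's actual interface: the name utf8_to_ucs4 and its signature are invented for illustration, it accepts only sequences of up to four bytes (code points up to U+10FFFF), and it expects NUL-terminated input.

```c
#include <stddef.h>

typedef unsigned long UCS4;   /* same typedef as quoted from the Unicode book */

/* Decode a NUL-terminated UTF-8 string into UCS4 code points.
 * Returns the number of code points written, or (size_t)-1 if the
 * input is malformed or does not fit in `cap` elements. */
size_t utf8_to_ucs4(const unsigned char *src, UCS4 *dst, size_t cap)
{
    size_t n = 0;

    while (*src != 0) {
        unsigned char lead = *src++;
        UCS4 cp, min;
        int extra;

        if (lead < 0x80) {
            cp = lead;        extra = 0; min = 0;
        } else if ((lead & 0xE0) == 0xC0) {
            cp = lead & 0x1F; extra = 1; min = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            cp = lead & 0x0F; extra = 2; min = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            cp = lead & 0x07; extra = 3; min = 0x10000UL;
        } else {
            return (size_t)-1;   /* stray continuation or invalid lead byte */
        }

        while (extra-- > 0) {
            if ((*src & 0xC0) != 0x80) {
                return (size_t)-1;   /* also catches a premature NUL */
            }
            cp = (cp << 6) | (UCS4)(*src++ & 0x3F);
        }

        /* Reject overlong forms, out-of-range values, and surrogates. */
        if (cp < min || cp > 0x10FFFFUL || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return (size_t)-1;
        }
        if (n >= cap) {
            return (size_t)-1;
        }
        dst[n++] = cp;
    }
    return n;
}
```

A library that wants 16-bit units would need a further step to narrow each UCS4 value to UCS2, or to encode it as UTF-16. That step is omitted here.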

If all you want to do is have Mercury accept a string from C and then
hand it back, that's fine.  But what processing, if any, do you want
the Mercury code to do with these strings?

And of course there are the obvious questions about how many distinct
strings you need to deal with and what volume of text is involved.