[mercury-users] Advice with UNICODE string type

Luke Evans Luke.Evans at seagatesoftware.com
Sat Jul 17 03:14:32 AEST 1999


Actually, what I may not have made clear is that my primary reason at this
time for doing this work is SIMPLY to get Mercury to HOLD and trivially
manipulate 16-bit character strings, with the sole purpose of passing these
to a UNICODE interface exposed by one of our standard DLLs, which the Mercury
app consumes.

In terms of the actual character sets, it's the DLL, or the code it calls,
that is UNICODE.  On Windows NT this is easy.  On other versions of Windows,
we have implemented character set translation to Shift-JIS (DBCS), and we
have other character set translators on our UNIX implementations.

I've now implemented a wstring module which simply holds on to and
manipulates (via embedded C) strings of 16-bit characters.  It works - I'm
happy.
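
For anyone wanting to do something similar, here is a minimal sketch of the
kind of C helper such a module might wrap.  It is not the actual wstring
code: the names wstr_char, wstr_length and wstr_from_narrow are made up for
illustration, unsigned short is assumed to be 16 bits wide, the input is
assumed to be ASCII or Latin-1, and plain malloc is used where a real
Mercury binding would probably want Mercury's own allocator.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical 16-bit character type; assumes unsigned short is 16 bits. */
typedef unsigned short wstr_char;

/* Number of 16-bit units before the terminating zero unit. */
size_t wstr_length(const wstr_char *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Widen a NUL-terminated 8-bit string (ASCII or Latin-1 assumed) into a
   freshly allocated, zero-terminated 16-bit string.
   Returns NULL if allocation fails; the caller frees the result. */
wstr_char *wstr_from_narrow(const char *s)
{
    size_t i, n = 0;
    wstr_char *w;

    while (s[n] != '\0')
        n++;

    w = malloc((n + 1) * sizeof *w);
    if (w == NULL)
        return NULL;

    /* Go through unsigned char so bytes above 0x7F are not sign-extended. */
    for (i = 0; i < n; i++)
        w[i] = (wstr_char)(unsigned char)s[i];
    w[n] = 0;
    return w;
}
```

The resulting pointer can be handed to a 16-bit UNICODE interface as long as
that interface also expects zero-terminated strings.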

Lwe


-----Original Message-----
From: Richard A. O'Keefe [mailto:ok at hermes.otago.ac.nz]
Sent: 14 July 1999 19:43
To: mercury-users at cs.mu.OZ.AU
Subject: Re: [mercury-users] Advice with UNICODE string type


	I need a little advice about the best way to represent a UNICODE
	(wide) string type in Mercury.  The constraints are as follows:
	- The wstring type should be fairly easy to construct from
	Mercury strings (though a function would be fine)
	- The wstring type should transfer across the Mercury-C
	interface simply.  It would be nice if C would see it as a true
	UNICODE string, though I'd be happy with a little massaging on
	the C side too.
	
How exactly do you propose to represent Unicode strings in C?
That's perhaps the most important question, as there is currently
(C89) *no* standard way to do that.  I don't know what's happened
in C9x this year, but there wasn't anything in last year's draft
that suited either.

Yes, there _is_ wchar_t, but what _that_ means is up to the compiler
vendor.  wchar_t could be
    8 bits (very common in older ANSI C compilers)
   32 bits (as it is on Solaris, for example)
but is not terribly likely to be
   16 bits (which Unicode wants).
In fact the Unicode book itself goes to the trouble of defining

typedef unsigned long  UCS4;
typedef unsigned short UCS2;
typedef unsigned short UTF16;
typedef unsigned char  UTF8;

where UCS4 is what you'd want for ISO 10646 and UCS2 is what you'd
want for the Unicode subset.
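
Since the width of wchar_t is up to the vendor, it is worth checking on each
target compiler rather than assuming.  The following is a small sketch of
such a check; the BITS_OF macro and the function names are invented for
illustration, and the typedefs repeat two of those quoted above.  Note that
C only guarantees unsigned short to be at least 16 bits and unsigned long to
be at least 32, so these can be wider on some platforms.

```c
#include <limits.h>   /* CHAR_BIT */
#include <stddef.h>   /* wchar_t */

typedef unsigned long  UCS4;
typedef unsigned short UCS2;

/* Width of a type in bits on the current compiler (hypothetical helper). */
#define BITS_OF(T) ((unsigned)(sizeof(T) * CHAR_BIT))

unsigned wchar_bits(void) { return BITS_OF(wchar_t); }
unsigned ucs2_bits(void)  { return BITS_OF(UCS2); }
unsigned ucs4_bits(void)  { return BITS_OF(UCS4); }

/* Nonzero only if wchar_t is exactly 16 bits wide here, which is the
   case where a wchar_t string could be passed unchanged to an interface
   expecting 16-bit units. */
int wchar_is_16_bit(void)
{
    return wchar_bits() == 16;
}
```

If wchar_is_16_bit() returns zero on some platform, the strings need copying
into a UCS2 buffer before being handed to a 16-bit interface.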

	This whole requirement comes about because we aggressively
	localise all our products and all our new technologies have pure
	UNICODE interfaces.

What do your Unicode-in-C libraries look like?

	I started naively with a list of ints, but I'm not too happy with
	this.
	
Whyever not?  That's what Prolog always used for strings, and it worked
brilliantly.  Indexing into a Unicode string (thanks to floating
diacriticals and zero width characters) is pretty useless; almost anything
you might want to *do* with a Unicode string in Mercury would be most
easily done using DCG rules, which by design fit lists perfectly.

	I suspect that if I wanted to do this properly (i.e. localise
	the Mercury code too), and Mercury was not likely to get real
	wide character/string types in the near future, I would use
	UTF-8 on the Mercury side and write UTF/UNICODE translation into
	wrappers on the C side.

If all you want to do is have Mercury accept a string from C and then
hand it back, that's fine.  But what processing, if any, do you want
the Mercury code to do with these strings?
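
For the UTF-8 approach in the quoted paragraph, the translation in the C
wrappers might look roughly like the sketch below.  The names utf16_unit and
utf8_to_utf16 are invented for illustration, unsigned short is assumed to be
16 bits, and the decoder is deliberately minimal: it rejects truncated
sequences, stray continuation bytes and encoded surrogates, but it does not
reject overlong encodings, which a production converter should.

```c
#include <stddef.h>

typedef unsigned short utf16_unit;  /* assumed to be 16 bits wide */

/* Convert NUL-terminated UTF-8 in src to zero-terminated UTF-16 in dst.
   cap is the capacity of dst in units, including the terminator.
   Returns the number of units written (excluding the terminator), or
   (size_t)-1 on malformed input or if dst is too small. */
size_t utf8_to_utf16(const char *src, utf16_unit *dst, size_t cap)
{
    const unsigned char *p = (const unsigned char *)src;
    size_t out = 0;

    while (*p != 0) {
        unsigned long cp;
        int extra, i;

        /* The lead byte gives the sequence length and the top bits. */
        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return (size_t)-1;
        p++;

        /* Each continuation byte must be 10xxxxxx and adds six bits. */
        for (i = 0; i < extra; i++, p++) {
            if ((*p & 0xC0) != 0x80)
                return (size_t)-1;
            cp = (cp << 6) | (*p & 0x3F);
        }

        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return (size_t)-1;

        if (cp < 0x10000) {
            /* One unit, leaving room for the terminator. */
            if (out + 1 >= cap)
                return (size_t)-1;
            dst[out++] = (utf16_unit)cp;
        } else {
            /* Outside the 16-bit range: emit a surrogate pair. */
            if (out + 2 >= cap)
                return (size_t)-1;
            cp -= 0x10000;
            dst[out++] = (utf16_unit)(0xD800 | (cp >> 10));
            dst[out++] = (utf16_unit)(0xDC00 | (cp & 0x3FF));
        }
    }

    if (cap == 0)
        return (size_t)-1;
    dst[out] = 0;
    return out;
}
```

With this split, the Mercury side only ever sees ordinary 8-bit strings, and
the 16-bit form exists only briefly at the call into the UNICODE interface.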

And of course there are the obvious questions about how many distinct
strings you need to deal with and what volume of text is involved.
--------------------------------------------------------------------------
mercury-users mailing list
post:  mercury-users at cs.mu.oz.au
administrative address: owner-mercury-users at cs.mu.oz.au
unsubscribe: Address: mercury-users-request at cs.mu.oz.au Message: unsubscribe
subscribe:   Address: mercury-users-request at cs.mu.oz.au Message: subscribe
--------------------------------------------------------------------------