[m-dev.] io__write_bytes vs Unicode

Fergus Henderson fjh at cs.mu.OZ.AU
Sat Oct 19 15:22:25 AEST 2002


Zoltan, what is the intended semantics of io__write_bytes?
It takes as input a string, and outputs some bytes.
The comment says "the bytes are taken from a string",
but this is not very clear, since it doesn't say *how*
they are taken from the string, and there are multiple
possibilities which make sense.

For the .NET back-end, strings and characters are represented as Unicode
(using UTF-16, I think).  What should io__write_bytes do?

	(1) Output the numerical value of the Unicode representation,
	    using one byte per character, and throwing an exception
	    if the numerical value doesn't fit in one byte.
	    This would be useful if the string has been constructed
	    using "characters" that are actually just numerical byte
	    values converted using enum__from_int.

	(2) Convert the Unicode representation to the system's default
	    character set, throwing an exception if the character can't
	    be represented.  This would be useful if the string contains
	    genuine characters obtained e.g. from string literals,
	    io__read_char, etc.

	(3) Output two bytes for each Unicode character.
	    (Actually, two bytes for each UTF-16 code unit.  Some Unicode
	    characters will be represented using two 16-bit code units.)
	    If so, which byte ordering should be used?

	(4) Something else?
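
To make the difference between (1), (2) and (3) concrete, here is a
rough sketch in Python rather than Mercury.  The function names and the
"charset" and "big_endian" parameters are hypothetical, chosen only for
illustration; nothing here is the actual io__write_bytes implementation.
Note also that Python iterates over code points, whereas the .NET
back-end would iterate over UTF-16 code units; for the range check in
(1) the result is the same, since surrogates do not fit in one byte
either.

```python
def write_bytes_option1(s: str) -> bytes:
    """Option (1): one byte per character, taken from its numerical value."""
    out = bytearray()
    for ch in s:
        code = ord(ch)
        if code > 0xFF:
            # The numerical value does not fit in one byte.
            raise ValueError(f"character U+{code:04X} does not fit in one byte")
        out.append(code)
    return bytes(out)


def write_bytes_option2(s: str, charset: str) -> bytes:
    """Option (2): convert to the given character set (stand-in for the
    system default); strict mode raises if a character is unrepresentable."""
    return s.encode(charset, errors="strict")


def write_bytes_option3(s: str, big_endian: bool) -> bytes:
    """Option (3): two bytes per UTF-16 code unit, with an explicit byte order."""
    return s.encode("utf-16-be" if big_endian else "utf-16-le")
```

For a string such as "A" followed by U+00E9, option (1) yields the two
bytes 0x41 0xE9 regardless of the system's character set, option (2)
yields whatever the chosen character set prescribes, and option (3)
yields four bytes whose order depends on the byte ordering chosen.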

Currently the only use of io__write_bytes in the system is in
compiler/bytecode.m.  This, together with bytecode/mb_bytecode.c
and bytecode/mb_disasm.c, which read in the data and print it out
to stdout, implicitly assume that io__write_bytes does (2).
However, I think (1) is a more natural interpretation of the specification.
The name and the specification talk about bytes, not characters.

Note that (1) and (2) have exactly the same effect if the string
contains only 7-bit ASCII and the system's default character set
includes 7-bit ASCII as a subset, which is usually the case.
But they will do different things for characters which are not in
Latin-1, or for the 8-bit characters in Latin-1 if the system's
default character set is not Latin-1.
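
The following Python sketch illustrates where (1) and (2) agree and
where they diverge.  It is only an approximation under stated
assumptions: the helper names are hypothetical, Python's "latin-1"
codec stands in for interpretation (1) because it maps code points
0..255 to the identical byte value and rejects anything larger, and
the "charset" argument stands in for the system's default character
set.

```python
def raw_bytes(s: str) -> bytes:
    # Interpretation (1): numerical value of each character as one byte.
    # Raises UnicodeEncodeError if a character is above U+00FF.
    return s.encode("latin-1")


def charset_bytes(s: str, charset: str) -> bytes:
    # Interpretation (2): convert to the (assumed) system character set.
    # Raises UnicodeEncodeError if a character cannot be represented.
    return s.encode(charset)


# 7-bit ASCII: both interpretations produce the same bytes.
ascii_same = raw_bytes("abc") == charset_bytes("abc", "utf-8")

# 8-bit Latin-1 character, system character set is not Latin-1: they differ.
latin1_char_differs = raw_bytes("\u00e9") != charset_bytes("\u00e9", "utf-8")

# 8-bit Latin-1 character, system character set is Latin-1: they agree again.
latin1_charset_same = raw_bytes("\u00e9") == charset_bytes("\u00e9", "latin-1")
```

For a character outside Latin-1, such as U+20AC, raw_bytes raises an
error while charset_bytes succeeds whenever the chosen character set
can represent it.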

If we choose (1), what should we do about compiler/bytecode.m?
Leave it broken and just add an XXX comment?

-- 
Fergus Henderson <fjh at cs.mu.oz.au>  |  "I have always known that the pursuit
The University of Melbourne         |  of excellence is a lethal habit"
WWW: <http://www.cs.mu.oz.au/~fjh>  |     -- the last words of T. S. Garp.


