FW: [mercury-users] Records
Richard A. O'Keefe
ok at hermes.otago.ac.nz
Wed Nov 10 13:58:49 AEDT 1999
> - Java, ANSI C++, and C9x all allow Unicode characters in comments,
> (String,wchar_t*,wchar_t*) strings, *and in identifiers*, using the
> same syntax.
In ANSI/ISO C++ and C99 (note x=9 ;-), the `wchar_t' type
*may* be Unicode, but it is not required to be.
Whether the wchar_t type is Unicode or not, the \uhhhh and \Uhhhhhhhh
notation *is* by definition Unicode/ISO 10646.
> It would not be that hard to add `wchar' and `wstring' types
> to the Mercury standard library, if there was a big demand for it.
> But this type would then be dependent on the implementation
> (i.e. typically on the underlying C implementation).
It would be dependent on *your* implementation, and need not be in
any way dependent on the underlying C implementation. For example,
I have my own functions for reading and writing UTF-8, because the
three C systems I deal with that *say* they support Unicode (the
other two don't) don't actually work for me, and I cannot discover
from the documentation why. What I am saying is that in my own C89
code, I can deal with wide characters without relying on *any*
support whatsoever in the C implementation. The bulkiest thing is
the table that implements character classification (and a BIG thank-you
to unicode.org for the on-line file it's derived from).
> Alternatively it would be possible to change the representation of the
> `string' and `char' types in the Mercury implementation so that they
> matched C's `wchar_t' type.
There are good reasons for doing that; but there is no *implementation*
need to do so. The thing can be done perfectly well even in Classic C
where there isn't any wchar_t type. Had Unicode existed when I was at
Quintus, I would have argued for doing this on the then pressing grounds
that it would have made porting Prolog code between UNIX and MVS vastly
easier.
> However, I'm not sure that this would be worthwhile. The introduction
> of UCNs in C was controversial. Paul Eggert, whose opinion I have a
> high respect for, has described UCNs as being of interest to language
> lawyers, and a source of make-work for implementors, but unlikely to be
> used much in practice.
UCNs are exactly the same as trigraphs: they are an attempt to solve
_in_ the language a problem that is best solved _outside_ the language.
They are not anything that any reasonable person would type in by
hand; they are merely a device for *transporting* Java/C++/C code to
or through Unicode-hostile environments. As for being "a source of
make-work for implementors", my convert_UCN_to_UCS4() and
convert_UCS4_to_UCN() functions (with interfaces modelled on the
conversion functions in the Unicode book) come to 97 SLOC (excluding
comments, blank lines, and punctuation-only lines), and took under an
hour to code. Stuffing this feature into a C tokeniser is marginally
harder (note that the extended C compiler for Plan 9 already handles
UTF-8 directly in the tokeniser), but not a _lot_ trickier.
Here's a UCN to UTF-8 converter, built on top of conversion functions
I already had. (convert_UCS4_to_UTF8 comes from the Unicode book,
convert_UCN_to_UCS4 is mine).
#include <stdio.h>
#include <stdlib.h>
#include "cvtutf.h"

#define BUFLEN 8192

int main(void) {
    UCHR in8 [BUFLEN],   *p8,  *e8  = in8  + BUFLEN;
    UCS4 in31[BUFLEN],   *p31, *n31, *e31 = in31 + BUFLEN;
    UTF8 out8[BUFLEN*3], *o8,  *f8  = out8 + BUFLEN*3;
    long L;

    for (L = 1; fgets((char*)in8, sizeof in8, stdin) != 0; L++) {
        p8 = in8, p31 = in31;
        switch (convert_UCN_to_UCS4(&p8, e8, &p31, e31)) {
            case source_exhausted:
                fprintf(stderr, "Bad UCN in line %ld\n", L);
                return EXIT_FAILURE;
            case target_exhausted:
                fprintf(stderr, "Internal error, notify RAO'K\n");
                abort();
            case converted_ok:
                break;
        }
        n31 = p31;      /* end of the code points actually produced */
        p31 = in31, o8 = out8;
        switch (convert_UCS4_to_UTF8(&p31, n31, &o8, f8)) {
            case source_exhausted:
                fprintf(stderr, "Internal error, notify RAO'K\n");
                abort();
            case target_exhausted:
                fprintf(stderr, "Buffer too small at line %ld\n", L);
                return EXIT_FAILURE;
            case converted_ok:
                break;
        }
        if (fwrite(out8, 1, (size_t)(o8-out8), stdout) != (size_t)(o8-out8)) {
            fprintf(stderr, "Output error at line %ld\n", L);
            return EXIT_FAILURE;
        }
    }
    return EXIT_SUCCESS;
}
There are a couple of changes I could make to my library to shorten
this, but it's already quite a lot simpler than the UNIX 'cat' command.
Ok, so it has this bug: if there is an extremely long input line
where ASCII character number 8192 happens to land _inside_ a UCN, then
it's not going to work, *but* in that case it will print an error message
and halt. For 37 SLOC it's not so bad; the UTF-8 to UCN conversion
program is 33 SLOC.
Here's a quote from one article that (Eggert) recently
wrote in comp.std.c:
| The problem of writing internationalized code is a large one, and UCNs
| attack only a tiny part of it. They don't solve the overall problem,
| nor do they pretend to.
What they attack is the *transport* problem.
An important concept in the C and C++ standards is that of a
"translation phase". A C++/C99 ***system*** has to handle UCNs
somehow, but that "translation phase" could very well involve
running a separate program. It might be the preprocessor, it might
even be another program in front of that.
| Anybody who regularly writes and deploys applications that _do_ solve
| the overall problem necessarily uses technologies that are more
| powerful than UCNs. UCNs will displace these technologies only if
| they offer compelling advantages. But UCNs have no real advantages.
As a transport mechanism, UTF-8 is superior to UCNs. (A 16-bit Unicode
character takes at most 3 bytes in UTF-8; every non-ASCII 16-bit character
takes 6 bytes as a UCN.) UCNs do, however, have two advantages.
- If the bulk of the text is ASCII characters, the text can remain
legible without the need for Unicode-aware tools at all.
- They _suggest_ to compiler writers a technique for encoding wide-
character identifiers in a way that 8-bit-only linkers might accept.
That means that there is a chance (slim, but a chance) that two
different C compilers *might* inter-operate...
| On the contrary. UCNs don't work across a wide spectrum of programming
| languages.
This is a rather bizarre comment. On the one hand, it's a feature in the
C99 standard, C++ standard, and Java specification. Criticising UCNs for
not "work[ing] across a wide spectrum of programming languages" is rather
like criticising C switch() statements on the same grounds. (Um, on
second thought, that's a sitting duck. Make that C scanf() formats.)
| They are unreadable without special tools that (by and
| large) don't exist.
This too is a baffling comment. The point of UCNs is precisely
that they *are* marginally readable without special tools, whereas
UCS4, UTF16, or UTF8 are *not*. As for the readability of UCN text,
take my 37 SLOC ucn2utf.c, hook it up to the Yudit editor with a
tiny shell script, and presto chango a UCN viewer built out of existing
components.
| And they require Unicode support, perhaps even at
| run-time for practical implementations.
Again, it is precisely the point of UCNs *NOT* to require much
Unicode support, and none from the language the compiler is written in.
You need
- a way to map Unicode characters to the native character set,
IF that character set is not a truncation of Unicode (so a C
compiler in an ISO Latin 1 environment supporting only 8-bit
characters needs no table). This table need not be full.
A huge collection of such tables is already available for free download.
- a way to classify Unicode characters so that identifiers can be
parsed correctly. The so-and-so standardisers didn't even follow the
Unicode rules for identifiers, although they could have without
backwards incompatibility, in order to make life easy for compiler
writers. The requisite information is again available for download
free from unicode.org, and can be compressed into quite a small table.
| All of these are serious
| technical drawbacks that should be immediately obvious to anyone with
| experience in developing internationalized code. The UCN-related
| ambiguities and undefined behaviors in C99 are additional red flags.
My news feed has been so hopeless (thanks, ISP) that I've given up reading
news, including comp.std.c, so I don't know what those ambiguities and
undefined behaviours are. However, ambiguities and undefined behaviours
are nothing new in C standards, so it seems unfair to blame UCNs.
It seems particularly unfair to *blame* Java/C++/C for *trying* to solve a
problem that other programming language standards (Javascript, Ada95) push
under someone else's carpet. If a Javascript string literal is inside an
XML or HTML4 file, there is a standard notation that can be used, but it
is not clear to me that
"ङntèA;" (Javascript inside HTML4)
is any improvement over
"\u3239" "nt" "\u232A" (C99)
as a way of depicting "<nt>" with true angle brackets.
Anyway, even if we were to add UCN support to Mercury, for the core
language and for the identifiers in the standard library I think we
should stick with just the printable ASCII characters that appear on
standard keyboards.
Could you at least allow the ×, ÷ and ¬
characters as _alternatives_ for multiplication, division, and
logical negation?
--------------------------------------------------------------------------
mercury-users mailing list
post: mercury-users at cs.mu.oz.au
administrative address: owner-mercury-users at cs.mu.oz.au
unsubscribe: Address: mercury-users-request at cs.mu.oz.au Message: unsubscribe
subscribe: Address: mercury-users-request at cs.mu.oz.au Message: subscribe
--------------------------------------------------------------------------