[mercury-users] XML Parsing

Richard A. O'Keefe ok at atlas.otago.ac.nz
Tue Jun 5 15:41:32 AEST 2001


	1. (Prepare for scream from R.A.O'K :o) rewrite those parts of the
	parser that are Unicode aware to be ASCII-only and use bitmaps to
	identify character subsets.
	
You asked for it!

In my own XML parser, I cheated.  I did the old Quintus trick of saying
that anything above the ASCII range can be used in an identifier.

So a simple
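    % Codes above the ASCII range count as letters; ASCII codes are
    % looked up by ascii_classify/2 (not shown).
    classify(Ch, Class) :-
        (   Ch > 127 -> Class = letter
        ;   ascii_classify(Ch, Class)
        ).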
sort of approach will actually work pretty well.
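
To make the clause above loadable on its own, here is a minimal sketch of
what the ASCII half might look like.  The predicate body and the class
names other than letter (digit, space, punct, illegal) are assumptions
made for illustration; they are not taken from the original parser.

```prolog
% classify(+Code, -Class): anything above ASCII is treated as a letter.
classify(Ch, Class) :-
    (   Ch > 127 -> Class = letter
    ;   ascii_classify(Ch, Class)
    ).

% Hypothetical ASCII classifier using plain range tests.
% 95 is '_' and 58 is ':', both of which may start an XML name.
ascii_classify(Ch, Class) :-
    (   Ch >= 0'a, Ch =< 0'z -> Class = letter
    ;   Ch >= 0'A, Ch =< 0'Z -> Class = letter
    ;   ( Ch =:= 95 ; Ch =:= 58 ) -> Class = letter
    ;   Ch >= 0'0, Ch =< 0'9 -> Class = digit
    ;   ( Ch =:= 32 ; Ch =:= 9 ; Ch =:= 10 ; Ch =:= 13 ) -> Class = space
    ;   Ch > 32, Ch < 127 -> Class = punct
    ;   Class = illegal        % remaining control codes and DEL
    ).
```

The range tests could be swapped for a 128-entry table or bitmap; the
point is only that nothing bigger than ASCII needs a table.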

What's the difference between doing this and doing the "Right Thing"?
Well, to start with, XML is defined in terms of ISO 10646, and if you
look _very_ carefully at the SGML declaration in the XML spec you'll see
that you'll need some *very* big tables (more than a million entries) if
you try a direct table-driven approach.
Then you'll discover that it's not worth it, as for any ISO 10646
character, exactly one of three things is true:
    - it is treated as a letter in XML
    - it is illegal in XML
    - it is a legal non-letter and is in the ASCII range.
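
The three cases above can be sketched as a single predicate.  This is a
sketch under stated assumptions: "illegal" is taken to mean "outside the
Char production of XML 1.0", and the names xml_class/2, ascii_class/2,
legal_non_ascii/1 and the class name other are hypothetical.

```prolog
% xml_class(+Code, -Class): Class is letter, other, or illegal.
xml_class(Ch, Class) :-
    (   Ch =< 127 -> ascii_class(Ch, Class)
    ;   legal_non_ascii(Ch) -> Class = letter
    ;   Class = illegal
    ).

% ASCII range: letters, legal non-letters, and illegal control codes.
% 95 is '_' and 58 is ':'; 9, 10 and 13 are TAB, LF and CR.
ascii_class(Ch, Class) :-
    (   Ch >= 0'a, Ch =< 0'z -> Class = letter
    ;   Ch >= 0'A, Ch =< 0'Z -> Class = letter
    ;   ( Ch =:= 95 ; Ch =:= 58 ) -> Class = letter
    ;   ( Ch >= 32 ; Ch =:= 9 ; Ch =:= 10 ; Ch =:= 13 ) -> Class = other
    ;   Class = illegal
    ).

% Non-ASCII part of the XML 1.0 Char production:
% [#x80-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
legal_non_ascii(Ch) :-
    (   Ch >= 0x80,    Ch =< 0xD7FF -> true
    ;   Ch >= 0xE000,  Ch =< 0xFFFD -> true
    ;   Ch >= 0x10000, Ch =< 0x10FFFF
    ).
```

Three range tests replace the million-entry table; surrogates, #xFFFE,
#xFFFF and anything past #x10FFFF come out as illegal.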

For tighter packing, see the way Java does it.

Oh yeah, in my non-validating XML parser in Mercury, quite a bit of the
time is soaked up in converting from UTF-8 to Unicode code points and
back again.  I think I should stick to UTF-8 throughout.
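
Staying in UTF-8 fits the "above ASCII is a letter" rule well, because
every byte of a multi-byte UTF-8 sequence is above 127.  A minimal
sketch, assuming the input is a list of byte values; name_bytes/3 and
name_byte/1 are hypothetical names, and the scanner does not enforce the
stricter rule for the first character of a name.

```prolog
% name_bytes(+Bytes, -NameBytes, -Rest): split off the longest leading
% run of bytes that may appear in a name, without decoding UTF-8.
name_bytes([B|Bs], [B|Ns], Rest) :-
    name_byte(B), !,
    name_bytes(Bs, Ns, Rest).
name_bytes(Rest, [], Rest).

% A byte above 127 is always part of a multi-byte UTF-8 sequence, so it
% is accepted as-is.  95, 58, 45 and 46 are '_', ':', '-' and '.'.
name_byte(B) :-
    (   B > 127
    ;   B >= 0'a, B =< 0'z
    ;   B >= 0'A, B =< 0'Z
    ;   B >= 0'0, B =< 0'9
    ;   B =:= 95
    ;   B =:= 58
    ;   B =:= 45
    ;   B =:= 46
    ), !.
```

For example, the bytes of "café>" are [99,97,102,195,169,62]; the scan
keeps the first five bytes as the name and leaves [62] as the rest.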
