[m-rev.] add character ranges to extras/lex

Sebastian Godelet sebastian.godelet+github at gmail.com
Mon Feb 24 07:03:13 AEDT 2014


I had some more time to make what I think are more substantial improvements
to the extras/lex library.

It was easier to include all previous patches and the new code in a single
pull request (based on master, but any recent branch should be fine).
I'm sorry that I didn't write the patch in one chunk. I'm still learning
how to program in Mercury, and I guess I caused too much traffic on the
mailing list :(


branches: 14.01,master
estimated review time: 30m - 1h

The lexer in extras/lex now supports:

- Unicode characters
- character ranges, a specialization of the also supported character sets

Speed improvement by directly using the sparse_bitset for NFA edge storage,
which is of lesser complexity than "or"ing thousands of characters (as
required for some Unicode scripts)
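
To illustrate the edge-storage point, here is a rough Python sketch (not the
Mercury code from the patch). It assumes a toy NFA whose edges are labelled
either with one character each or with a whole character set. Python's
built-in set stands in for sparse_bitset(char), and the names char_range,
edges_per_char, edges_per_charset and step are made up for this example.

```python
# Toy comparison of two ways to store NFA edges for a character range.
# A Python set stands in for Mercury's sparse_bitset(char); all names
# here are hypothetical and not taken from extras/lex.

def char_range(first, last):
    """Return the set of characters from first to last, inclusive."""
    return {chr(cp) for cp in range(ord(first), ord(last) + 1)}

def edges_per_char(src, dst, chars):
    """One edge per character, as if the range were an 'or' of atoms."""
    return [(src, ch, dst) for ch in sorted(chars)]

def edges_per_charset(src, dst, chars):
    """A single edge labelled with the whole character set."""
    return [(src, frozenset(chars), dst)]

def step(edges, state, ch):
    """Follow the edges leaving `state` that accept `ch`."""
    targets = set()
    for src, label, dst in edges:
        if src != state:
            continue
        accepts = ch in label if isinstance(label, frozenset) else ch == label
        if accepts:
            targets.add(dst)
    return targets

# Greek and Coptic block, U+0370 to U+03FF: 144 code points.
greek = char_range("\u0370", "\u03ff")
per_char = edges_per_char(0, 1, greek)
per_set = edges_per_charset(0, 1, greek)
```

Both representations accept the same characters, but the per-character form
needs one edge for every code point in the range, while the set-labelled form
needs a single edge whose label is tested for membership.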

- changed the basic transition char to charset
- transition --> atom(Char) is now made a singleton charset
- compiled regexp now uses a record for state_no and char instead of a
  packed int, otherwise the codepoint gets truncated
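
The truncation problem can be shown with a small Python sketch. The bit
layout below (8 bits for the character) is an assumption made for
illustration only, since this mail does not state how the old packed int was
laid out. The names pack, unpack and Transition are likewise hypothetical.

```python
from collections import namedtuple

# Hypothetical layout: the low 8 bits hold the character, the remaining
# bits hold the state number. The real layout in lex may have differed.
CHAR_BITS = 8
CHAR_MASK = (1 << CHAR_BITS) - 1

def pack(state_no, ch):
    """Pack a state number and a character into one int (lossy)."""
    return (state_no << CHAR_BITS) | (ord(ch) & CHAR_MASK)

def unpack(packed):
    """Recover (state_no, char) from a packed int."""
    return packed >> CHAR_BITS, chr(packed & CHAR_MASK)

# A record keeps both fields at full width, so nothing is cut off.
Transition = namedtuple("Transition", ["state_no", "char"])

ascii_round_trip = unpack(pack(3, "a"))         # survives packing
greek_round_trip = unpack(pack(3, "\u03bb"))    # code point is cut to 8 bits
as_record = Transition(3, "\u03bb")             # keeps the full code point
```

With this layout an ASCII character survives the round trip, whereas a code
point above 0xFF comes back as a different character. The record form has no
such limit.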

- added use cases for the liblex changes
- including Unicode support

- dot and nl recognize Windows newlines as well
- introduced range/2 for character ranges
- added a charset type alias == sparse_bitset(char)
- added a regexp(charset) instance
- and a type tag for charset in type regexp
- exporting charset helper functions, e.g. charset and charset_from_lists
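
As a rough picture of the newline change, the following Python sketch uses
the standard re module rather than lex itself. It assumes that "recognizing
Windows newlines" means nl accepts "\r\n" in addition to "\n", and that dot
excludes both "\r" and "\n". That reading is my interpretation of the item
above, and the pattern names NL and DOT are made up for the example.

```python
import re

# Assumed behaviour, not copied from lex: a newline is either "\n" or
# the Windows sequence "\r\n"; dot matches anything except those two.
NL = re.compile(r"\r?\n")
DOT = re.compile(r"[^\r\n]")

def is_newline(text):
    """True if the whole of `text` is one newline sequence."""
    return NL.fullmatch(text) is not None

def is_dot_char(ch):
    """True if `ch` is a single character that dot would accept."""
    return DOT.fullmatch(ch) is not None
```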

