[m-rev.] add character ranges to extras/lex

Sebastian Godelet sebastian.godelet+github at gmail.com
Sun Feb 16 06:30:32 AEDT 2014


For review by anyone.

To make lexemes easier to define, add a new range/2 function that works
as a simple character class, like those in Perl regular expressions.
For example: range('a', 'f') = any("abcdef").
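
The intended semantics can be modelled outside Mercury. The following is a
minimal Python sketch, not part of the patch: range_chars is a hypothetical
helper that lists the characters an inclusive range would cover, assuming
code-point ordering as given by char.to_int. A reversed range yields the
empty string here, standing in for the case where the patch returns null.

```python
def range_chars(start: str, end: str) -> str:
    """Return every character from start to end, inclusive, as a string.

    Hypothetical model of range/2: the result is the string that would be
    passed to any/1 to get an equivalent character class.
    """
    lo, hi = ord(start), ord(end)
    if lo > hi:
        # Reversed bounds: no characters (the patch returns null here).
        return ""
    return "".join(chr(code) for code in range(lo, hi + 1))


print(range_chars("a", "f"))
```

Both end points are included, so a single-character range such as
range_chars("x", "x") gives just "x".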

extras/lex/lex.m:
   Add func range(char, char) = regexp.
   Redefine nl so that it matches both POSIX and Windows newlines.

extras/lex/samples/lex_demo.m:
   Add a word recognizer just before the "junk" lexeme.

I hope you find this useful.
If my changes are approved in some form (this is my first contribution),
I'd happily enhance the basic lexer to make it more expressive and powerful.
I was thinking of range/3: range(From, To, Exclude), where Exclude has
type set(char).

Sebastian Godelet.

diff --git a/extras/lex/lex.m b/extras/lex/lex.m
index c6c7930..7601f5a 100644
--- a/extras/lex/lex.m
+++ b/extras/lex/lex.m
@@ -107,6 +107,7 @@
 :- func anybut(string) = regexp.     % anybut("abc") is complement of any("abc")
 :- func ?(T) = regexp <= regexp(T).  % ?(R)       = R or null
 :- func +(T) = regexp <= regexp(T).  % +(R)       = R ++ *(R)
+:- func range(char, char) = regexp.  % range('a', 'z') = any("ab...xyz")

     % Some useful single-char regexps.
     %
@@ -746,6 +747,30 @@ str_foldr(Fn, S, X, I) =

 +(R) = (R ++ *(R)).

+range(S, E) = R :-
+    char.to_int(S, Si),
+    char.to_int(E, Ei),
+    ( if Si < Ei then
+        R = build_range(Si + 1, Ei, re(S))
+      else if Si = Ei then
+        R = re(S)
+      else
+        R = null
+    ).
+
+:- func build_range(int, int, regexp) = regexp.
+
+build_range(S, E, R0) = R :-
+    ( if S < E then
+        char.det_from_int(S, C),
+        R1 = (R0 or re(C)),
+        R = build_range(S + 1, E, R1)
+      else if S = E then
+        R = (R0 or re(char.det_from_int(S)))
+      else
+        throw(exception.software_error("invalid range!"))
+    ).
+
 %-----------------------------------------------------------------------------%
 % Some useful single-char regexps.

@@ -768,13 +793,13 @@ alpha      = (lower or upper).
 alphanum   = (alpha or digit).
 identstart = (alpha or ('_')).
 ident      = (alphanum or ('_')).
-nl         = re('\n').
 tab        = re('\t').
 spc        = re(' ').

 %-----------------------------------------------------------------------------%
 % Some useful compound regexps.

+nl         = (?('\r') ++ '\n').  % matches both Posix and Windows newline.
 nat        = +(digit).
 signed_int = ?("+" or "-") ++ nat.
 real       = signed_int ++ (
diff --git a/extras/lex/samples/lex_demo.m b/extras/lex/samples/lex_demo.m
index 6d30ac2..68aef0d 100644
--- a/extras/lex/samples/lex_demo.m
+++ b/extras/lex/samples/lex_demo.m