[m-rev.] New regex module in extras/lex

Ralph Becket rafe at cs.mu.OZ.AU
Fri Nov 22 17:02:39 AEDT 2002


It might be nice to include this one in extras in the release since
people have asked for such a thing in the past.  On the other hand, it
is brand new code...

- Ralph

Estimated hours taken: 32
Branches: main

Added a new module, regex, as a companion to lex.  The new module provides
functionality for converting conventional Unix-style regular expressions
into regexps for use with lex, along with a number of search and
search-and-replace functions for strings.

The new functionality has been tested fairly thoroughly (and the testing
led to several bugs in lex being identified and fixed).

extras/lex/lex.automata.m:
extras/lex/lex.buf.m:
extras/lex/lex.convert_NFA_to_DFA.m:
extras/lex/lex.regexp.m:
	Trivial formatting changes.

extras/lex/lex.lexeme.m:
	Removed the parameter on inst compiled_lexeme.

extras/lex/lex.m:
	Various formatting changes.

	Added pred offset_from_start/3 which can be used to identify
	the `current' point in the input stream with respect to lexing.

	Added pred read_char/3 which can be used to read the `next'
	char from the input stream without doing any lexing.

	Added a field init_winner_func to the lexer_instance type.  This
	is used to resolve a bug whereby regular expressions that match
	the empty string were not being spotted at the start of the input
	stream.

	Fixed some bugs whereby an exception was incorrectly thrown in
	some circumstances when the end of the input stream was reached.

extras/lex/regex.m:
	Added.  This file defines the functions for converting Unix-style
	regular expression strings into regexps for use with lex and into
	regexes for use with the string search(-and-replace) predicates
	defined in this module.

extras/lex/test_regex.m:
	A little test harness for regex.m.
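For readers unfamiliar with how the two modules are meant to fit together,
the intended flow is roughly as follows.  This is a hedged sketch only: the
function names regex/1 and replace_all/3 used below are assumptions about
the interface exported by the new regex.m, not verified against it.

```mercury
% Hypothetical usage sketch: convert a conventional Unix-style regular
% expression string into a regex value, then apply one of the
% search-and-replace operations described above.
% regex/1 and replace_all/3 are assumed names; the actual regex.m
% interface may differ.
:- module regex_example.
:- interface.
:- import_module io.
:- pred main(io::di, io::uo) is det.
:- implementation.
:- import_module regex, string, list.

main(!IO) :-
    % "[0-9]+" in conventional Unix syntax; regex/1 converts it into
    % the form used by the search predicates in regex.m.
    R = regex("[0-9]+"),
    % Replace every run of digits in the subject string.
    io__write_string(replace_all(R, "N", "abc123def456"), !IO),
    io__nl(!IO).
```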

Index: lex.automata.m
===================================================================
RCS file: /home/mercury1/repository/mercury/extras/lex/lex.automata.m,v
retrieving revision 1.1
diff -u -r1.1 lex.automata.m
--- lex.automata.m	21 Feb 2001 16:29:36 -0000	1.1
+++ lex.automata.m	22 Nov 2002 00:14:43 -0000
@@ -1,4 +1,4 @@
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 % lex.automata.m
 % Copyright (C) 2001 Ralph Becket <rbeck at microsoft.com>
 % Fri Aug 18 15:48:09 BST 2000
@@ -10,7 +10,7 @@
 %   BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
 %   BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
 %
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- module lex__automata.
 
@@ -48,5 +48,5 @@
 :- inst atom_transition == bound(trans(ground, ground, ground)).
 :- inst null_transition == bound(null(ground, ground)).
 
-%------------------------------------------------------------------------------%
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
Index: lex.buf.m
===================================================================
RCS file: /home/mercury1/repository/mercury/extras/lex/lex.buf.m,v
retrieving revision 1.2
diff -u -r1.2 lex.buf.m
--- lex.buf.m	4 Oct 2001 07:46:03 -0000	1.2
+++ lex.buf.m	22 Nov 2002 00:14:25 -0000
@@ -1,4 +1,4 @@
-% -----------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 % vim: ts=4 sw=4 et tw=0 wm=0 ff=unix
 %
 % lex.buf.m
@@ -89,7 +89,7 @@
 % means that the region prior to the cursor in the buffer is
 % now available for garbage collection.
 %
-% -----------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- module lex__buf.
 
@@ -166,8 +166,8 @@
 :- func commit(buf_state(T)) = buf_state(T).
 :- mode commit(in(buf_state)) = out(buf_state) is det.
 
-% -----------------------------------------------------------------------------%
-% -----------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- implementation.
 
@@ -192,13 +192,13 @@
 % :- func initial_buf_size = int.
 % initial_buf_size = 32.
 
-% -----------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 init(BufReadPred, BufState, Buf) :-
     BufState = buf_state(0, 0, 0, 0, initial_buf_size, no, BufReadPred),
     Buf      = array__init(initial_buf_size, ('@')).
 
-% -----------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 read(Result, BufState0, BufState, Buf0, Buf, Src0, Src) :-
 
@@ -252,7 +252,7 @@
         read(Result, BufState1, BufState, Buf1, Buf, Src0, Src)
     ).
 
-% -----------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
     % Garbage collects the chars between the origin and start and
     % extends the buffer if the remaining space is below the low
@@ -275,7 +275,7 @@
 
     Buf = shift_buf(0, Size0 - GarbageLength, GarbageLength, Buf0, Buf1).
 
-% -----------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- func shift_buf(int, int, int, buf, buf) = buf.
 :- mode shift_buf(in, in, in, array_ui, array_di) = array_uo is det.
@@ -288,24 +288,23 @@
         Tgt
     ).
 
-% -----------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 start_offset(BufState) = BufState ^ buf_start.
 
-% -----------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 cursor_offset(BufState) = BufState ^ buf_cursor.
 
-% -----------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 rewind_cursor(Offset, BufState) =
-    ( if ( Offset < BufState ^ buf_start ; BufState ^ buf_cursor < Offset ) then
-        throw("buf: rewind/2: offset arg outside valid range")
-      else
-        BufState ^ buf_cursor := Offset
+    ( if   ( Offset < BufState ^ buf_start ; BufState ^ buf_cursor < Offset )
+      then throw("buf: rewind/2: offset arg outside valid range")
+      else BufState ^ buf_cursor := Offset
     ).
 
-% -----------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 string_to_cursor(BufState, Buf) = String :-
     From   = BufState ^ buf_start - BufState ^ buf_origin,
@@ -313,9 +312,9 @@
     To     = From + Length,
     String = string__from_char_list(array__fetch_items(Buf, From, To)).
 
-% -----------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 commit(BufState) = ( BufState ^ buf_start := BufState ^ buf_cursor ).
 
-% -----------------------------------------------------------------------------%
-% -----------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
Index: lex.convert_NFA_to_DFA.m
===================================================================
RCS file: /home/mercury1/repository/mercury/extras/lex/lex.convert_NFA_to_DFA.m,v
retrieving revision 1.2
diff -u -r1.2 lex.convert_NFA_to_DFA.m
--- lex.convert_NFA_to_DFA.m	4 Oct 2001 07:46:03 -0000	1.2
+++ lex.convert_NFA_to_DFA.m	22 Nov 2002 00:15:08 -0000
@@ -1,4 +1,4 @@
-%-----------------------------------------------------------------------------
+%----------------------------------------------------------------------------
 %
 % vim: ts=4 sw=4 et tw=0 wm=0 ff=unix
 %
@@ -12,7 +12,7 @@
 %   BE RELEASED UNDER WHATEVER LICENCE IS DEEMED APPROPRIATE
 %   BY THE ADMINISTRATORS OF THE MERCURY PROJECT.
 %
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- module lex__convert_NFA_to_DFA.
 
@@ -26,8 +26,8 @@
 :- mode convert_NFA_to_DFA(in(null_transition_free_state_mc)) =
             out(null_transition_free_state_mc) is det.
 
-%------------------------------------------------------------------------------%
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- implementation.
 
@@ -50,7 +50,7 @@
 :- type state_set_no_map
     ==      map(state_set, int).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 convert_NFA_to_DFA(NFA) = NFA :-
 
@@ -96,7 +96,7 @@
         %
     DFA = state_mc(DFAStartState, DFAStopStates, DFATransitions).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
     % If S is a state_no set, then S -c-> S' where
     % S' = {y | x in S, x -c-> y}
@@ -135,7 +135,7 @@
         compute_DFA_state_sets_and_transitions(Ts, NewSs, Ss1, Ss, STs1, STs)
     ).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
     % Given a state_no set and a set of transition chars for that
     % state_no set, find the set of state_no set transitions (said
@@ -149,7 +149,7 @@
     TCs = transition_chars(Ts, S),
     STs = list__map(state_set_transition(Ts, S), TCs).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
     % Given a state_no set, find all the transition chars:
     %
@@ -164,7 +164,7 @@
         )
     ).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- func transition_chars_for_state(transitions, state_no) = list(char).
 :- mode transition_chars_for_state(in, in) = out is det.
@@ -172,14 +172,14 @@
 transition_chars_for_state(Ts, X) =
     list__filter_map(transition_char_for_state(X), Ts).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- func transition_char_for_state(state_no, transition) = char.
 :- mode transition_char_for_state(in, in) = out is semidet.
 
 transition_char_for_state(X, trans(X, C, _Y)) = C.
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
     % Given a state_no set and a char, find the state_no set transition:
     %
@@ -190,7 +190,7 @@
 
 state_set_transition(Ts, S, C) = trans(S, C, target_state_set(Ts, S, C)).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
     % Given a state_no set and a char, find the target state_no set:
     %
@@ -201,28 +201,28 @@
 target_state_set(Ts, S, C) =
     set__power_union(set__map(target_state_set_0(Ts, C), S)).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- func target_state_set_0(transitions, char, state_no) = state_set.
 
 target_state_set_0(Ts, C, X) =
     set__list_to_set(list__filter_map(target_state(X, C), Ts)).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- func target_state(state_no, char, transition) = state_no.
 :- mode target_state(in, in, in) = out is semidet.
 
 target_state(X, C, trans(X, C, Y)) = Y.
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- func compute_DFA_stop_state_sets(state_set, state_sets) = state_sets.
 
 compute_DFA_stop_state_sets(StopStates, StateSets) =
     set__filter_map(stop_state_set(StopStates), StateSets).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- func stop_state_set(state_set, state_set) = state_set.
 :- mode stop_state_set(in, in) = out is semidet.
@@ -230,7 +230,7 @@
 stop_state_set(StopStates, StateSet) = StateSet :-
     not set__empty(StopStates `set__intersect` StateSet).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- func number_state_sets(state_sets) = state_set_no_map.
 
@@ -244,7 +244,7 @@
         map__init,  StateNos
     ).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- func map_state_set_transitions_to_numbers(state_set_no_map,
             state_set_transitions
@@ -261,5 +261,5 @@
         STs
     ).
 
-%------------------------------------------------------------------------------%
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
Index: lex.lexeme.m
===================================================================
RCS file: /home/mercury1/repository/mercury/extras/lex/lex.lexeme.m,v
retrieving revision 1.2
diff -u -r1.2 lex.lexeme.m
--- lex.lexeme.m	4 Oct 2001 07:46:03 -0000	1.2
+++ lex.lexeme.m	22 Nov 2002 01:36:07 -0000
@@ -1,4 +1,4 @@
-%-----------------------------------------------------------------------------
+%----------------------------------------------------------------------------
 %
 % vim: ts=4 sw=4 et tw=0 wm=0 ff=unix
 %
@@ -19,7 +19,7 @@
 % stream or an error if no match occurs.
 %
 %
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- module lex__lexeme.
 
@@ -33,8 +33,8 @@
                 state              :: state_no,
                 transition_map     :: transition_map
             ).
-:- inst compiled_lexeme(Inst)
-    --->    compiled_lexeme(Inst, ground, ground).
+:- inst compiled_lexeme
+    --->    compiled_lexeme(token_creator, ground, ground).
 
 :- type transition_map
     --->    transition_map(
@@ -68,22 +68,22 @@
     % an accepting state_no.
     %
 :- pred next_state(compiled_lexeme(T), state_no, char, state_no, bool).
-:- mode next_state(in(live_lexeme), in, in, out, out) is semidet.
+:- mode next_state(in(compiled_lexeme), in, in, out, out) is semidet.
 
     % Succeeds iff a compiled_lexeme is in an accepting state_no.
     %
-:- pred in_accepting_state(live_lexeme(T)).
-:- mode in_accepting_state(in(live_lexeme)) is semidet.
+:- pred in_accepting_state(compiled_lexeme(T)).
+:- mode in_accepting_state(in(compiled_lexeme)) is semidet.
 
-%------------------------------------------------------------------------------%
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- implementation.
 
 :- import_module list, set.
 :- import_module lex__automata, lex__convert_NFA_to_DFA, lex__regexp.
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 compile_lexeme(Lexeme) = CompiledLexeme :-
     Lexeme         = (RegExp - TokenCreator),
@@ -98,7 +98,7 @@
     TransitionMap  = transition_map(Accepting, Rows),
     CompiledLexeme = compiled_lexeme(TokenCreator, StartState, TransitionMap).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- func find_top_state(transitions) = int.
 :- mode find_top_state(in(atom_transitions)) = out is det.
@@ -106,7 +106,7 @@
 find_top_state([])                    = 0.
 find_top_state([trans(X, _, Y) | Ts]) = max(X, max(Y, find_top_state(Ts))).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- func set_accepting_states(set(state_no), bitmap) = bitmap.
 :- mode set_accepting_states(in, bitmap_di) = bitmap_uo is det.
@@ -124,7 +124,7 @@
 set_accepting_states_0([St | States], Bitmap) =
     set_accepting_states_0(States, bitmap__set(Bitmap, St)).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- func set_up_rows(int, int, transitions) = list(row).
 :- mode set_up_rows(in, in, in(atom_transitions)) = out is det.
@@ -136,7 +136,7 @@
                    set_up_rows(I + 1, N, Transitions)]
     ).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- func compile_transitions_for_state(int, list(byte_transition), transitions) =
             row.
@@ -155,7 +155,7 @@
         Ts
     ).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 byte_transition(Byte, State) = (State << 8) \/ Byte.
 
@@ -163,7 +163,7 @@
 
 btr_state(BT) = BT `unchecked_right_shift` 8.
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 next_state(CLXM, CurrentState, Char, NextState, IsAccepting) :-
     Rows            = CLXM ^ transition_map ^ rows,
@@ -171,7 +171,7 @@
     find_next_state(char__to_int(Char), Rows ^ elem(CurrentState), NextState),
     IsAccepting     = bitmap__get(AcceptingStates, NextState).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- pred find_next_state(int, array(byte_transition), state_no).
 :- mode find_next_state(in, in, out) is semidet.
@@ -194,12 +194,12 @@
       else find_next_state_0(Lo + 1, Hi, Byte, ByteTransitions, State)
     ).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 in_accepting_state(CLXM) :-
     bitmap__is_set(
         CLXM ^ transition_map ^ accepting_states, CLXM ^ state
     ).
 
-%------------------------------------------------------------------------------%
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
Index: lex.m
===================================================================
RCS file: /home/mercury1/repository/mercury/extras/lex/lex.m,v
retrieving revision 1.4
diff -u -r1.4 lex.m
--- lex.m	29 Oct 2002 03:38:30 -0000	1.4
+++ lex.m	22 Nov 2002 04:59:27 -0000
@@ -1,4 +1,4 @@
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 % vim: ts=4 sw=4 et tw=0 wm=0 ff=unix
 %
 % lex.m
@@ -16,7 +16,7 @@
 % This module puts everything together, compiling a list of lexemes
 % into state machines and turning the input stream into a token stream.
 %
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- module lex.
 
@@ -133,11 +133,11 @@
    % Utility function to return noval tokens.
    % Use it in the form `return(my_token) inside a lexeme definition.
    %
-:- func return(T,string) = T.
+:- func return(T, string) = T.
 
    % Utility operator to create lexemes.
    %
-:- func (T1 -> token_creator(Tok)) = pair(regexp,token_creator(Tok))
+:- func (T1 -> token_creator(Tok)) = pair(regexp, token_creator(Tok))
             <= regexp(T1).
 
     % Construct a lexer from which we can generate running
@@ -181,10 +181,18 @@
 :- func start(lexer(Tok, Src), Src) = lexer_state(Tok, Src).
 :- mode start(in(lexer), di) = uo is det.
 
-:- pred read(io__read_result(Tok), lexer_state(Tok, Src),
-            lexer_state(Tok, Src)).
+:- pred read(io__read_result(Tok),
+            lexer_state(Tok, Src), lexer_state(Tok, Src)).
 :- mode read(out, di, uo) is det.
 
+    % Calling offset_from_start/3 immediately prior to calling read/3
+    % will give the offset in chars from the start of the input stream
+    % for the result returned by the read/3 operation.
+    %
+:- pred offset_from_start(offset,
+            lexer_state(Tok, Src), lexer_state(Tok, Src)).
+:- mode offset_from_start(out, di, uo) is det.
+
     % Stop a running instance of a lexer and retrieve the input source.
     %
 :- func stop(lexer_state(_Tok, Src)) = Src.
@@ -195,11 +203,17 @@
     % provides that sort of access.
     %
 :- pred manipulate_source(pred(Src, Src),
-                lexer_state(Tok, Src), lexer_state(Tok, Src)).
+            lexer_state(Tok, Src), lexer_state(Tok, Src)).
 :- mode manipulate_source(pred(di, uo) is det, di, uo) is det.
 
-%------------------------------------------------------------------------------%
-%------------------------------------------------------------------------------%
+    % This is occasionally useful.  It reads the next char from the
+    % input stream, without attempting to match it against a lexeme.
+    %
+:- pred read_char(read_result, lexer_state(Tok, Src), lexer_state(Tok, Src)).
+:- mode read_char(out, di, uo) is det.
+
+%-----------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- implementation.
 
@@ -225,6 +239,7 @@
 :- type lexer_instance(Token, Source)
     --->    lexer_instance(
                 init_lexemes            :: list(live_lexeme(Token)),
+                init_winner_func        :: init_winner_func(Token),
                 live_lexemes            :: list(live_lexeme(Token)),
                 current_winner          :: winner(Token),
                 buf_state               :: buf_state(Source),
@@ -234,6 +249,7 @@
 :- inst lexer_instance
     --->    lexer_instance(
                 live_lexeme_list, 
+                init_winner_func, 
                 live_lexeme_list, 
                 winner, 
                 buf__buf_state,
@@ -243,10 +259,15 @@
 :- type live_lexeme(Token)
     ==      compiled_lexeme(Token).
 :- inst live_lexeme
-    ==      compiled_lexeme(token_creator).
+    ==      compiled_lexeme.
 :- inst live_lexeme_list
     ==      list__list_skel(live_lexeme).
 
+:- type init_winner_func(Token)
+    ==      ( func(offset) = winner(Token) ).
+:- inst init_winner_func
+    ==      ( func(in)     = out is det    ).
+
 
 
 :- type winner(Token)
@@ -255,19 +276,19 @@
     --->    yes(pair(token_creator, ground))
     ;       no.
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
-ignore(Tok,Tok).
+ignore(Tok, Tok).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 return(Token, _) = Token.
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 (R1 -> TC) = (re(R1) - TC).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 init(Lexemes, BufReadPred) = init(Lexemes, BufReadPred, DontIgnoreAnything) :-
     DontIgnoreAnything = ( pred(_::in) is semidet :- semidet_fail ).
@@ -277,65 +298,97 @@
  :-
     CompiledLexemes = list__map(compile_lexeme, Lexemes).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
-start(Lexer, Src) = LexerState :-
+start(Lexer, Src) = State :-
     init_lexer_instance(Lexer, Instance, Buf),
-    LexerState = args_lexer_state(Instance, Buf, Src).
+    State = args_lexer_state(Instance, Buf, Src).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- pred init_lexer_instance(lexer(Tok, Src), lexer_instance(Tok, Src), buf).
 :- mode init_lexer_instance(in(lexer), out(lexer_instance), array_uo) is det.
 
 init_lexer_instance(Lexer, Instance, Buf) :-
     buf__init(Lexer ^ lex_buf_read_pred, BufState, Buf),
-    InitLexemes = Lexer ^ lex_compiled_lexemes,
-    IgnorePred  = Lexer ^ lex_ignore_pred,
-    Instance    = lexer_instance(InitLexemes, InitLexemes, no,
-                        BufState, IgnorePred).
+    Start          = BufState ^ start_offset,
+    InitLexemes    = Lexer ^ lex_compiled_lexemes,
+    InitWinnerFunc = initial_winner_func(InitLexemes),
+    IgnorePred     = Lexer ^ lex_ignore_pred,
+    Instance       = lexer_instance(InitLexemes, InitWinnerFunc, InitLexemes,
+                            InitWinnerFunc(Start), BufState, IgnorePred).
+
+%-----------------------------------------------------------------------------%
+
+    % Lexing may *start* with a candidate winner if one of the lexemes
+    % accepts the empty string.  We pick the first such, if any, since
+    % that lexeme has priority.
+    %
+:- func initial_winner_func(list(live_lexeme(Token))) = init_winner_func(Token).
+:- mode initial_winner_func(in(live_lexeme_list)    ) = out(init_winner_func)
+            is det.
+
+initial_winner_func([]       ) =
+    ( func(_) = no ).
+
+initial_winner_func( [L | Ls]) =
+    ( if   in_accepting_state(L)
+      then ( func(Offset) = yes(L ^ token - Offset) )
+      else initial_winner_func(Ls)
+    ).
+
+%----------------------------------------------------------------------------%
 
-%------------------------------------------------------------------------------%
+offset_from_start(Offset, !State) :-
+    Offset  = !.State ^ run ^ buf_state ^ buf_cursor,
+    !:State = unsafe_promise_unique(!.State).
 
-stop(LexerState) = Src :-
-    lexer_state_args(LexerState, _Instance, _Buf, Src).
+%-----------------------------------------------------------------------------%
 
-%------------------------------------------------------------------------------%
+stop(State) = Src :-
+    lexer_state_args(State, _Instance, _Buf, Src).
 
-read(Result, LexerState0, LexerState) :-
-    lexer_state_args(LexerState0, Instance0, Buf0, Src0),
-    read_0(Result, Instance0, Instance, Buf0, Buf, Src0, Src),
-    LexerState = args_lexer_state(Instance, Buf, Src).
+%-----------------------------------------------------------------------------%
 
+read(Result, State0, State) :-
+    lexer_state_args(State0, Instance0, Buf0, Src0),
+    BufState       = Instance0 ^ buf_state,
+    Start          = BufState ^ start_offset,
+    InitWinnerFunc = Instance0 ^ init_winner_func,
+    Instance1      = ( Instance0 ^ current_winner := InitWinnerFunc(Start) ),
+    read_2(Result, Instance1, Instance, Buf0, Buf, Src0, Src),
+    State          = args_lexer_state(Instance, Buf, Src).
 
 
-:- pred read_0(io__read_result(Tok),
+
+:- pred read_2(io__read_result(Tok),
             lexer_instance(Tok, Src), lexer_instance(Tok, Src),
             buf, buf, Src, Src).
-:- mode read_0(out,
+:- mode read_2(out,
             in(lexer_instance), out(lexer_instance),
             array_di, array_uo, di, uo) is det.
 
     % Basically, just read chars from the buf and advance the live lexemes
     % until we have a winner or hit an error (no parse).
     %
-read_0(Result, Instance0, Instance, Buf0, Buf, Src0, Src) :-
+read_2(Result, !Instance, !Buf, !Src) :-
 
-    BufState0    = Instance0 ^ buf_state,
+    some [!BufState] (
+
+        !:BufState = !.Instance ^ buf_state,
+
+        buf__read(BufReadResult, !BufState, !Buf, !Src),
+        (
+            BufReadResult = ok(Char),
+            process_char(Result, Char, !Instance, !.BufState, !Buf, !Src)
+        ;
+            BufReadResult = eof,
+            process_eof(Result, !Instance, !.BufState, !.Buf)
+        )
 
-    buf__read(BufReadResult, BufState0, BufState1, Buf0, Buf1, Src0, Src1),
-    (
-        BufReadResult = ok(Char),
-        process_char(Result, Char,
-                Instance0, Instance, BufState1, Buf1, Buf, Src1, Src)
-    ;
-        BufReadResult = eof,
-        Buf = Buf1,
-        Src = Src1,
-        process_eof(Result, Instance0, Instance, BufState1, Buf)
     ).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- pred process_char(io__read_result(Tok), char,
             lexer_instance(Tok, Src), lexer_instance(Tok, Src),
@@ -343,30 +396,27 @@
 :- mode process_char(out, in, in(lexer_instance), out(lexer_instance),
             in(buf_state), array_di, array_uo, di, uo) is det.
 
-process_char(Result, Char, Instance0, Instance,
-        BufState, Buf0, Buf, Src0, Src) :-
+process_char(Result, Char, !Instance, BufState, !Buf, !Src) :-
 
-    LiveLexemes0 = Instance0 ^ live_lexemes,
-    Winner0      = Instance0 ^ current_winner,
+    LiveLexemes0 = !.Instance ^ live_lexemes,
+    Winner0      = !.Instance ^ current_winner,
 
-    advance_live_lexemes(Char, buf__cursor_offset(BufState),
+    advance_live_lexemes(Char, BufState ^ cursor_offset,
             LiveLexemes0, LiveLexemes, Winner0, Winner),
     (
         LiveLexemes = [],               % Nothing left to consider.
 
-        process_any_winner(Result, Winner, Instance0, Instance, BufState,
-                Buf0, Buf, Src0, Src)
+        process_any_winner(Result, Winner, !Instance, BufState, !Buf, !Src)
     ;
         LiveLexemes = [_ | _],          % Still some open possibilities.
 
-        Instance1 = (((Instance0
-                            ^ live_lexemes   := LiveLexemes)
-                            ^ current_winner := Winner)
-                            ^ buf_state      := BufState),
-        read_0(Result, Instance1, Instance, Buf0, Buf, Src0, Src)
+        !:Instance  = (((!.Instance ^ live_lexemes   := LiveLexemes )
+                                    ^ current_winner := Winner      )
+                                    ^ buf_state      := BufState    ),
+        read_2(Result, !Instance, !Buf, !Src)
     ).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- pred process_any_winner(io__read_result(Tok), winner(Tok),
             lexer_instance(Tok, Src), lexer_instance(Tok, Src), 
@@ -375,43 +425,38 @@
             in(lexer_instance), out(lexer_instance),
             in(buf_state), array_di, array_uo, di, uo) is det.
 
-process_any_winner(Result, yes(TokenCreator - Offset), Instance0, Instance,
-        BufState0, Buf0, Buf, Src0, Src) :-
+process_any_winner(Result, yes(TokenCreator - Offset), !Instance,
+        BufState0, !Buf, !Src) :-
 
-    BufState1 = buf__rewind_cursor(Offset, BufState0),
-    Instance1 = ((( Instance0
-                        ^ live_lexemes   := Instance0 ^ init_lexemes)
-                        ^ current_winner := no)
-                        ^ buf_state      := buf__commit(BufState1)),
-    ( if
+    BufState   = rewind_cursor(Offset, BufState0),
+    IgnorePred = !.Instance ^ ignore_pred,
 
-         get_token_from_buffer(BufState1, Buf0, Instance0, TokenCreator, Token)
+    InitWinnerFunc = !.Instance ^ init_winner_func,
+    !:Instance = ((( !.Instance ^ live_lexemes   := !.Instance ^ init_lexemes )
+                                ^ current_winner := InitWinnerFunc(Offset)    )
+                                ^ buf_state      := commit(BufState)          ),
 
+    ( if
+        get_token_from_buffer(BufState, !.Buf, TokenCreator, IgnorePred, Token)
       then
-    
-         Result   = ok(Token),
-         Instance = Instance1,
-         Buf      = Buf0,
-         Src      = Src0
-    
+        Result = ok(Token)
       else
-    
-         read_0(Result, Instance1, Instance, Buf0, Buf, Src0, Src)
+        read_2(Result, !Instance, !Buf, !Src)
     ).
 
-process_any_winner(Result, no, Instance0, Instance,
-        BufState0, Buf, Buf, Src, Src) :-
+process_any_winner(Result, no, !Instance,
+        BufState0, !Buf, !Src) :-
 
-    Start     = buf__start_offset(BufState0),
-    BufState1 = buf__rewind_cursor(Start + 1, BufState0),
-    Result    = error("input not matched by any regexp", Start),
-    Instance  = ((( Instance0
-                        ^ live_lexemes   :=
-                                Instance0 ^ init_lexemes)
-                        ^ current_winner := no)
-                        ^ buf_state      := buf__commit(BufState1)).
+    Start      = BufState0 ^ start_offset,
+    BufState   = rewind_cursor(Start + 1, BufState0),
+    Result     = error("input not matched by any regexp", Start),
+
+    InitWinnerFunc = !.Instance ^ init_winner_func,
+    !:Instance = ((( !.Instance ^ live_lexemes   := !.Instance ^ init_lexemes )
+                                ^ current_winner := InitWinnerFunc(Start)     )
+                                ^ buf_state      := commit(BufState)          ).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- pred process_eof(io__read_result(Tok),
             lexer_instance(Tok, Src), lexer_instance(Tok, Src),
@@ -419,35 +464,42 @@
 :- mode process_eof(out, in(lexer_instance), out(lexer_instance),
             in(buf_state), array_ui) is det.
 
-process_eof(Result, Instance0, Instance, BufState, Buf) :-
+process_eof(Result, Instance0, Instance, BufState0, Buf) :-
 
-    Result   =
-        ( if
-            live_lexeme_in_accepting_state(Instance0 ^ live_lexemes,
-                        TokenCreator),
-            get_token_from_buffer(BufState, Buf, Instance0,
-                        TokenCreator, Token)
-          then ok(Token)
-          else eof
-        ),
-    Instance = ((Instance0
-                        ^ live_lexemes := [])
-                        ^ buf_state    := buf__commit(BufState)).
-
-%------------------------------------------------------------------------------%
-
-:- pred get_token_from_buffer(buf_state(Src), buf, lexer_instance(Tok, Src),
-                  token_creator(Tok), Tok).
-:- mode get_token_from_buffer(in(buf_state), array_ui, in(lexer_instance),
-                  in(token_creator), out) is semidet.
-
-get_token_from_buffer(BufState, Buf, Instance, TokenCreator, Token) :-
-    Match      = buf__string_to_cursor(BufState, Buf),
-    Token      = TokenCreator(Match),
-    IgnorePred = Instance ^ ignore_pred,
+    ( if
+        Instance0 ^ current_winner = yes(TokenCreator - Offset0),
+
+        IgnorePred = Instance0 ^ ignore_pred,
+        BufState1  = rewind_cursor(Offset0, BufState0),
+
+        get_token_from_buffer(BufState1, Buf, TokenCreator, IgnorePred, Token)
+      then
+        Offset     = Offset0,
+        BufState   = BufState1,
+        Result     = ok(Token)
+      else
+        Offset     = BufState0 ^ start_offset,
+        BufState   = BufState0,
+        Result     = eof
+    ),
+
+    InitWinnerFunc = Instance0 ^ init_winner_func,
+    Instance   = ((( Instance0 ^ live_lexemes   := Instance0 ^ init_lexemes )
+                               ^ current_winner := InitWinnerFunc(Offset)   )
+                               ^ buf_state      := commit(BufState)         ).
+
+%-----------------------------------------------------------------------------%
+
+:- pred get_token_from_buffer(buf_state(Src), buf,
+            token_creator(Tok), ignore_pred(Tok), Tok).
+:- mode get_token_from_buffer(in(buf_state), array_ui,
+            in(token_creator), in(ignore_pred), out) is semidet.
+
+get_token_from_buffer(BufState, Buf, TokenCreator, IgnorePred, Token) :-
+    Token = TokenCreator(string_to_cursor(BufState, Buf)),
     not IgnorePred(Token).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
     % Note that in the case where two or more lexemes match the same
     % string, the win is given to the earliest such lexeme in the list.
@@ -460,35 +512,32 @@
             out(live_lexeme_list), 
             in(winner), out(winner)) is det.
 
-advance_live_lexemes(_Char, _Offset, [], [], Winner, Winner).
+advance_live_lexemes(_Char, _Offset, [], [], !Winner).
 
-advance_live_lexemes(Char, Offset, [L0 | Ls0], Ls, Winner0, Winner) :-
+advance_live_lexemes(Char, Offset, [L | Ls0], Ls, !Winner) :-
 
-    State0        = L0 ^ state,
-    ATok          = L0 ^ token,
+    State0        = L ^ state,
 
-    ( if next_state(L0, State0, Char, State, IsAccepting) then
+    ( if next_state(L, State0, Char, State, IsAccepting) then
 
         (
-            IsAccepting = no,
-            Winner1     = Winner0
+            IsAccepting = no
         ;
             IsAccepting = yes,
-            Winner1     = ( if   Winner0 = yes(_ATok0 - Offset0),
-                                 Offset  = Offset0
-                            then Winner0
-                            else yes(ATok - Offset)
+            !:Winner    = ( if   !.Winner = yes(_ - Offset)
+                            then !.Winner
+                            else yes(L ^ token - Offset)
                           )
         ),
-        advance_live_lexemes(Char, Offset, Ls0, Ls1, Winner1, Winner),
-        Ls = [( L0 ^ state := State ) | Ls1]
+        advance_live_lexemes(Char, Offset, Ls0, Ls1, !Winner),
+        Ls = [( L ^ state := State ) | Ls1]
 
       else
 
-        advance_live_lexemes(Char, Offset, Ls0, Ls, Winner0, Winner)
+        advance_live_lexemes(Char, Offset, Ls0, Ls, !Winner)
     ).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- pred live_lexeme_in_accepting_state(list(live_lexeme(Tok)),
                 token_creator(Tok)).
@@ -502,8 +551,8 @@
     ).
 
 
-%------------------------------------------------------------------------------%
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
     % It's much more convenient (especially for integration with, e.g.
     % parsers such as moose) to package up the lexer_instance, buf
@@ -516,7 +565,7 @@
                 src                     :: Src
             ).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- func args_lexer_state(lexer_instance(Tok, Src), buf, Src) =
             lexer_state(Tok, Src).
@@ -525,23 +574,41 @@
 args_lexer_state(Instance, Buf, Src) = LexerState :-
     unsafe_promise_unique(lexer_state(Instance, Buf, Src), LexerState).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
-:- pred lexer_state_args(lexer_state(Tok,Src),lexer_instance(Tok,Src),buf,Src).
-:- mode lexer_state_args(di, out(lexer_instance), array_uo, uo)  is det.
+:- pred lexer_state_args(lexer_state(Tok, Src), lexer_instance(Tok, Src),
+            buf, Src).
+:- mode lexer_state_args(di, out(lexer_instance),
+            array_uo, uo)  is det.
 
 lexer_state_args(lexer_state(Instance, Buf0, Src0), Instance, Buf, Src) :-
     unsafe_promise_unique(Buf0, Buf),
     unsafe_promise_unique(Src0, Src).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
-manipulate_source(P, State0, State) :-
-    lexer_state_args(State0, Instance, Buf, Src0),
+manipulate_source(P, !State) :-
+    lexer_state_args(!.State, Instance, Buf, Src0),
     P(Src0, Src),
-    State = args_lexer_state(Instance, Buf, Src).
+    !:State = args_lexer_state(Instance, Buf, Src).
 
-%------------------------------------------------------------------------------%
+%----------------------------------------------------------------------------%
+
+read_char(Result, !State) :-
+
+    some [!Instance, !Buf, !Src, !BufState] (
+
+        lexer_state_args(!.State, !:Instance, !:Buf, !:Src),
+
+        !:BufState = !.Instance ^ buf_state,
+        buf__read(Result, !BufState, !Buf, !Src),
+        !:Instance = !.Instance ^ buf_state := commit(!.BufState),
+
+        !:State = args_lexer_state(!.Instance, !.Buf, !.Src)
+
+    ).
+
+%-----------------------------------------------------------------------------%
 
 read_from_stdin(_Offset, Result) -->
     io__read_char(IOResult),
@@ -550,25 +617,29 @@
     ;   IOResult = error(_E),             throw(IOResult)
     }.
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
+    % XXX This is bad for long strings!  We should cache the string
+    % length somewhere rather than recomputing it each time we read
+    % a char.
+    %
 read_from_string(Offset, Result, String, unsafe_promise_unique(String)) :-
     ( if   Offset < string__length(String)
       then Result = ok(string__unsafe_index(String, Offset))
       else Result = eof
     ).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 % The type of regular expressions.
 
 :- type regexp
     --->    eps                    % The empty regexp
     ;       atom(char)             % Match a single char
-    ;       conc(regexp,regexp)    % Concatenation
+    ;       conc(regexp, regexp)   % Concatenation
     ;       alt(regexp, regexp)    % Alternation
     ;       star(regexp).          % Kleene closure
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- instance regexp(regexp) where [
     re(RE) = RE
@@ -589,7 +660,7 @@
         )
 ].
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 % Basic primitive regexps.
 
  null      = eps.
@@ -598,7 +669,7 @@
 (R1 or R2) = alt(re(R1), re(R2)).
  *(R1)     = star(re(R1)).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 % Some basic non-primitive regexps.
 
 any(S) = R :-
@@ -633,7 +704,7 @@
 
 +(R) = (R ++ *(R)).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 % Some useful single-char regexps.
 
     % We invite the compiler to memo the values of these constants that
@@ -659,7 +730,7 @@
 tab        = re('\t').
 spc        = re(' ').
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 % Some useful compound regexps.
 
 nat        = +(digit).
@@ -672,5 +743,5 @@
 whitespace = *(wspc).
 junk       = *(dot).
 
-%------------------------------------------------------------------------------%
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
Index: lex.regexp.m
===================================================================
RCS file: /home/mercury1/repository/mercury/extras/lex/lex.regexp.m,v
retrieving revision 1.2
diff -u -r1.2 lex.regexp.m
--- lex.regexp.m	4 Oct 2001 07:46:04 -0000	1.2
+++ lex.regexp.m	22 Nov 2002 00:14:53 -0000
@@ -1,4 +1,4 @@
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 % vim: ts=4 sw=4 et tw=0 wm=0 ff=unix
 %
 % lex.regexp.m
@@ -16,7 +16,7 @@
 % Converts basic regular expressions into non-deterministic finite
 % automata (NFAs).
 %
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- module lex__regexp.
 
@@ -35,14 +35,14 @@
 :- func remove_null_transitions(state_mc) = state_mc.
 :- mode remove_null_transitions(in) = out(null_transition_free_state_mc) is det.
 
-%------------------------------------------------------------------------------%
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- implementation.
 
 :- import_module counter, map, assoc_list, std_util, list, set, string.
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 regexp_to_NFA(R) = NFA :-
     C0 = counter__init(0),
@@ -51,7 +51,7 @@
     compile(Start, R, Stop, Transitions, C, _),
     NFA = state_mc(Start, set__make_singleton_set(Stop), Transitions).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- pred compile(state_no, regexp, state_no, transitions, counter, counter).
 :- mode compile(in, in, in, out, in, out) is det.
@@ -75,7 +75,7 @@
     compile(X, null, Y, TsA),
     compile(X, R,    X, TsB).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
     % If we have a non-looping null transition from X to Y then
     % we need to add all the transitions from Y to X.
@@ -105,7 +105,7 @@
                 ^ smc_state_transitions := NullFreeTs )
                 ^ smc_stop_states       := StopStates).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- pred split_transitions(transitions, transitions, transitions).
 :- mode split_transitions(in, out(null_transitions), out(atom_transitions)).
@@ -118,7 +118,7 @@
 split_transitions([trans(X, C, Y) | Ts], NTs, [trans(X, C, Y) | CTs]) :-
     split_transitions(Ts, NTs, CTs).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- type null_map == map(state_no, set(state_no)).
 
@@ -128,7 +128,7 @@
 trans_closure(Ts, Ins0, Ins, Outs0, Outs) :-
     list__foldl2(add_edge, Ts, Ins0, Ins, Outs0, Outs).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- pred add_edge(transition, null_map, null_map, null_map, null_map).
 :- mode add_edge(in(null_transition), in, out, in, out) is det.
@@ -141,7 +141,7 @@
     Outs = list__foldl(add_to_null_mapping(YOutAndY), Xs, Outs0),
     Ins  = list__foldl(add_to_null_mapping(XInAndX),  Ys, Ins0).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- func null_map_lookup(state_no, null_map) = set(state_no).
 
@@ -150,14 +150,14 @@
                                  else set__init
     ).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- func add_to_null_mapping(set(state_no), state_no, null_map) = null_map.
 
 add_to_null_mapping(Xs, Y, Map) =
     map__set(Map, Y, Xs `set__union` null_map_lookup(Y, Map)).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- func add_atom_transitions(null_map, transitions) = transitions.
 :- mode add_atom_transitions(in, in(atom_transitions)) =
@@ -175,7 +175,7 @@
         )
     ).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- func add_atom_transitions_0(transitions, pair(state_no, set(state_no))) =
             transitions.
@@ -187,7 +187,7 @@
         list__map(add_atom_transitions_1(CTs, X), set__to_sorted_list(Ys))
     ).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- func add_atom_transitions_1(transitions, state_no, state_no) = transitions.
 :- mode add_atom_transitions_1(in(atom_transitions), in, in) =
@@ -196,7 +196,7 @@
 add_atom_transitions_1(CTs0, X, Y) = CTs :-
     list__filter_map(maybe_copy_transition(X, Y), CTs0, CTs).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- pred maybe_copy_transition(state_no, state_no, transition, transition).
 :- mode maybe_copy_transition(in,in,in(atom_transition),out(atom_transition))
@@ -204,7 +204,7 @@
 
 maybe_copy_transition(X, Y, trans(Y, C, Z), trans(X, C, Z)).
 
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
 
 :- func nulls_to_stop_state(null_map, set(state_no), transition) = state_no.
 :- mode nulls_to_stop_state(in, in, in) = out is semidet.
@@ -215,5 +215,5 @@
         set__member(Z, StopStates)
     ).
 
-%------------------------------------------------------------------------------%
-%------------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
Index: regex.m
===================================================================
RCS file: regex.m
diff -N regex.m
--- /dev/null	1 Jan 1970 00:00:00 -0000
+++ regex.m	22 Nov 2002 05:36:20 -0000
@@ -0,0 +1,566 @@
+%-----------------------------------------------------------------------------%
+% regex.m
+% Ralph Becket <rafe at cs.mu.oz.au>
+% Tue Nov 19 13:01:52 EST 2002
+% vim: ft=mercury ts=4 sw=4 et wm=0 tw=0
+%
+% TODO
+% - Add <regex>{n[,m]} regexps.
+% - Add character classes (e.g. [:space:]) to sets.
+% - Add chomp function to string.m.
+%
+%-----------------------------------------------------------------------------%
+
+:- module regex.
+
+:- interface.
+
+:- import_module string, list.
+:- import_module lex.
+
+    % The type of (compiled) regular expressions.
+    %
+:- type regex.
+:- inst regex == lexer.
+
+    % A function for converting a POSIX style regex string to a regex
+    % suitable for the string matching operations provided in this
+    % module.
+    %
+    % An exception is thrown if the regex string is malformed.  We
+    % memoize this function for efficiency (it is cheaper to look up a
+    % string in a hash table than to parse it and recompute the regex.)
+    %
+    % A regex string obeys the following grammar
+    % (concatenation of <char> takes highest precedence,
+    % otherwise concatenation has lowest precedence):
+    %
+    % <regex> ::= <char>                % Single char
+    %           |  <regex><regex>       % Concatenation
+    %           |  .                    % Any char but \n
+    %           |  <set>                % Any char in set
+    %           |  <regex>|<regex>      % Alternation
+    %           |  <regex>*             % Kleene closure (zero or more)
+    %           |  <regex>+             % One or more occurrences
+    %           |  <regex>?             % Zero or one occurrence
+    %           |  (<regex>)
+    %
+    % (Note the need to use double-backslashes in Mercury strings.
+    % The following chars must be escaped if they are intended
+    % literally: .|*+?()[]\.  Escapes should not be used in sets.)
+    %
+    % <char>   ::= \<any char>          % Literal char used in regexes
+    %           |  <ordinary char>      % All others
+    %
+    % <escaped char> ::= . | | | * | + | ? | ( | ) | [ | ] | \
+    %
+    % (If the first char in a <set'> is ] then it is taken as part of
+    % the char set and not the closing bracket.  Similarly, ] may appear
+    % as the end char in a range and it will not be taken as the closing
+    % bracket.)
+    %
+    % <set>    ::= [^<set'>]            % Any char not in <set'> or '\n'
+    %           |  [<set'>]             % Any char in <set'>
+    %
+    % <set'>   ::= <any char>-<any char>% Any char in range
+    %           |  <any char>           % Literal char
+    %           |  <set'><set'>         % Set union
+    %
+:- func regex(string) = regex.
+:- mode regex(in    ) = out(regex) is det.
+
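+    % As a hedged usage sketch (illustrative only, not part of this
+    % change), a pattern string is compiled once via regex/1 (which is
+    % memoized) and the result reused; the helper name below is invented:
+    %
+    %     :- func digits = regex.
+    %     :- mode digits = out(regex) is det.
+    %
+    %     digits = regex("[0-9]+").
+    %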
+    % This is a utility function for lex - it compiles a string into a
+    % regexp (not a regex, which is for use with this module) suitable
+    % for use in lexemes.
+    %
+    % We memoize this function for efficiency (it is cheaper to look up a
+    % string in a hash table than to parse it and recompute the regexp.)
+    %
+:- func regexp(string) = regexp.
+
+    % left_match(Regex, String, Substring, Start, Count)
+    %   succeeds iff Regex maximally matches the first Count characters
+    %   of String.
+    %
+    %   This is equivalent to the goal
+    %
+    %       {Substring, Start, Count} = head(matches(Regex, String)),
+    %       Start = 0
+    %
+:- pred left_match(regex,     string, string, int, int).
+:- mode left_match(in(regex), in,     out,    out, out) is semidet.
+
+    % right_match(Regex, String, Substring, Start, Count)
+    %   succeeds iff Regex maximally matches the last Count characters
+    %   of String.
+    %
+    %   This is equivalent to the goal
+    %
+    %       {Substring, Start, Count} = last(matches(Regex, String)),
+    %       Start + Count = length(String)
+    %
+:- pred right_match(regex,     string, string, int, int).
+:- mode right_match(in(regex), in,     out,    out, out) is semidet.
+
+    % first_match(Regex, String, Substring, Start, Count)
+    %   succeeds iff Regex matches some Substring of String,
+    %   setting Substring, Start and Count to the maximal first
+    %   such occurrence.
+    %
+    %   This is equivalent to the goal
+    %
+    %       {Substring, Start, Count} = head(matches(Regex, String))
+    %
+:- pred first_match(regex,     string, string, int, int).
+:- mode first_match(in(regex), in,     out,    out, out) is semidet.
+
+    % exact_match(Regex, String)
+    %   succeeds iff Regex exactly matches String.
+    %
+:- pred exact_match(regex,     string).
+:- mode exact_match(in(regex), in    ) is semidet.
+
+    % matches(Regex, String) = [{Substring, Start, Count}, ...]
+    %   Regex exactly matches Substring = substring(String, Start, Count).
+    %   The {Start, Count} regions do not overlap and are listed in
+    %   ascending order of Start.
+    %
+:- func matches(regex,     string) = list({string, int, int}).
+:- mode matches(in(regex), in    ) = out is det.
+
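+    % To make the matches/2 contract concrete, an illustrative example
+    % (expected values inferred from the documentation above, not
+    % produced by running this code): one would expect
+    %
+    %     matches(regex("[0-9]+"), "ab12cd345")
+    %
+    % to return [{"12", 2, 2}, {"345", 6, 3}]: maximal, non-overlapping
+    % matches, ascending in Start (0-based offsets).
+    %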
+    % replace_first(Regex, Replacement, String)
+    %   computes the string formed by replacing the maximal first match
+    %   of Regex (if any) in String with Replacement.
+    %
+:- func replace_first(regex,     string, string) = string.
+:- mode replace_first(in(regex), in,     in    ) = out is det.
+
+    % replace_all(Regex, Replacement, String)
+    %   computes the string formed by replacing the maximal non-overlapping
+    %   matches of Regex in String with Replacement.
+    %
+:- func replace_all(regex,     string, string) = string.
+:- mode replace_all(in(regex), in,     in    ) = out is det.
+
+    % change_first(Regex, ChangeFn, String)
+    %   computes the string formed by replacing the maximal first match
+    %   of Regex (Substring, if any) in String with ChangeFn(Substring).
+    %
+:- func change_first(regex,     func(string) = string,     string) = string.
+:- mode change_first(in(regex), func(in    ) = out is det, in    ) = out is det.
+
+    % change_all(Regex, ChangeFn, String)
+    %   computes the string formed by replacing the maximal non-overlapping
+    %   matches of Regex, Substring, in String with ChangeFn(Substring).
+    %
+:- func change_all(regex,     func(string) = string,     string) = string.
+:- mode change_all(in(regex), func(in    ) = out is det, in    ) = out is det.
+
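+    % A sketch of the replace/change distinction, assuming behaviour
+    % follows the comments above (illustrative only; the lambda is
+    % invented for the example):
+    %
+    %     replace_all(regex("[0-9]+"), "N", "ab12cd345") = "abNcdN"
+    %
+    %     change_all(regex("[0-9]+"),
+    %         (func(S) = "<" ++ S ++ ">"), "ab12cd345")
+    %             = "ab<12>cd<345>"
+    %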
+%-----------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
+
+:- implementation.
+
+:- import_module int, char, bool, require, std_util, io.
+
+
+
+:- type regex == lexer(string, string).
+
+:- type lexer_state == lexer_state(string, string).
+
+
+
+    % The states of the regex parser.
+    %
+:- type parser_state
+    --->    res(list(re))                        % Parsing as usual.
+    ;       esc(list(re))                        % A character escape.
+    ;       set1(list(re))                       % The char set parsing states.
+    ;       set2(bool, list(re))
+    ;       set3(bool, char, chars, list(re))
+    ;       set4(bool, char, chars, list(re))
+    ;       set5(bool, chars, list(re)).
+
+    % The possible semi-parsed regexes.
+    %
+:- type re
+    --->    re(regexp)                      % An ordinary regex.
+    ;       chars(regexp)                   % A sequence of chars.
+    ;       lpar                            % A left parenthesis.
+    ;       alt.                            % An alternation.
+
+:- type chars == list(char).
+
+%-----------------------------------------------------------------------------%
+
+:- pragma memo(regex/1).
+
+regex(S) = init([regexp(S) - id], read_from_string).
+
+%-----------------------------------------------------------------------------%
+
+:- pragma memo(regexp/1).
+
+regexp(S) = finish_regex(S, foldl(compile_regex(S), S, res([]))).
+
+%-----------------------------------------------------------------------------%
+
+:- func compile_regex(string, char, parser_state) = parser_state.
+
+    % res: we are looking for the next regex or operator.
+    %
+compile_regex(S, C, res(REs)) =
+    (      if C = ('.')  then res([re(dot) | REs])
+      else if C = ('|')  then res([alt | REs])
+      else if C = ('*')  then res(star(S, REs))
+      else if C = ('+')  then res(plus(S, REs))
+      else if C = ('?')  then res(opt(S, REs))
+      else if C = ('(')  then res([lpar | REs])
+      else if C = (')')  then res(rpar(S, REs))
+      else if C = ('[')  then set1(REs)
+      else if C = (']')  then regex_error("`]' without opening `['", S)
+      else if C = ('\\') then esc(REs)
+      else                    res(push_char(C, REs))
+    ).
+
+    % esc: the current char has been \ escaped.
+    %
+compile_regex(_, C, esc(REs)) =
+    res(push_char(C, REs)).
+
+    % set1: we have just seen the opening [.
+    %
+compile_regex(_, C, set1(REs)) =
+    (      if C = ('^') then set2(yes, REs)
+      else                   set3(no,  C, [], REs)
+    ).
+
+    % set2: we are looking for the first char in the set, which may
+    % include ].
+    %
+compile_regex(_, C, set2(Complement, REs)) =
+                             set3(Complement, C, [], REs).
+
+    % set3: we are looking for a char or - or ].
+    %
+compile_regex(_, C, set3(Complement, C0, Cs, REs)) =
+    (      if C = (']') then res([char_set(Complement, [C0 | Cs]) | REs])
+      else if C = ('-') then set4(Complement, C0, Cs, REs)
+      else                   set3(Complement, C, [C0 | Cs], REs)
+    ).
+
+    % set4: we have just seen a `-' for a range.
+    %
+compile_regex(_, C, set4(Complement, C0, Cs, REs)) =
+                             set5(Complement, push_range(C0, C, Cs), REs).
+
+    % set5: we are looking for a char or ].
+    %
+compile_regex(_, C, set5(Complement, Cs, REs)) =
+    (      if C = (']') then res([char_set(Complement, Cs) | REs])
+      else                   set3(Complement, C, Cs, REs)
+    ).
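+
+    % Since the set states are easy to lose track of, here is a worked
+    % trace (illustrative only) of how the clauses above consume the
+    % pattern "[a-c]":
+    %
+    %     '['  : res(REs)               -> set1(REs)
+    %     'a'  : set1(REs)              -> set3(no, 'a', [], REs)
+    %     '-'  : set3(no, 'a', [], REs) -> set4(no, 'a', [], REs)
+    %     'c'  : set4(no, 'a', [], REs) -> set5(no, ['a', 'b', 'c'], REs)
+    %     ']'  : set5(no, Cs, REs)      -> res([char_set(no, Cs) | REs])
+    %
+    % push_range('a', 'c', []) expands the range to ['a', 'b', 'c'].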
+
+%-----------------------------------------------------------------------------%
+
+    % Turn a list of chars into an any or anybut.
+    %
+:- func char_set(bool, chars) = re.
+
+char_set(no,  Cs) = re(any(from_char_list(Cs))).
+char_set(yes, Cs) = re(anybut(from_char_list([('\n') | Cs]))).
+
+%-----------------------------------------------------------------------------%
+
+    % Push a range of chars onto a char list.
+    %
+:- func push_range(char, char, chars) = chars.
+
+push_range(A, B, Cs) = Rg ++ Cs :-
+    Lo = min(to_int(A), to_int(B)),
+    Hi = max(to_int(A), to_int(B)),
+    Rg = map(int_to_char, Lo `..` Hi).
+
+
+:- func int_to_char(int) = char.
+
+int_to_char(X) =
+    ( if char__to_int(C, X) then C else func_error("regex__int_to_char") ).
+
+%-----------------------------------------------------------------------------%
+
+:- func finish_regex(string, parser_state) = regexp.
+
+finish_regex(S, esc(_)) =
+    regex_error("expected char after `\\'", S).
+
+finish_regex(S, set1(_)) =
+    regex_error("`[' without closing `]'", S).
+
+finish_regex(S, set2(_, _)) =
+    regex_error("`[' without closing `]'", S).
+
+finish_regex(S, set3(_, _, _, _)) =
+    regex_error("`[' without closing `]'", S).
+
+finish_regex(S, set4(_, _, _, _)) =
+    regex_error("`[' without closing `]'", S).
+
+finish_regex(S, set5(_, _, _)) =
+    regex_error("`[' without closing `]'", S).
+
+finish_regex(S, res(REs)) =
+    ( if   rpar(S, REs ++ [lpar]) = [re(RE)]
+      then RE
+      else regex_error("`(' without closing `)'", S)
+    ).
+
+%-----------------------------------------------------------------------------%
+
+    % Push a char regex.
+    %
+:- func push_char(char, list(re)) = list(re).
+
+push_char(C, REs) =
+    ( if   REs = [chars(Cs) | REs0]
+      then [chars(Cs ++ C)  | REs0]
+      else [chars(re(C))    | REs ]
+    ).
+
+%-----------------------------------------------------------------------------%
+
+    % The *, + and ? regexes.
+    %
+:- func star(string, list(re)) = list(re).
+
+star(S, REs) =
+    ( if   ( REs = [re(RE) | REs0] ; REs = [chars(RE) | REs0] )
+      then [re(*(RE)) | REs0]
+      else regex_error("`*' without preceding regex", S)
+    ).
+
+:- func plus(string, list(re)) = list(re).
+
+plus(S, REs) =
+    ( if   ( REs = [re(RE) | REs0] ; REs = [chars(RE) | REs0] )
+      then [re(+(RE)) | REs0]
+      else regex_error("`+' without preceding regex", S)
+    ).
+
+:- func opt(string, list(re)) = list(re).
+
+opt(S, REs) =
+    ( if   ( REs = [re(RE) | REs0] ; REs = [chars(RE) | REs0] )
+      then [re(?(RE)) | REs0]
+      else regex_error("`?' without preceding regex", S)
+    ).
+
+%-----------------------------------------------------------------------------%
+
+    % Handle a closing parenthesis.
+    %
+:- func rpar(string, list(re)) = list(re).
+
+rpar(S, REs) =
+    (      if REs = [alt, lpar               | REs0]
+      then    [nil                           | REs0]
+
+      else if REs = [RE_A, alt, lpar         | REs0]
+      then    [alt(nil, RE_A)                | REs0]
+
+      else if REs = [RE_A, alt, RE_B         | REs0]
+      then    rpar(S, [alt(RE_B, RE_A)       | REs0])
+
+      else if REs = [lpar                    | REs0]
+      then    [nil                           | REs0]
+
+      else if REs = [RE, lpar                | REs0]
+      then    [RE                            | REs0]
+
+      else if REs = [RE_A, RE_B              | REs0]
+      then    rpar(S, [concat(RE_B, RE_A)    | REs0])
+
+      else    regex_error("`)' without opening `('", S)
+    ).
+
+%-----------------------------------------------------------------------------%
+
+    % Handle the alternation of two res.
+    %
+:- func alt(re, re) = re.
+
+alt(A, B) = re(extract_regex(A) or extract_regex(B)).
+
+    % Handle the concatenation of two res.
+    %
+:- func concat(re, re) = re.
+
+concat(A, B) = re(extract_regex(A) ++ extract_regex(B)).
+
+
+:- func extract_regex(re) = regexp.
+
+extract_regex(re(A))    = A.
+extract_regex(chars(A)) = A.
+extract_regex(alt)      = func_error("regex__extract_regex").
+extract_regex(lpar)     = func_error("regex__extract_regex").
+
+%-----------------------------------------------------------------------------%
+
+    % Throw a wobbly.
+    %
+:- func regex_error(string, string) = _.
+:- mode regex_error(in, in) = out is erroneous.
+
+regex_error(Msg, String) =
+    func_error("regex: " ++ Msg ++ " in \"" ++ String ++ "\"").
+
+%-----------------------------------------------------------------------------%
+
+    % The empty regex.
+    %
+:- func nil = re.
+
+nil = re(re("")).
+
+%-----------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
+
+left_match(Regex, String, Substring, 0, length(Substring)) :-
+    State = start(Regex, unsafe_promise_unique(String)),
+    lex__read(ok(Substring), State, _).
+
+%-----------------------------------------------------------------------------%
+
+    % We have to keep trying successive suffixes of String until
+    % we find a complete match.
+    %
+right_match(Regex, String, Substring, Start, Count) :-
+    right_match_2(Regex, String, 0, length(String), Substring, Start, Count).
+
+
+:- pred right_match_2(regex,     string, int, int, string, int, int).
+:- mode right_match_2(in(regex), in,     in,  in,  out,    out, out) is semidet.
+
+right_match_2(Regex, String, I, Length, Substring, Start, Count) :-
+    I =< Length,
+    Substring0 = substring(String, I, Length),
+    ( if exact_match(Regex, Substring0) then
+        Substring = Substring0,
+        Start     = I,
+        Count     = Length
+      else
+        right_match_2(Regex, String, I + 1, Length - 1, Substring, Start, Count)
+    ).
+
+%-----------------------------------------------------------------------------%
+
+first_match(Regex, String, Substring, Start, length(Substring)) :-
+    State = start(Regex, unsafe_promise_unique(String)),
+    first_match_2(Substring, Start, State).
+
+
+:- pred first_match_2(string, int, lexer_state).
+:- mode first_match_2(out,    out, di         ) is semidet.
+
+first_match_2(Substring, Start, !.State) :-
+    offset_from_start(Start0, !State),
+    lex__read(Result,         !State),
+    (
+        Result = error(_, _),
+        first_match_2(Substring, Start, !.State)
+    ;
+        Result = ok(Substring),
+        Start  = Start0
+    ).
+
+%-----------------------------------------------------------------------------%
+
+exact_match(Regex, String) :-
+    State = start(Regex, unsafe_promise_unique(String)),
+    lex__read(ok(String), State, _).
+
+%-----------------------------------------------------------------------------%
+
+matches(Regex, String) = Matches :-
+    State   = start(Regex, unsafe_promise_unique(String)),
+    Matches = matches_2(State).
+
+
+:- func matches_2(lexer_state) = list({string, int, int}).
+:- mode matches_2(di)          = out is det.
+
+matches_2(State0) = Matches :-
+    offset_from_start(Start0, State0, State1),
+    lex__read(Result, State1, State2),
+    (
+        Result  = eof,
+        Matches = []
+    ;
+        Result  = error(_, _),
+        Matches = matches_2(State2)
+    ;
+        Result  = ok(Substring),
+        Start   = Start0,
+        Count   = length(Substring),
+
+            % If we matched the empty string then we have to advance
+            % at least one char.  Finish if we get eof.
+            %
+        Matches =
+            [ {Substring, Start, Count} |
+              ( if Count = 0 then
+                  ( if lex__read_char(ok(_), State2, State3) then
+                      matches_2(State3)
+                    else
+                      []
+                  )
+                else
+                  matches_2(State2)
+              )
+            ]
+    ).
+
+%-----------------------------------------------------------------------------%
+
+replace_first(Regex, Replacement, String) =
+    change_first(Regex, func(_) = Replacement, String).
+
+%-----------------------------------------------------------------------------%
+
+replace_all(Regex, Replacement, String) =
+    change_all(Regex, func(_) = Replacement, String).
+
+%-----------------------------------------------------------------------------%
+
+change_first(Regex, ChangeFn, String) =
+    ( if first_match(Regex, String, Substring, Start, Count) then
+        append_list([
+            substring(String, 0, Start),
+            ChangeFn(Substring),
+            substring(String, Start + Count, max_int)
+        ])
+      else
+        String
+    ).
+
+%-----------------------------------------------------------------------------%
+
+change_all(Regex, ChangeFn, String) =
+    append_list(change_all_2(String, ChangeFn, 0, matches(Regex, String))).
+
+
+:- func change_all_2(string, func(string) = string, int,
+            list({string, int, int})) = list(string).
+
+change_all_2(String, _ChangeFn, I, []) =
+    [ substring(String, I, max_int) ].
+
+change_all_2(String, ChangeFn, I, [{Substring, Start, Count} | Matches]) =
+    [ substring(String, I, Start - I),
+      ChangeFn(Substring)
+    | change_all_2(String, ChangeFn, Start + Count, Matches) ].
+
+%-----------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
Index: test_regex.m
===================================================================
RCS file: test_regex.m
diff -N test_regex.m
--- /dev/null	1 Jan 1970 00:00:00 -0000
+++ test_regex.m	22 Nov 2002 05:44:14 -0000
@@ -0,0 +1,94 @@
+%-----------------------------------------------------------------------------%
+% test_regex.m
+% Ralph Becket <rafe at cs.mu.oz.au>
+% Thu Nov 21 15:33:48 EST 2002
+% vim: ft=mercury ts=4 sw=4 et wm=0 tw=0
+%
+%-----------------------------------------------------------------------------%
+
+:- module test_regex.
+
+:- interface.
+
+:- import_module io.
+
+
+
+:- pred main(io::di, io::uo) is det.
+
+%-----------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%
+
+:- implementation.
+
+:- import_module int, string, list.
+:- import_module lex, regex.
+
+%-----------------------------------------------------------------------------%
+
+main(!IO) :-
+    S = "(xy?ab+[0-9]*)|[aeiouw-z]",
+    io__format("parsing against regex(\"%s\")\n", [s(S)], !IO),
+    loop(regex(S), !IO).
+
+:- pred loop(regex, io, io).
+:- mode loop(in(regex), di, uo) is det.
+
+loop(R, !IO) :-
+    io__format("\n> ", [], !IO),
+    io__read_line_as_string(Res, !IO),
+    (
+        Res = eof
+    ;
+        Res = error(_),
+        io__format("*** error: ", [], !IO),
+        io__print(Res, !IO),
+        io__format(" ***\n", [], !IO)
+    ;
+        Res = ok(S0),
+        S   = chomp(S0),
+        ( if M = matches(R, S), M \= [] then
+            io__format("all matches             : ", [], !IO),
+            io__print(matches(R, S), !IO),
+            io__nl(!IO),
+
+            io__format("replace_first with `<>' : \"%s\"\n",
+                [s(replace_first(R, "<>", S))], !IO),
+
+            io__format("replace_all with `<>'   : \"%s\"\n",
+                [s(replace_all(R, "<>", S))], !IO),
+
+            ChgFn = (func(Str) = append_list(["<", Str, ">"])),
+
+            io__format("change_first to `<&>'   : \"%s\"\n",
+                [s(change_first(R, ChgFn, S))], !IO),
+
+            io__format("change_all to `<&>'     : \"%s\"\n",
+                [s(change_all(R, ChgFn, S))], !IO)
+
+          else true
+        ),
+        ( if left_match(R, S, LSub, LS, LC) then
+            io__format("left match              : {\"%s\", %d, %d}\n",
+                    [s(LSub), i(LS), i(LC)], !IO)
+          else true
+        ),
+        ( if right_match(R, S, RSub, RS, RC) then
+            io__format("right match             : {\"%s\", %d, %d}\n",
+                    [s(RSub), i(RS), i(RC)], !IO)
+          else true
+        ),
+        ( if first_match(R, S, FSub, FS, FC) then
+            io__format("first match             : {\"%s\", %d, %d}\n",
+                    [s(FSub), i(FS), i(FC)], !IO)
+          else true
+        ),
+        loop(R, !IO)
+    ).
+
+:- func chomp(string) = string.
+
+chomp(S) = ( if string__remove_suffix(S, "\n", T) then T else S ).
+
+%-----------------------------------------------------------------------------%
+%-----------------------------------------------------------------------------%