[m-dev.] regular expressions

Fergus Henderson fjh at cs.mu.OZ.AU
Sat Dec 18 13:24:51 AEDT 1999


On 17-Dec-1999, Michael Day <mcda at cat.cs.mu.OZ.AU> wrote:
> 
> > letter --> 'a' .. 'z' ; 'A' .. 'Z'.
> > digit --> '0' .. '9'.
> > c_identifier --> letter, repeat0((letter ; digit ; ['_'])).
> 
> I must admit, being able to write code in that style would probably make
> me drool all over the keyboard. But, I can't help wondering... while
> that's a nice example, it is perhaps a little too simplistic, given that
> the definition of letter is locale dependent and is based on the ctype
> functions rather than a mere character range. That would lead to a rather
> ugly switch forest or table of some kind.

That's OK, you can call arbitrary Mercury code from your DCG:

	:- use_module char.

	letter --> [Char], { char__is_alpha(Char) }.

> The repeat0/repeat1 predicates
> look nice.

Actually you probably couldn't write it exactly like that in Mercury.
It would need to be

	c_identifier --> letter, repeat0((pred(in, out) is semidet -->
					  letter ; digit ; ['_'])).

or

	c_identifier --> letter, repeat0(following_char).

	:- mode following_char(in, out) is semidet.
	following_char --> letter ; digit ; ['_'].

> I suppose it would be asking too much for them to be defined as
> postfix * and + operators, respectively :)

Yes, that conflicts with the use of * and + as infix operators.
The Prolog-style operator precedence parser that we use does
not allow an operator to be both infix and postfix.

> One last question, how can a
> program convert from canonical regular expression format into this
> compiled representation? I mean it's great for constant regexps, but what
> about reading one from the user or a file...

Well, you can define a type for representing regexps, e.g.

	:- import_module list, char, string.

	:- type regexp 
		--->	char(char)
		;	empty
		;	sequence(regexp, regexp)
		;	repeat0(regexp)
		;	repeat1(regexp).

and then you can write code to parse values of this type
from the usual syntax for regular expressions:

	% parse a regular expression
	:- pred parse_regexp(string::in, regexp::out) is semidet.
	parse_regexp(String, RegExp) :-
		regexp(RegExp, to_char_list(String), []).

	:- pred regexp(regexp::out, list(char)::in, list(char)::out) is det.
	regexp(RegExp) -->
		( single_regexp(R) ->
			regexp(Rs),
			{ RegExp = (Rs = empty -> R ; sequence(R, Rs)) }
		;
			{ RegExp = empty }
		).

	:- pred single_regexp(regexp::out, list(char)::in, list(char)::out)
		is semidet.
	single_regexp(R) -->
		basic_regexp(R0),
		( ['*'] ->
			{ R = repeat0(R0) }
		; ['+'] ->
			{ R = repeat1(R0) }
		;
			{ R = R0 }
		).

	basic_regexp(R) -->
		( ['('] -> regexp(R0), [')'],
			{ R = R0 }
		; [Char],
			{ R = char(Char) }
		).

Then you can write code to match against an arbitrary regular expression:

	% match a string against a regular expression
	:- pred match(regexp::in, list(char)::in, list(char)::out) is semidet.
	match(empty) --> [].
	match(sequence(R, Rs)) --> match(R), match(Rs).
	match(char(Char)) --> [Char].
	match(repeat0(RegExp)) --> repeat0(match(RegExp)).
	match(repeat1(RegExp)) --> repeat1(match(RegExp)).

Oh, and here's how you can define repeat0 and repeat1:

	repeat0(X) --> ( X -> repeat0(X) ; [] ).
	repeat1(X) --> X, repeat0(X).

-- 
Fergus Henderson <fjh at cs.mu.oz.au>  |  "I have always known that the pursuit
WWW: <http://www.cs.mu.oz.au/~fjh>  |  of excellence is a lethal habit"
PGP: finger fjh at 128.250.37.3        |     -- the last words of T. S. Garp.
--------------------------------------------------------------------------
mercury-developers mailing list
Post messages to:       mercury-developers at cs.mu.oz.au
Administrative Queries: owner-mercury-developers at cs.mu.oz.au
Subscriptions:          mercury-developers-request at cs.mu.oz.au
--------------------------------------------------------------------------



More information about the developers mailing list