[m-dev.] regular expressions
Fergus Henderson
fjh at cs.mu.OZ.AU
Sat Dec 18 13:24:51 AEDT 1999
On 17-Dec-1999, Michael Day <mcda at cat.cs.mu.OZ.AU> wrote:
>
> > letter --> 'a' .. 'z' ; 'A' .. 'Z'.
> > digit --> '0' .. '9'.
> > c_identifier --> letter, repeat0((letter ; digit ; ['_'])).
>
> I must admit, being able to write code in that style would probably make
> me drool all over the keyboard. But, I can't help wondering... while
> that's a nice example, it is perhaps a little too simplistic, given that
> the definition of letter is locale dependent and is based on the ctype
> functions rather than a mere character range. That would lead to a rather
> ugly switch forest or table of some kind.
That's OK, you can call arbitrary Mercury code from your DCG:
:- use_module char.
letter --> [Char], { char__is_alpha(Char) }.
> The repeat0/repeat1 predicates
> look nice.
Actually you probably couldn't write it exactly like that in Mercury.
It would need to be
c_identifier --> letter, repeat0((pred(in, out) is semidet -->
letter ; digit ; ['_'])).
or
c_identifier --> letter, repeat0(following_char).
:- mode following_char(in, out) is semidet.
following_char --> letter ; digit ; ['_'].
> I suppose it would be asking too much for them to be defined as
> postfix * and + operators, respectively :)
Yes, that conflicts with the use of * and + as infix operators.
The Prolog-style operator precedence parser that we use does
not allow an operator to be both infix and postfix.
> One last question, how can a
> program convert from canonical regular expression format into this
> compiled representation? I mean it's great for constant regexps, but what
> about reading one from the user or a file...
Well, you can define a type for representing regexps, e.g.
:- import_module list, char, string.
:- type regexp
---> char(char)
; empty
; sequence(regexp, regexp)
; repeat0(regexp)
; repeat1(regexp).
and then you can write code to parse values of this type
from the usual syntax for regular expressions:
% parse a regular expression
:- pred parse_regexp(string::in, regexp::out) is semidet.
parse_regexp(String, RegExp) :-
regexp(RegExp, to_char_list(String), []).
:- pred regexp(regexp::out, list(char)::in, list(char)::out) is det.
regexp(RegExp) -->
( single_regexp(R) ->
regexp(Rs),
{ RegExp = (Rs = empty -> R ; sequence(R, Rs)) }
;
{ RegExp = empty }
).
:- pred single_regexp(regexp::out, list(char)::in, list(char)::out)
is semidet.
single_regexp(R) -->
basic_regexp(R0),
( ['*'] ->
{ R = repeat0(R0) }
; ['+'] ->
{ R = repeat1(R0) }
;
{ R = R0 }
).
basic_regexp(R) -->
( ['('] -> regexp(R0), [')'],
{ R = R0 }
; [Char],
{ R = char(Char) }
).
Then you can write code to match against an arbitrary regular expression:
% match a string against a regular expression
:- pred match(regexp::in, list(char)::in, list(char)::out) is semidet.
match(empty) --> [].
match(sequence(R, Rs)) --> match(R), match(Rs).
match(char(Char)) --> [Char].
match(repeat0(RegExp)) --> repeat0(match(RegExp)).
match(repeat1(RegExp)) --> repeat1(match(RegExp)).
Oh, and here's how you can define repeat0 and repeat1:
repeat0(X) --> ( X -> repeat0(X) ; [] ).
repeat1(X) --> X, repeat0(X).
--
Fergus Henderson <fjh at cs.mu.oz.au> | "I have always known that the pursuit
WWW: <http://www.cs.mu.oz.au/~fjh> | of excellence is a lethal habit"
PGP: finger fjh at 128.250.37.3 | -- the last words of T. S. Garp.
--------------------------------------------------------------------------
mercury-developers mailing list
Post messages to: mercury-developers at cs.mu.oz.au
Administrative Queries: owner-mercury-developers at cs.mu.oz.au
Subscriptions: mercury-developers-request at cs.mu.oz.au
--------------------------------------------------------------------------
More information about the developers
mailing list