[m-dev.] regular expressions

Peter Schachte peter at koan.cs.mu.OZ.AU
Thu Dec 16 01:48:15 AEDT 1999


On Wed, Dec 15, 1999 at 10:55:37PM +1100, Michael Day wrote:
> if regular expressions were added to the standard library, what kind of
> functionality would people like to see?

[...snip...]

> My primary question though is related to implementation. Is it customary
> for languages to roll their own regular expression functionality or simply
> wrap the POSIX library for it? Judging previous perl related comments,
> perl style regular expressions would not be welcome here? :)

Firstly, Perl is pretty much a hack (I can say that because I kind of
like Perl), which is why it's not liked here.  But that doesn't mean
that it's regexp facility wouldn't be respected.  I can't see why
people here wouldn't like perl regexps, except maybe "too many
features."  It has several nice features to recommend it.

But I think a better question to start with is:  would people prefer a
regexp library designed specifically to suit a language like Mercury,
or would they prefer the comfort of the usual approach.  There's
evidence to suggest the latter (eg, the Mercury approach to formatted
output).  If that's the preferred approach, your proposal sounds
reasonable.  I'd prefer the former, however.

To me, a regexp is a predicate with one input argument:  a (linear)
sequence (it doesn't have to be a string, or even a sequence of
characters; it could even be extended to non-linear structures, but
every time I think about that, I sprain my brain).  It would usually
also have an output sequence argument, specifying the rest of the
sequence after removing whatever syntactic element it matches, and it
may have other outputs as well, such as character positions,
substrings, etc.  DCGs are really pretty nice for this kind of
thing.  They handle sequences and alternatives quite nicely.
Repetition could be handled with a higher order predicate like

    :- pred repeat0(pred(S, S), S, S) <= sequence(S,T).
    :- pred repeat1(pred(S, S), S, S) <= sequence(S,T).

(I've probably messed up the type class part of this.  The first of
these would be like `*' and the second like `+' in usual regexp
implementations.)

Character classes could be handled by something like

    :- pred '..'(T,T,S,S) <= sequence(S,T), ord(T).
    :- mode '..'(in, in, in, out) is semidet.

    Lo .. Hi -->
	    [Next],
	    { Lo =< Next, Next =< Hi}.

Then you could write:

letter --> 'a' .. 'z' ; 'A' .. 'Z'.
digit --> '0' .. '9'.
c_identifier --> letter, repeat0((letter ; digit ; ['_'])).

(assuming `..' is defined to be an infix operator which binds tighter
than `;'.)

Of course, teaching the compiler to generate good code for such
constructs is the fun part....

-- 
Peter Schachte                | I have made this a rather long letter
mailto:pets at cs.mu.OZ.AU       | because I haven't had time to make it
http://www.cs.mu.oz.au/~pets/ | shorter.
PGP: finger pets at 128.250.37.3 |     -- Blaise Pascal 
--------------------------------------------------------------------------
mercury-developers mailing list
Post messages to:       mercury-developers at cs.mu.oz.au
Administrative Queries: owner-mercury-developers at cs.mu.oz.au
Subscriptions:          mercury-developers-request at cs.mu.oz.au
--------------------------------------------------------------------------



More information about the developers mailing list