man Regexp::Common () - Provide commonly requested regular expressions

NAME

Regexp::Common - Provide commonly requested regular expressions

SYNOPSIS

 # STANDARD USAGE

 use Regexp::Common;

 while (<>) {
     /$RE{num}{real}/               and print q{a number};
     /$RE{quoted}/                  and print q{a ['"`] quoted string};
     /$RE{delimited}{-delim=>'/'}/  and print q{a /.../ sequence};
     /$RE{balanced}{-parens=>'()'}/ and print q{balanced parentheses};
     /$RE{profanity}/               and print q{a #*@%-ing word};
 }

 # SUBROUTINE-BASED INTERFACE

 use Regexp::Common 'RE_ALL';

 while (<>) {
     $_ =~ RE_num_real()              and print q{a number};
     $_ =~ RE_quoted()                and print q{a ['"`] quoted string};
     $_ =~ RE_delimited(-delim=>'/')  and print q{a /.../ sequence};
     $_ =~ RE_balanced(-parens=>'()') and print q{balanced parentheses};
     $_ =~ RE_profanity()             and print q{a #*@%-ing word};
 }

 # IN-LINE MATCHING...

 if ( $RE{num}{int}->matches($text) ) {...}

 # ...AND SUBSTITUTION

 my $cropped = $RE{ws}{crop}->subs($uncropped);

 # ROLL-YOUR-OWN PATTERNS

 use Regexp::Common 'pattern';

 pattern name   => ['name', 'mine'],
         create => '(?i:J[.]?\s+A[.]?\s+Perl-Hacker)',
         ;

 my $name_matcher = $RE{name}{mine};

 pattern name    => [ 'lineof', '-char=_' ],
         create  => sub {
                        my ($self, $flags) = @_;
                        my $char = quotemeta $flags->{-char};
                        return "(?:^$char+\$)";
                    },
         matches => sub {
                        my ($self, $str) = @_;
                        return $str !~ /[^$self->{flags}{-char}]/;
                    },
         subs   => sub {
                        my ($self, $str, $replacement) = @_;
                        $_[1] =~ s/^$self->{flags}{-char}+$//g;
                   },
         ;

 my $asterisks = $RE{lineof}{-char=>'*'};

 # DECIDING WHICH PATTERNS TO LOAD.

 use Regexp::Common qw /comment number/;  # Comment and number patterns.
 use Regexp::Common qw /no_defaults/;     # Don't load any patterns.
 use Regexp::Common qw /!delimited/;      # All, but delimited patterns.

DESCRIPTION

By default, this module exports a single hash (%RE) that stores or generates commonly needed regular expressions (see List of available patterns).

There is an alternative, subroutine-based syntax described in Subroutine-based interface.

General syntax for requesting patterns

To access a particular pattern, %RE is treated as a hierarchical hash of hashes (of hashes...), with each successive key being an identifier. For example, to access the pattern that matches real numbers, you specify:

        $RE{num}{real}

and to access the pattern that matches integers:

        $RE{num}{int}

Deeper layers of the hash are used to specify flags: arguments that modify the resulting pattern in some way. The keys used to access these layers are prefixed with a minus sign and may have a value; if a value is given, it's done by using a multidimensional key. For example, to access the pattern that matches base-2 real numbers with embedded commas separating groups of three digits (e.g. 10,101,110.110101101):

        $RE{num}{real}{-base => 2}{-sep => ','}{-group => 3}

Through the magic of Perl, these flag layers may be specified in any order (and even interspersed through the identifier keys!) so you could get the same pattern with:

        $RE{num}{real}{-sep => ','}{-group => 3}{-base => 2}

or:

        $RE{num}{-base => 2}{real}{-group => 3}{-sep => ','}

or even:

        $RE{-base => 2}{-group => 3}{-sep => ','}{num}{real}

etc.

Note, however, that the relative order of the identifier keys is significant. That is:

        $RE{list}{set}

would not be the same as:

        $RE{set}{list}
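
The following is a minimal sketch of the flag-order rule, assuming the default number patterns are loaded. It stringifies three spellings of the same request and compares the results; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Regexp::Common;

# Concatenating with "" forces each pattern object to produce its regex source.
my $canonical   = "" . $RE{num}{real}{-base => 2}{-sep => ','}{-group => 3};
my $reordered   = "" . $RE{num}{real}{-sep => ','}{-group => 3}{-base => 2};
my $interleaved = "" . $RE{-base => 2}{-group => 3}{-sep => ','}{num}{real};

# Flags may appear in any position, so all three should be the same pattern.
print "same pattern\n"
    if $canonical eq $reordered and $canonical eq $interleaved;

# Try the pattern on the sample number given in the text above.
print "matches\n" if "10,101,110.110101101" =~ m{\A$canonical\z};
```

Only the flag keys are interchangeable here; the identifier keys (num, real) keep their relative order in every spelling.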

Flag syntax

In versions prior to 2.113, flags could also be written as {"-flag=value"}. This no longer works, although {"-flag$;value"} still does. However, {-flag => 'value'} is the preferred syntax.
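
The equivalence of the last two forms comes from Perl's multidimensional hash keys: a list inside a hash subscript is joined with $; to form a single key. The sketch below shows this on an ordinary hash (%demo is a hypothetical name, not part of the module), so it runs without Regexp::Common.

```perl
use strict;
use warnings;

my %demo;                        # an ordinary hash, standing in for %RE
$demo{-flag => 'value'} = 1;     # list subscript: the elements are joined with $;

# Build the same key by hand and look it up.
my $joined = join $;, '-flag', 'value';
print "same key\n" if exists $demo{$joined};

# The interpolated spelling from the text names the same key too.
print "same key again\n" if exists $demo{"-flag$;value"};
```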

Universal flags

Normally, flags are specific to a single pattern. However, there are two flags that all patterns may specify: -keep and -i.

By default, the patterns provided by %RE contain no capturing parentheses. However, if the -keep flag is specified (it requires no value), then any significant substrings that the pattern matches are captured. For example:

        if ($str =~ $RE{num}{real}{-keep}) {
                $number   = $1;
                $whole    = $3;
                $decimals = $5;
        }

Special care is needed if a kept pattern is interpolated into a larger regular expression, as the presence of other capturing parentheses is likely to change the number variables into which significant substrings are saved.

See also Adding new regular expressions, which describes how to create new patterns with optional capturing brackets that respond to -keep.

Some patterns or subpatterns only match lowercase or uppercase letters. To match case-insensitively, one option is to use the /i regexp modifier or the special sequence (?i). But if the functional interface is used, one does not have this option. The -i switch solves this problem; with it, the pattern matches case-insensitively.
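
As a small illustration of the numbering caveat, the sketch below uses $RE{num}{int} with -keep, once on its own and once inside a larger regex that has its own capturing group. It relies only on the convention that the outermost capture holds the entire match; the sample strings and variable names are made up for the example.

```perl
use strict;
use warnings;
use Regexp::Common;

my $int_kept = $RE{num}{int}{-keep};

# On its own, $1 holds the entire matched integer.
my $alone;
$alone = $1 if "offset -42 bytes" =~ /$int_kept/;

# Inside a larger regex, the earlier group takes $1, so the kept
# pattern's captures all shift up by one.
my ($label, $value);
($label, $value) = ($1, $2) if "offset=-42" =~ /(\w+)=$int_kept/;

print "alone: $alone, label: $label, value: $value\n";
```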

OO interface and inline matching/substitution

The patterns returned from %RE are objects, so rather than writing:

        if ($str =~ /$RE{some}{pattern}/ ) {...}

you can write:

        if ( $RE{some}{pattern}->matches($str) ) {...}

For matching this would seem to have no great advantage apart from readability (but see below). For substitutions, it has other significant benefits. Frequently you want to perform a substitution on a string without changing the original. Most people use this:

        $changed = $original;
        $changed =~ s/$RE{some}{pattern}/$replacement/;

The more adept use:

        ($changed = $original) =~ s/$RE{some}{pattern}/$replacement/;

Regexp::Common allows you to write this:

        $changed = $RE{some}{pattern}->subs($original=>$replacement);

Apart from reducing precedence-angst, this approach has the added advantages that the substitution behaviour can be optimized for the regular expression, and the replacement string can be provided by default (see Adding new regular expressions). For example, in the implementation of this substitution:

        $cropped = $RE{ws}{crop}->subs($uncropped);

the default empty string is provided automatically, and the substitution is optimized to use:

        $uncropped =~ s/^\s+//;
        $uncropped =~ s/\s+$//;

rather than:

        $uncropped =~ s/^\s+|\s+$//g;
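
The idioms discussed above can be tried in plain Perl, without the module. This sketch applies the copy-and-substitute form to both cropping strategies and leaves the original string untouched; the sample string and variable names are assumptions for the example.

```perl
use strict;
use warnings;

my $original = "  some text  ";

# Copy first, then substitute on the copy: two anchored substitutions.
(my $two_pass = $original) =~ s/^\s+//;
$two_pass =~ s/\s+$//;

# The same crop written as a single alternation with /g.
(my $one_pass = $original) =~ s/^\s+|\s+$//g;

print "[$two_pass] [$one_pass] [$original]\n";
```

Both forms give the same cropped string; the two-step version is the one the text describes as the optimized implementation.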

Subroutine-based interface

The hash-based interface was chosen because it allows regexes to be effortlessly interpolated, and because it also allows them to be curried. For example:

        my $num = $RE{num}{int};
        my $commad     = $num->{-sep=>','}{-group=>3};
        my $duodecimal = $num->{-base=>12};

However, the use of tied hashes does make the access to Regexp::Common patterns slower than it might otherwise be. In contexts where impatience overrules laziness, Regexp::Common provides an additional subroutine-based interface.

For each (sub-)entry in the %RE hash ($RE{key1}{key2}{etc}), there is a corresponding exportable subroutine: RE_key1_key2_etc(). The name of each subroutine is the underscore-separated concatenation of the non-flag keys that locate the same pattern in %RE. Flags are passed to the subroutine in its argument list. Thus:

        use Regexp::Common qw( RE_ws_crop RE_num_real RE_profanity );
        $str =~ RE_ws_crop() and die "Surrounded by whitespace";
        $str =~ RE_num_real(-base=>8, -sep=>" ") or next;
        $offensive = RE_profanity(-keep);
        $str =~ s/$offensive/$bad{$1}++; "<expletive deleted>"/ge;

Note that, unlike the hash-based interface (which returns objects), these subroutines return ordinary qr'd regular expressions. Hence they do not curry, nor do they provide the OO match and substitution inlining described in the previous section.

It is also possible to export subroutines for all available patterns like so:

        use Regexp::Common 'RE_ALL';

Or you can export all subroutines with a common prefix of keys like so:

        use Regexp::Common 'RE_num_ALL';

which will export RE_num_int and RE_num_real (and if you have created more patterns whose first key is num, those will be exported as well). In general, RE_key1_..._keyn_ALL will export all subroutines whose pattern names have first keys key1 ... keyn.
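
A brief sketch of the subroutine interface follows, assuming the default number patterns are available. It checks what kind of value a subroutine returns and passes the -keep flag through the argument list; the sample string and variable names are illustrative.

```perl
use strict;
use warnings;
use Regexp::Common qw( RE_num_int RE_num_real );

# The subroutines hand back compiled regexes rather than pattern objects.
my $real = RE_num_real();
print "returned a ", ref($real), "\n";

# Flags go in the argument list; with -keep, $1 is the whole match.
my $first_int;
$first_int = $1 if "width: 640px" =~ RE_num_int(-keep);
print "first integer: $first_int\n";
```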

Adding new regular expressions

You can add your own regular expressions to the %RE hash at run-time, using the exportable pattern subroutine. It expects a hash-like list of key/value pairs that specify the behaviour of the pattern. The various possible argument pairs are:

name
A required argument that specifies the name of the pattern, and any flags it may take, via a reference to a list of strings. For example:

         pattern name => [qw( line of -char )],
                 # other args here
                 ;

This specifies an entry $RE{line}{of}, which may take a -char flag.

Flags may also be specified with a default value, which is then used whenever the flag is omitted, or specified without an explicit value. For example:

         pattern name => [qw( line of -char=_ )],
                 # default char is '_'
                 # other args here
                 ;

create
A required argument that specifies either a string that is to be returned as the pattern:

        pattern name    => [qw( line of underscores )],
                create  => q/(?:^_+$)/
                ;

or a reference to a subroutine that will be called to create the pattern:

        pattern name    => [qw( line of -char=_ )],
                create  => sub {
                                my ($self, $flags) = @_;
                                my $char = quotemeta $flags->{-char};
                                return "(?:^$char+\$)";
                            },
                ;

If the subroutine version is used, the subroutine will be called with three arguments: a reference to the pattern object itself, a reference to a hash containing the flags and their values, and a reference to an array containing the non-flag keys.

Whatever the subroutine returns is stringified as the pattern.

No matter how the pattern is created, it is immediately postprocessed to include or exclude capturing parentheses (according to the value of the -keep flag). To specify such optional capturing parentheses within the regular expression associated with create, use the notation (?k:...). Any parentheses of this type will be converted to (...) when the -keep flag is specified, or (?:...) when it is not. It is a Regexp::Common convention that the outermost capturing parentheses always capture the entire pattern, but this is not enforced.

matches
An optional argument that specifies a subroutine that is to be called when the $RE{...}->matches(...) method of this pattern is invoked.

The subroutine should expect two arguments: a reference to the pattern object itself, and the string to be matched against.

It should return the same types of values as a m/.../ does. For example:

     pattern name    => [qw( line of -char )],
             create  => sub {...},
             matches => sub {
                             my ($self, $str) = @_;
                             $str !~ /[^$self->{flags}{-char}]/;
                        },
             ;

subs
An optional argument that specifies a subroutine that is to be called when the $RE{...}->subs(...) method of this pattern is invoked.

The subroutine should expect three arguments: a reference to the pattern object itself, the string to be changed, and the value to be substituted into it. The third argument may be undef, indicating the default substitution is required.

The subroutine should return the same types of values as an s/.../.../ does. For example:

     pattern name    => [ 'lineof', '-char=_' ],
             create  => sub {...},
             subs    => sub {
                          my ($self, $str, $ignore_replacement) = @_;
                          $_[1] =~ s/^$self->{flags}{-char}+$//g;
                        },
             ;

Note that such a subroutine will almost always need to modify $_[1] directly.

version
If this argument is given, it specifies the minimum version of perl required to use the new pattern. Attempts to use the pattern with earlier versions of perl will generate a fatal diagnostic.
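
To tie these arguments together, here is a minimal sketch of a user-defined pattern that uses a default flag value and the (?k:...) notation. The name demo/lineof is hypothetical, chosen so that it does not clash with any shipped pattern; the behaviour shown assumes the postprocessing described above.

```perl
use strict;
use warnings;
use Regexp::Common qw( pattern );

# Hypothetical pattern: a line made up of one repeated character.
pattern name   => [qw( demo lineof -char=_ )],
        create => sub {
            my ($self, $flags) = @_;
            my $char = quotemeta $flags->{-char};
            # Double quotes so $char interpolates; \$ keeps the end anchor.
            return "(?k:^$char+\$)";
        },
        ;

# Without -keep, (?k:...) becomes a non-capturing group.
my $plain = "" . $RE{demo}{lineof};

# With -keep it becomes a capturing group, so $1 is set on a match.
my $kept_line;
$kept_line = $1 if "____" =~ $RE{demo}{lineof}{-keep};

# An explicit flag value overrides the default character.
my $stars_ok = "***" =~ $RE{demo}{lineof}{-char => '*'} ? 1 : 0;
my $mixed_ok = "*_*" =~ $RE{demo}{lineof}{-char => '*'} ? 1 : 0;

print "plain source: $plain\n";
print "kept: $kept_line, stars: $stars_ok, mixed: $mixed_ok\n";
```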

Loading specific sets of patterns.

By default, all the sets of patterns listed below are made available. However, it is possible to indicate which sets of patterns should be made available: the wanted sets should be given as arguments to use. Alternatively, it is also possible to indicate which sets of patterns should not be made available: those sets are given as arguments to the use statement, but are preceded by an exclamation mark. The argument no_defaults indicates that none of the default patterns should be made available. This is useful, for instance, if all you want is the pattern() subroutine.

Examples:

 use Regexp::Common qw /comment number/;  # Comment and number patterns.
 use Regexp::Common qw /no_defaults/;     # Don't load any patterns.
 use Regexp::Common qw /!delimited/;      # All, but delimited patterns.

It's also possible to load your own set of patterns. If you have a module Regexp::Common::my_patterns that makes patterns available, you can have it made available with

 use Regexp::Common qw /my_patterns/;

Note that the default patterns will still be made available. The defaults you do not mention are dropped only if you use no_defaults or mention one of the default sets explicitly.

List of available patterns

The patterns listed below are currently available. For each pattern set named name, the manual page Regexp::Common::name describes the details. Currently available are:

Regexp::Common::balanced
Provides regexes for strings with balanced parenthesized delimiters.
Regexp::Common::comment
Provides regexes for comments of various languages (43 languages currently).
Regexp::Common::delimited
Provides regexes for delimited strings.
Regexp::Common::lingua
Provides regexes for palindromes.
Regexp::Common::list
Provides regexes for lists.
Regexp::Common::net
Provides regexes for IPv4 addresses and MAC addresses.
Regexp::Common::number
Provides regexes for numbers (integers and reals).
Regexp::Common::profanity
Provides regexes for profanity.
Regexp::Common::whitespace
Provides regexes for leading and trailing whitespace.
Regexp::Common::zip
Provides regexes for zip codes.

Forthcoming patterns and features

Future releases of the module will also provide patterns for the following:

        * email addresses
        * HTML/XML tags
        * more numerical matchers
        * mail headers (including multiline ones)
        * more URLs
        * telephone numbers of various countries
        * currency (universal 3 letter format, Latin-1, currency names)
        * dates
        * binary formats (e.g. UUencoded, MIMEd)

If you have other patterns or pattern generators that you think would be generally useful, please send them to the maintainer, preferably as source code using the pattern subroutine. Submissions that include a set of tests will be especially welcome.

DIAGNOSTICS

The subroutine-based interface didn't recognize the requested subroutine. Often caused by a spelling mistake or an incompletely specified name.

Regexp::Common doesn't have a generator for the requested pattern. Often indicates a misspelt or missing parameter.

The requested pattern requires advanced regex features (e.g. recursion) that are not available in your version of Perl. Time to upgrade.

Every user-defined pattern specification must have a name.

Every user-defined pattern specification must provide a pattern creation mechanism: either a pattern string or a reference to a subroutine that returns the pattern string.

The $RE{num}{real}{-base=>'N'} pattern uses the characters [0-9A-Z] to represent the digits of various bases. Hence it only produces regular expressions for bases up to 36 (hexatrigesimal).

The $RE{delimited} pattern has no default delimiter. You need to write: $RE{delimited}{-delim=>'X'} for some character X.

ACKNOWLEDGEMENTS

Deepest thanks to the many people who have encouraged and contributed to this project, especially: Elijah, Jarkko, Tom, Nat, Ed, and Vivek.

HISTORY

  $Log: Common.pm,v $
  Revision 2.120  2005/03/16 00:24:45  abigail
  Load Carp only on demand
  Revision 2.119  2005/01/01 16:35:14  abigail
  - Updated copyright notice. New release.
  Revision 2.118  2004/12/14 23:17:57  abigail
  Fixed the generic OO routines.
  Revision 2.117  2004/06/30 15:01:35  abigail
  Pod nits. (Jim Cromie)
  Revision 2.116  2004/06/30 09:37:36  abigail
  New version
  Revision 2.115  2004/06/09 21:58:01  abigail
  - 'SEN'
  - New release.
  Revision 2.114  2003/05/25 21:34:56  abigail
  POD nits from Bryan C. Warnock
  Revision 2.113  2003/04/02 21:23:48  abigail
  Removed anything related to $; being '='
  Revision 2.112  2003/03/25 23:27:27  abigail
  New release
  Revision 2.111  2003/03/12 22:37:13  abigail
  +  The -i switch.
  +  New release.
  Revision 2.110  2003/02/21 14:55:31  abigail
  New release
  Revision 2.109  2003/02/10 21:36:58  abigail
  New release
  Revision 2.108  2003/02/09 21:45:07  abigail
  New release
  Revision 2.107  2003/02/07 15:23:03  abigail
  New release
  Revision 2.106  2003/02/02 17:44:58  abigail
  New release
  Revision 2.105  2003/02/02 03:20:32  abigail
  New release
  Revision 2.104  2003/01/24 15:43:40  abigail
  New release
  Revision 2.103  2003/01/23 02:19:01  abigail
  New release
  Revision 2.102  2003/01/22 17:32:34  abigail
  New release
  Revision 2.101  2003/01/21 23:52:18  abigail
  POD fix.
  Revision 2.100  2003/01/21 23:19:40  abigail
  The whole world understands RCS/CVS version numbers, that 1.9 is an
  older version than 1.10. Except CPAN. Curse the idiot(s) who think
  that version numbers are floats (in which universe do floats have
  more than one decimal dot?).
  Everything is bumped to version 2.100 because CPAN couldn't deal
  with the fact one file had version 1.10.
  Revision 1.30  2003/01/17 13:19:04  abigail
  New release
  Revision 1.29  2003/01/16 11:08:41  abigail
  New release
  Revision 1.28  2003/01/01 23:03:53  abigail
  New distribution
  Revision 1.27  2003/01/01 17:09:07  abigail
  lingua class added
  Revision 1.26  2002/12/30 23:08:28  abigail
  New module Regexp::Common::zip
  Revision 1.25  2002/12/27 23:34:44  abigail
  New release
  Revision 1.24  2002/12/24 00:00:04  abigail
  New release
  Revision 1.23  2002/11/06 13:50:23  abigail
  Minor POD changes.
  Revision 1.22  2002/10/01 18:25:46  abigail
  POD buglets.
  Revision 1.21  2002/09/18 17:46:11  abigail
  POD Typo fix (Douglas Hunter)
  Revision 1.20  2002/08/27 17:04:29  abigail
  VERSION is now extracted from the CVS revision number.
  Revision 1.19  2002/08/06 14:46:49  abigail
  Upped version number to 0.09.
  Revision 1.18  2002/08/06 13:50:08  abigail
  - Added HISTORY section with CVS log.
  - Upped version number to 0.08.
  Revision 1.17  2002/08/05 12:21:46  abigail
  Upped version number to 0.07.
  Revision 1.16  2002/08/05 12:16:30  abigail
  Fixed 'Regex::' typo to 'Regexp::' (Found my Mike Castle).
  Revision 1.15  2002/08/04 22:56:02  abigail
  Upped version number to 0.06.
  Revision 1.14  2002/08/04 19:33:33  abigail
  Loaded URI by default.
  Revision 1.13  2002/08/01 10:02:42  abigail
  Upped version number.
  Revision 1.12  2002/07/31 23:26:06  abigail
  Upped version number.
  Revision 1.11  2002/07/31 13:11:20  abigail
  Removed URL from the list of default loaded regexes, as this one isn't
  ready yet.
  Upped the version number to 0.03.
  Revision 1.10  2002/07/29 13:16:38  abigail
  Introduced 'use strict' (which uncovered a bug, \@non_flags was used
  when $spec{create} was called instead of \@nonflags).
  Turned warnings on (using local $^W = 1; "use warnings" isn't available
  in pre 5.6).
  Revision 1.9  2002/07/28 23:02:54  abigail
  Split out the remaining pattern groups to separate files.
  Fixed a bug in _decache, changed the regex /$fpat=(.+)/ to
  /$fpat=(.*)/, to be able to distinguish the case of a flag
  set to the empty string, or a flag without an argument.
  Added 'undef' to @_ in the sub_interface setting to avoid a warning
  of setting a hash with an odd number of arguments.
  POD fixes.
  Revision 1.8  2002/07/25 23:55:54  abigail
  Moved balanced, net and URL to separate files.
  Revision 1.7  2002/07/25 20:01:40  abigail
  Modified import() to deal with factoring out groups of related regexes.
  Factored out comments into Common/comment.
  Revision 1.6  2002/07/23 21:20:43  abigail
  Upped version number to 0.02.
  Revision 1.5  2002/07/23 21:14:55  abigail
  Added $RE{comment}{HTML}.
  Revision 1.4  2002/07/23 17:01:09  abigail
  Added lines about new maintainer, and an email address to submit bugs
  and new regexes to.
  Revision 1.3  2002/07/23 13:58:58  abigail
  Changed various occurences of C<... => ...> into C<< ... => ... >>.
  Revision 1.2  2002/07/23 12:27:07  abigail
  Line 733 was missing the closing > of a C<> in the POD.
  Revision 1.1  2002/07/23 12:22:51  abigail
  Initial revision

AUTHOR

Damian Conway (damian@conway.org)

MAINTENANCE

This package is maintained by Abigail (regexp-common@abigail.nl).

BUGS AND IRRITATIONS

Bound to be plenty. For a start, there are many common regexes missing. Send them in to regexp-common@abigail.nl.

COPYRIGHT

   Copyright (c) 2001 - 2005, Damian Conway and Abigail. All Rights
 Reserved. This module is free software. It may be used, redistributed
     and/or modified under the terms of the Perl Artistic License
           (see http://www.perl.com/perl/misc/Artistic.html)