Pattern Syntax

Pattern Syntax -- Describes PCRE regex syntax

Description

The PCRE library is a set of functions that implement regular expression pattern matching using the same syntax and semantics as Perl 5, with just a few differences (see below). The current implementation corresponds to Perl 5.005.

Differences From Perl

The differences described here are with respect to Perl 5.005.

  1. By default, a whitespace character is any character that the C library function isspace() recognizes, though it is possible to compile PCRE with alternative character type tables. Normally isspace() matches space, formfeed, newline, carriage return, horizontal tab, and vertical tab. Perl 5 no longer includes vertical tab in its set of whitespace characters. The \v escape that was in the Perl documentation for a long time was never in fact recognized. However, the character itself was treated as whitespace at least up to 5.002. In 5.004 and 5.005 it does not match \s.

  2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl permits them, but they do not mean what you might think. For example, (?!a){3} does not assert that the next three characters are not "a". It just asserts that the next character is not "a" three times.

  3. Capturing subpatterns that occur inside negative lookahead assertions are counted, but their entries in the offsets vector are never set. Perl sets its numerical variables from any such patterns that are matched before the assertion fails to match something (thereby succeeding), but only if the negative lookahead assertion contains just one branch.

  4. Though binary zero characters are supported in the subject string, they are not allowed in a pattern string because it is passed as a normal C string, terminated by zero. The escape sequence "\0" can be used in the pattern to represent a binary zero.

  5. The following Perl escape sequences are not supported: \l, \u, \L, \U, \E, \Q. In fact these are implemented by Perl's general string-handling and are not part of its pattern matching engine.

  6. The Perl \G assertion is not supported as it is not relevant to single pattern matches.

  7. Fairly obviously, PCRE does not support the (?{code}) construction.

  8. There are at the time of writing some oddities in Perl 5.005_02 concerned with the settings of captured strings when part of a pattern is repeated. For example, matching "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2 unset. However, if the pattern is changed to /^(aa(b(b))?)+$/ then $2 (and $3) get set. In Perl 5.004 $2 is set in both cases, and that is also true of PCRE. If in the future Perl changes to a consistent state that is different, PCRE may change to follow.

  9. Another as yet unresolved discrepancy is that in Perl 5.005_02 the pattern /^(a)?(?(1)a|b)+$/ matches the string "a", whereas in PCRE it does not. However, in both Perl and PCRE /^(a)?a/ matched against "a" leaves $1 unset.

  10. PCRE provides some extensions to the Perl regular expression facilities:

    1. Although lookbehind assertions must match fixed length strings, each alternative branch of a lookbehind assertion can match a different length of string. Perl 5.005 requires them all to have the same length.

    2. If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $ meta-character matches only at the very end of the string.

    3. If PCRE_EXTRA is set, a backslash followed by a letter with no special meaning is faulted.

    4. If PCRE_UNGREEDY is set, the greediness of the repetition quantifiers is inverted, that is, by default they are not greedy, but if followed by a question mark they are.
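
The repeated-capture case from item 8 can be tried in other Perl-style engines. The sketch below uses Python's re module purely as a stand-in (an assumption for illustration; it is not PCRE and says nothing about PCRE internals). In this engine, too, the inner group keeps the value from the last iteration in which it took part.

```python
import re

# Python's re module stands in for PCRE here (an assumption, not the PCRE API).
# The inner group keeps the value from the last iteration in which it matched.
first = re.match(r'^(a(b)?)+$', 'aba')
second = re.match(r'^(aa(bb)?)+$', 'aabbaa')

print(first.group(2))   # value captured by (b)
print(second.group(2))  # value captured by (bb)
```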

Regular Expression Details

Introduction

The syntax and semantics of the regular expressions supported by PCRE are described below. Regular expressions are also described in the Perl documentation and in a number of other books, some of which have copious examples. Jeffrey Friedl's "Mastering Regular Expressions", published by O'Reilly (ISBN 1-56592-257-3), covers them in great detail. The description here is intended as reference documentation.

A regular expression is a pattern that is matched against a subject string from left to right. Most characters stand for themselves in a pattern, and match the corresponding characters in the subject. As a trivial example, the pattern "The quick brown fox" matches a portion of a subject string that is identical to itself.

Meta-characters

The power of regular expressions comes from the ability to include alternatives and repetitions in the pattern. These are encoded in the pattern by the use of meta-characters, which do not stand for themselves but instead are interpreted in some special way.

There are two different sets of meta-characters: those that are recognized anywhere in the pattern except within square brackets, and those that are recognized in square brackets. Outside square brackets, the meta-characters are as follows:

  \      general escape character with several uses
  ^      assert start of subject (or line, in multiline mode)
  $      assert end of subject (or line, in multiline mode)
  .      match any character except newline (by default)
  [      start character class definition
  ]      end character class definition
  |      start of alternative branch
  (      start subpattern
  )      end subpattern
  ?      extends the meaning of (, also 0 or 1 quantifier, also quantifier minimizer
  *      0 or more quantifier
  +      1 or more quantifier
  {      start min/max quantifier
  }      end min/max quantifier

Part of a pattern that is in square brackets is called a "character class". In a character class the only meta-characters are:

  \      general escape character
  ^      negate the class, but only if the first character
  -      indicates character range
  ]      terminates the character class

The following sections describe the use of each of the meta-characters.

Backslash

The backslash character has several uses. Firstly, if it is followed by a non-alphanumeric character, it takes away any special meaning that character may have. This use of backslash as an escape character applies both inside and outside character classes.

For example, if you want to match a "*" character, you write "\*" in the pattern. This applies whether or not the following character would otherwise be interpreted as a meta-character, so it is always safe to precede a non-alphanumeric character with "\" to specify that it stands for itself. In particular, if you want to match a backslash, you write "\\".
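
As a minimal illustration of escaping, the sketch below uses Python's re module, chosen here only as a convenient Perl-style engine (an assumption; it is not PCRE). It matches a literal "*", a literal backslash, and a character that was not special to begin with.

```python
import re

# A backslash removes the special meaning of a non-alphanumeric character.
star = re.search(r'\*', '2*3')            # matches the literal "*"
backslash = re.search(r'\\', r'C:\tmp')   # matches a literal backslash
# Escaping a character that is not a meta-character is harmless.
at_sign = re.search(r'\@', 'user@host')

print(star.group(), backslash.group(), at_sign.group())
```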

If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the pattern (other than in a character class) and characters between a "#" outside a character class and the next newline character are ignored. An escaping backslash can be used to include a whitespace or "#" character as part of the pattern.
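
The same idea can be sketched with Python's re.VERBOSE flag, which is assumed here to be the closest counterpart of PCRE_EXTENDED in that module (Python's re is not PCRE). Unescaped whitespace and comments are dropped, while the escaped space and the escaped "#" stay in the pattern.

```python
import re

# re.VERBOSE plays the role of PCRE_EXTENDED in this sketch (an assumption).
pattern = re.compile(r"""
    \d+     # one or more digits
    \       # an escaped space is part of the pattern
    \#      # an escaped hash sign is a literal, not a comment
    [a-z]+  # lower-case letters
""", re.VERBOSE)

match = pattern.search('order 42 #abc')
print(match.group())
```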

A second use of backslash provides a way of encoding non-printing characters in patterns in a visible manner. There is no restriction on the appearance of non-printing characters, apart from the binary zero that terminates a pattern, but when a pattern is being prepared by text editing, it is usually easier to use one of the following escape sequences than the binary character it represents:

  \a     alarm, that is, the BEL character (hex 07)
  \cx    "control-x", where x is any character
  \e     escape (hex 1B)
  \f     formfeed (hex 0C)
  \n     newline (hex 0A)
  \r     carriage return (hex 0D)
  \t     tab (hex 09)
  \xhh   character with hex code hh
  \ddd   character with octal code ddd, or backreference

The precise effect of "\cx" is as follows: if "x" is a lower case letter, it is converted to upper case. Then bit 6 of the character (hex 40) is inverted. Thus "\cz" becomes hex 1A, but "\c{" becomes hex 3B, while "\c;" becomes hex 7B.
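
The arithmetic behind "\cx" can be checked with a few lines of Python. The helper name control_char is hypothetical and exists only for this sketch; it applies the two steps described above (upper-case a lower case letter, then invert bit 6).

```python
def control_char(x: str) -> int:
    """Return the character code produced by backslash-c followed by x."""
    # Step 1: a lower case ASCII letter is converted to upper case.
    code = ord(x.upper() if 'a' <= x <= 'z' else x)
    # Step 2: bit 6 of the character (hex 40) is inverted.
    return code ^ 0x40

for ch in 'z{;':
    print(ch, hex(control_char(ch)))
```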

After "\x", up to two hexadecimal digits are read (letters can be in upper or lower case).

After "\0" up to two further octal digits are read. In both cases, if there are fewer than two digits, just those that are present are used. Thus the sequence "\0\x\07" specifies two binary zeros followed by a BEL character. Make sure you supply two digits after the initial zero if the character that follows is itself an octal digit.

The handling of a backslash followed by a digit other than 0 is complicated. Outside a character class, PCRE reads it and any following digits as a decimal number. If the number is less than 10, or if there have been at least that many previous capturing left parentheses in the expression, the entire sequence is taken as a back reference. A description of how this works is given later, following the discussion of parenthesized subpatterns.

Inside a character class, or if the decimal number is greater than 9 and there have not been that many capturing subpatterns, PCRE re-reads up to three octal digits following the backslash, and generates a single byte from the least significant 8 bits of the value. Any subsequent digits stand for themselves. For example:

  \040   is another way of writing a space
  \40    is the same, provided there are fewer than 40 previous capturing subpatterns
  \7     is always a back reference
  \11    might be a back reference, or another way of writing a tab
  \011   is always a tab
  \0113  is a tab followed by the character "3"
  \113   is the character with octal code 113 (since there can be no more than 99 back references)
  \377   is a byte consisting entirely of 1 bits
  \81    is either a back reference, or a binary zero followed by the two characters "8" and "1"

Note that octal values of 100 or greater must not be introduced by a leading zero, because no more than three octal digits are ever read.
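
The decision rule for a backslash followed by digits can be restated as a short Python function. This is only a sketch of the rule as worded in this section, not PCRE's actual code; the function name read_backslash_digits and its parameters are hypothetical.

```python
OCTAL_DIGITS = '01234567'

def read_backslash_digits(digits, previous_groups, in_class=False):
    """Interpret the run of digits that follows a backslash.

    Returns ('backref', n, '') for a back reference, or
    ('byte', value, leftover) where leftover digits stand for themselves.
    """
    if digits[0] != '0' and not in_class:
        # Outside a class the whole run is first read as a decimal number.
        number = int(digits)
        if number < 10 or number <= previous_groups:
            return ('backref', number, '')
    # Otherwise read at most three octal digits and keep the low 8 bits.
    end = 0
    while end < 3 and end < len(digits) and digits[end] in OCTAL_DIGITS:
        end += 1
    value = int(digits[:end], 8) & 0xFF if end else 0
    return ('byte', value, digits[end:])

for run in ('040', '40', '7', '11', '0113', '113', '377', '81'):
    print(run, read_backslash_digits(run, previous_groups=0))
```

With previous_groups=0 every run except "7" is read as a byte value; raising the count to 11 or more turns "\11" into a back reference, as the list above describes.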

All the sequences that define a single byte value can be used both inside and outside character classes. In addition, inside a character class, the sequence "\b" is interpreted as the backspace character (hex 08). Outside a character class it has a different meaning (see below).

The third use of backslash is for specifying generic character types:

  \d     any decimal digit
  \D     any character that is not a decimal digit
  \s     any whitespace character
  \S     any character that is not a whitespace character
  \w     any "word" character
  \W     any "non-word" character

Each pair of escape sequences partitions the complete set of characters into two disjoint sets. Any given character matches one, and only one, of each pair.

A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place (see "Locale support" above). For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.

These character type sequences can appear both inside and outside character classes. They each match one character of the appropriate type. If the current matching point is at the end of the subject string, all of them fail, since there is no character to match.
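
A small Python sketch of the character type sequences follows. Python's re module is used as a stand-in for PCRE (an assumption), and the re.ASCII flag is passed so that \d, \s and \w are limited to ASCII, which is closer to the byte-oriented behaviour described here. The last expression checks the partition property: each character matches exactly one member of each pair.

```python
import re

text = 'id_42 = 3.14'

# re.ASCII restricts \d, \s and \w to the ASCII ranges.
digits = re.findall(r'\d+', text, re.ASCII)
words = re.findall(r'\w+', text, re.ASCII)

# Every character matches exactly one sequence of each pair.
pairs = ((r'\d', r'\D'), (r'\s', r'\S'), (r'\w', r'\W'))
partitioned = all(
    bool(re.fullmatch(lower, ch, re.ASCII)) != bool(re.fullmatch(upper, ch, re.ASCII))
    for ch in text
    for lower, upper in pairs
)

print(digits, words, partitioned)
```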

The fourth use of backslash is for certain simple assertions. An assertion specifies a condition that has to be met at a particular point in a match, without consuming any characters from the subject string. The use of subpatterns for more complicated assertions is described below. The backslashed assertions are:

  \b     word boundary
  \B     not a word boundary
  \A     start of subject (independent of multiline mode)
  \Z     end of subject or newline at end (independent of multiline mode)
  \z     end of subject (independent of multiline mode)

These assertions may not appear in character classes (but note that "\b" has a different meaning, namely the backspace character, inside a character class).

A word boundary is a position in the subject string where the current character and the previous character do not both match \w or \W (i.e. one matches \w and the other matches \W), or the start or end of the string if the first or last character matches \w, respectively.

The \A, \Z, and \z assertions differ from the traditional circumflex and dollar (described below) in that they only ever match at the very start and end of the subject string, whatever options are set. They are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options. The difference between \Z and \z is that \Z matches before a newline that is the last character of the string as well as at the end of the string, whereas \z matches only at the end.
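
The sketch below illustrates \b and \A with Python's re module, used as a stand-in for PCRE (an assumption). The end-of-subject assertions are left out on purpose: Python's \Z matches only at the very end of the string, so it does not behave like the \Z described here.

```python
import re

subject = 'cat concat cat.'
# \b needs a word character on one side and a non-word character
# (or the edge of the string) on the other.
whole_words = [m.start() for m in re.finditer(r'\bcat\b', subject)]

# \A matches only at the very start, even when ^ may match after newlines.
lines = 'one\ntwo'
caret = re.findall(r'^\w+', lines, re.MULTILINE)
start_only = re.findall(r'\A\w+', lines, re.MULTILINE)

print(whole_words, caret, start_only)
```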

Circumflex and dollar


Outside a character class, in the default matching mode, the circumflex character is an assertion which is true only if the current matching point is at the start of the subject string. Inside a character class, circumflex has an entirely different meaning (see below).

Circumflex need not be the first character of the pattern if a number of alternatives are involved, but it should be the first thing in each alternative in which it appears if the pattern is ever to match that branch. If all possible alternatives start with a circumflex, that is, if the pattern is constrained to match only at the start of the subject, it is said to be an "anchored" pattern. (There are also other constructs that can cause a pattern to be anchored.)

A dollar character is an assertion which is true only if the current matching point is at the end of the subject string, or immediately before a newline character that is the last character in the string (by default). Dollar need not be the last character of the pattern if a number of alternatives are involved, but it should be the last item in any branch in which it appears. Dollar has no special meaning in a character class.

The meaning of dollar can be changed so that it matches only at the very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at compile or matching time. This does not affect the \Z assertion.

The meanings of the circumflex and dollar characters are changed if the PCRE_MULTILINE option is set. When this is the case, they match immediately after and immediately before an internal "\n" character, respectively, in addition to matching at the start and end of the subject string. For example, the pattern /^abc$/ matches the subject string "def\nabc" in multiline mode, but not otherwise. Consequently, patterns that are anchored in single line mode because all branches start with "^" are not anchored in multiline mode. The PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is set.

Note that the sequences \A, \Z, and \z can be used to match the start and end of the subject in both modes, and if all branches of a pattern start with \A it is always anchored, whether PCRE_MULTILINE is set or not.
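
The /^abc$/ example can be reproduced with Python's re module, where the re.MULTILINE flag is assumed to play the role of PCRE_MULTILINE (Python's re is a stand-in here, not PCRE itself). The last line shows the default behaviour of dollar before a trailing newline.

```python
import re

subject = 'def\nabc'
single = re.search(r'^abc$', subject)                # default mode
multi = re.search(r'^abc$', subject, re.MULTILINE)   # multiline mode

# By default, dollar also matches just before a newline that ends the string.
before_final_newline = re.search(r'abc$', 'abc\n')

print(single, multi.group(), before_final_newline.group())
```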
     

Full stop


Outside a character class, a dot in the pattern matches any one character in the subject, including a non-printing character, but not (by default) newline. If the PCRE_DOTALL option is set, then dots match newlines as well. The handling of dot is entirely independent of the handling of circumflex and dollar, the only relationship being that they both involve newline characters. Dot has no special meaning in a character class.
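
A short sketch of the dot behaviour, again using Python's re module as a stand-in for PCRE, with re.DOTALL assumed to correspond to PCRE_DOTALL:

```python
import re

# By default the dot does not match a newline.
no_newline = re.search(r'a.b', 'a\nb')
# With DOTALL the dot matches a newline as well.
with_dotall = re.search(r'a.b', 'a\nb', re.DOTALL)
# Non-printing characters other than newline are matched in either mode.
nonprinting = re.search(r'a.b', 'a\x07b')

print(no_newline, with_dotall is not None, nonprinting is not None)
```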
     

Square brackets


An opening square bracket introduces a character class, terminated by a closing square bracket. A closing square bracket on its own is not special. If a closing square bracket is required as a member of the class, it should be the first data character in the class (after an initial circumflex, if present) or escaped with a backslash.
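
The placement rules for a closing square bracket can be tried out as follows. Python's re module is used as a stand-in for PCRE (an assumption); it accepts the same three spellings.

```python
import re

subject = 'a]b'
first_in_class = re.findall(r'[]a]', subject)   # "]" as the first data character
escaped = re.findall(r'[a\]]', subject)         # "]" escaped with a backslash
after_caret = re.findall(r'[^]a]', subject)     # "]" right after the initial circumflex
on_its_own = re.findall(r']', subject)          # a lone "]" is not special

print(first_in_class, escaped, after_caret, on_its_own)
```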

A character class matches a single character in the subject; the character must be in the set of characters defined by the class, unless the first character in the class is a&