<?xml version="1.0" encoding="utf-8"?>
<!-- $Revision$ -->
<!-- splitted from ./en/functions/pcre.xml, last change in rev 1.2 -->
<chapter xml:id="reference.pcre.pattern.syntax" xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink">
<title>Pattern Syntax</title>
<titleabbrev>PCRE regex syntax</titleabbrev>
<section xml:id="regexp.introduction">
<title>Introduction</title>
<para>
The syntax and semantics of the regular expressions
supported by PCRE are described in this section. Regular expressions are
also described in the Perl documentation and in a number of
other books, some of which have copious examples. Jeffrey
Friedl's "Mastering Regular Expressions", published by
O'Reilly (ISBN 1-56592-257-3), covers them in great detail.
The description here is intended as reference documentation.
</para>
<para>
A regular expression is a pattern that is matched against a
subject string from left to right. Most characters stand for
themselves in a pattern, and match the corresponding
characters in the subject. As a trivial example, the pattern
<literal>The quick brown fox</literal>
matches a portion of a subject string that is identical to
itself.
</para>
</section>
<section xml:id="regexp.reference.delimiters">
<title>Delimiters</title>
<para>
When using the PCRE functions, it is required that the pattern is enclosed
by <emphasis>delimiters</emphasis>. A delimiter can be any non-alphanumeric,
non-backslash, non-whitespace character.
Leading whitespace before a valid delimiter is silently ignored.
</para>
<para>
Commonly used delimiters are forward slashes (<literal>/</literal>), hash
signs (<literal>#</literal>) and tildes (<literal>~</literal>). The
following are all examples of valid delimited patterns.
<informalexample>
<programlisting>
<![CDATA[
/foo bar/
#^[^0-9]$#
+php+
%[a-zA-Z0-9_-]%
]]>
</programlisting>
</informalexample>
</para>
<para>
It is also possible to use
bracket style delimiters where the opening and closing brackets are the
starting and ending delimiter, respectively. <literal>()</literal>,
<literal>{}</literal>, <literal>[]</literal> and <literal>&lt;&gt;</literal>
are all valid bracket style delimiter pairs.
<informalexample>
<programlisting>
<![CDATA[
(this [is] a (pattern))
{this [is] a (pattern)}
[this [is] a (pattern)]
<this [is] a (pattern)>
]]>
</programlisting>
</informalexample>
Bracket style delimiters do not need to be escaped when they are used as meta
characters within the pattern, but as with other delimiters they must be
escaped when they are used as literal characters.
</para>
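<para>
As a minimal sketch (the subject strings are made up for illustration), the
following shows bracket style delimiters wrapping patterns that use the same
brackets as meta-characters, and a literal bracket escaped with a backslash:
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// Parentheses as delimiters, with an unescaped group inside the pattern.
$group = preg_match('(^(a|b)+$)', 'abba');

// Curly braces as delimiters, with an unescaped {3} quantifier inside.
$quantifier = preg_match('{^\d{3}$}', '123');

// A literal opening parenthesis must be escaped, as with any delimiter.
$literal = preg_match('(^f\($)', 'f(');

var_dump($group, $quantifier, $literal);
?>
]]>
</programlisting>
</informalexample>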
<para>
If the delimiter needs to be matched inside the pattern it must be
escaped using a backslash. If the delimiter appears often inside the
pattern, it is a good idea to choose another delimiter in order to increase
readability.
<informalexample>
<programlisting>
<![CDATA[
/http:\/\//
#http://#
]]>
</programlisting>
</informalexample>
The <function>preg_quote</function> function may be used to escape a string
for injection into a pattern and its optional second parameter may be used
to specify the delimiter to be escaped.
</para>
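<para>
The sketch below illustrates both points under simple assumptions: the same
pattern written with two delimiters, and <function>preg_quote</function>
called with the delimiter as its second argument. The subject strings and
variable names are illustrative only.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
$subject = 'see http://example.com/ for details';

// The delimiter must be escaped when it occurs inside the pattern...
$withSlashes = preg_match('/http:\/\//', $subject);
// ...so a different delimiter is often easier to read.
$withHashes = preg_match('#http://#', $subject);

// preg_quote() escapes regex meta-characters; the second argument
// additionally escapes the chosen delimiter.
$needle = 'a/b.c';                  // hypothetical literal text to search for
$quoted = preg_quote($needle, '/'); // a\/b\.c
$literal = preg_match('/' . $quoted . '/', 'path a/b.c here');

var_dump($withSlashes, $withHashes, $quoted, $literal);
?>
]]>
</programlisting>
</informalexample>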
<para>
You may add <link linkend="reference.pcre.pattern.modifiers">pattern
modifiers</link> after the ending delimiter. The following is an example
of case-insensitive matching:
<informalexample>
<programlisting>
<![CDATA[
#[a-z]#i
]]>
</programlisting>
</informalexample>
</para>
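<para>
A short sketch of the effect of a trailing modifier, using an illustrative
upper case subject:
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// Without a modifier, [a-z] does not match upper case letters.
$sensitive = preg_match('#[a-z]#', 'PHP');

// The i modifier after the ending delimiter makes the match case-insensitive.
$insensitive = preg_match('#[a-z]#i', 'PHP');

var_dump($sensitive, $insensitive);
?>
]]>
</programlisting>
</informalexample>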
</section>
<section xml:id="regexp.reference.meta">
<title>Meta-characters</title>
<para>
The power of regular expressions comes from the
ability to include alternatives and repetitions in the
pattern. These are encoded in the pattern by the use of
<emphasis>meta-characters</emphasis>, which do not stand for themselves but instead
are interpreted in some special way.
</para>
<para>
There are two different sets of meta-characters: those that
are recognized anywhere in the pattern except within square
brackets, and those that are recognized in square brackets.
Outside square brackets, the meta-characters are as follows:
<table>
<title>Meta-characters outside square brackets</title>
<tgroup cols="2">
<thead>
<row>
<entry>Meta-character</entry><entry>Description</entry>
</row>
</thead>
<tbody>
<row>
<entry>\</entry><entry>general escape character with several uses</entry>
</row>
<row>
<entry>^</entry><entry>assert start of subject (or line, in multiline mode)</entry>
</row>
<row>
<entry>$</entry><entry>assert end of subject or before a terminating newline (or
end of line, in multiline mode)</entry>
</row>
<row>
<entry>.</entry><entry>match any character except newline (by default)</entry>
</row>
<row>
<entry>[</entry><entry>start character class definition</entry>
</row>
<row>
<entry>]</entry><entry>end character class definition</entry>
</row>
<row>
<entry>|</entry><entry>start of alternative branch</entry>
</row>
<row>
<entry>(</entry><entry>start subpattern</entry>
</row>
<row>
<entry>)</entry><entry>end subpattern</entry>
</row>
<row>
<entry>?</entry><entry>extends the meaning of (, also 0 or 1 quantifier, also makes greedy
quantifiers lazy (see <link linkend="regexp.reference.repetition">repetition</link>)</entry>
</row>
<row>
<entry>*</entry><entry>0 or more quantifier</entry>
</row>
<row>
<entry>+</entry><entry>1 or more quantifier</entry>
</row>
<row>
<entry>{</entry><entry>start min/max quantifier</entry>
</row>
<row>
<entry>}</entry><entry>end min/max quantifier</entry>
</row>
</tbody>
</tgroup>
</table>
Part of a pattern that is in square brackets is called a
<link linkend="regexp.reference.character-classes">character class</link>. In a character class the only
meta-characters are:
<table>
<title>Meta-characters inside square brackets (<emphasis>character classes</emphasis>)</title>
<tgroup cols="2">
<thead>
<row>
<entry>Meta-character</entry><entry>Description</entry>
</row>
</thead>
<tbody>
<row>
<entry>\</entry><entry>general escape character</entry>
</row>
<row>
<entry>^</entry><entry>negate the class, but only if the first character</entry>
</row>
<row>
<entry>-</entry><entry>indicates character range</entry>
</row>
</tbody>
</tgroup>
</table>
The following sections describe the use of each of the
meta-characters.
</para>
</section>
<section xml:id="regexp.reference.escape">
<title>Escape sequences</title>
<para>
The backslash character has several uses. Firstly, if it is
followed by a non-alphanumeric character, it takes away any
special meaning that character may have. This use of
backslash as an escape character applies both inside and
outside character classes.
</para>
<para>
For example, if you want to match a "*" character, you write
"\*" in the pattern. This applies whether or not the
following character would otherwise be interpreted as a
meta-character, so it is always safe to precede a non-alphanumeric
with "\" to specify that it stands for itself. In
particular, if you want to match a backslash, you write "\\".
</para>
<note>
<para>
Single and double quoted PHP <link
linkend="language.types.string.syntax">strings</link> also treat the
backslash as an escape character. Thus, to match a literal \ with the regular
expression \\, "\\\\" or '\\\\' must be used in PHP code.
</para>
</note>
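<para>
The double layer of escaping can be sketched as follows. The subject string
is an assumption chosen for illustration.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// In a single quoted string, \t is not an escape, so this subject
// contains one literal backslash.
$subject = 'C:\temp';

// Four backslashes in the PHP literal become two in the pattern,
// which the regex engine reads as one literal backslash.
$patternLength = strlen('\\\\');
$single = preg_match('/\\\\/', $subject);
$double = preg_match("/\\\\/", $subject);

var_dump($patternLength, $single, $double);
?>
]]>
</programlisting>
</informalexample>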
<para>
If a pattern is compiled with the
<link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link> option,
whitespace in the pattern (other than in a character class) and
characters between a "#" outside a character class and the next newline
character are ignored. An escaping backslash can be used to include a
whitespace or "#" character as part of the pattern.
</para>
<para>
A second use of backslash provides a way of encoding
non-printing characters in patterns in a visible manner. There
is no restriction on the appearance of non-printing characters,
apart from the binary zero that terminates a pattern,
but when a pattern is being prepared by text editing, it is
usually easier to use one of the following escape sequences
than the binary character it represents:
</para>
<para>
<variablelist>
<varlistentry>
<term><emphasis>\a</emphasis></term>
<listitem>
<simpara>alarm, that is, the BEL character (hex 07)</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term><emphasis>\cx</emphasis></term>
<listitem>
<simpara>"control-x", where x is any character</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term><emphasis>\e</emphasis></term>
<listitem>
<simpara>escape (hex 1B)</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term><emphasis>\f</emphasis></term>
<listitem>
<simpara>formfeed (hex 0C)</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term><emphasis>\n</emphasis></term>
<listitem>
<simpara>newline (hex 0A)</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term><emphasis>\p{xx}</emphasis></term>
<listitem>
<simpara>
a character with the xx property, see
<link linkend="regexp.reference.unicode">unicode properties</link>
for more info
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term><emphasis>\P{xx}</emphasis></term>
<listitem>
<simpara>
a character without the xx property, see
<link linkend="regexp.reference.unicode">unicode properties</link>
for more info
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term><emphasis>\r</emphasis></term>
<listitem>
<simpara>carriage return (hex 0D)</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term><emphasis>\R</emphasis></term>
<listitem>
<simpara>line break: matches \n, \r and \r\n</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term><emphasis>\t</emphasis></term>
<listitem>
<simpara>tab (hex 09)</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term><emphasis>\xhh</emphasis></term>
<listitem>
<simpara>
character with hex code hh
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term><emphasis>\ddd</emphasis></term>
<listitem>
<simpara>character with octal code ddd, or backreference</simpara>
</listitem>
</varlistentry>
</variablelist>
</para>
<para>
The precise effect of "<literal>\cx</literal>" is as follows:
if "<literal>x</literal>" is a lower case letter, it is converted
to upper case. Then bit 6 of the character (hex 40) is inverted.
Thus "<literal>\cz</literal>" becomes hex 1A, but
"<literal>\c{</literal>" becomes hex 3B, while "<literal>\c;</literal>"
becomes hex 7B.
</para>
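<para>
The rule above can be reproduced with a few lines of PHP. The helper
<literal>control_char</literal> is hypothetical and exists only to mirror the
description; it is not part of PCRE or PHP.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// Hypothetical helper: upper-case a lower case letter, then invert
// bit 6 (hex 40) of the character.
function control_char(string $x): string
{
    return chr(ord(strtoupper($x)) ^ 0x40);
}

$cz = bin2hex(control_char('z'));
$cBrace = bin2hex(control_char('{'));
$cSemicolon = bin2hex(control_char(';'));

// \cz in a pattern matches the same byte the helper computes.
$matched = preg_match('/\cz/', control_char('z'));

var_dump($cz, $cBrace, $cSemicolon, $matched);
?>
]]>
</programlisting>
</informalexample>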
<para>
After "<literal>\x</literal>", up to two hexadecimal digits are
read (letters can be in upper or lower case).
In <emphasis>UTF-8 mode</emphasis>, "<literal>\x{...}</literal>" is
allowed, where the contents of the braces is a string of hexadecimal
digits. It is interpreted as a UTF-8 character whose code number is the
given hexadecimal number. The original hexadecimal escape sequence,
<literal>\xhh</literal>, matches a two-byte UTF-8 character if the value
is greater than 127.
</para>
<para>
After "<literal>\0</literal>" up to two further octal digits are read.
In both cases, if there are fewer than two digits, just those that
are present are used. Thus the sequence "<literal>\0\x\07</literal>"
specifies two binary zeros followed by a BEL character. Make sure you
supply two digits after the initial zero if the character
that follows is itself an octal digit.
</para>
<para>
The handling of a backslash followed by a digit other than 0
is complicated. Outside a character class, PCRE reads it
and any following digits as a decimal number. If the number
is less than 10, or if there have been at least that many
previous capturing left parentheses in the expression, the
entire sequence is taken as a <emphasis>back reference</emphasis>. A description
of how this works is given later, following the discussion
of parenthesized subpatterns.
</para>
<para>
Inside a character class, or if the decimal number is
greater than 9 and there have not been that many capturing
subpatterns, PCRE re-reads up to three octal digits following
the backslash, and generates a single byte from the
least significant 8 bits of the value. Any subsequent digits
stand for themselves. For example:
</para>
<para>
<variablelist>
<varlistentry>
<term><emphasis>\040</emphasis></term>
<listitem><simpara>is another way of writing a space</simpara></listitem>
</varlistentry>
<varlistentry>
<term><emphasis>\40</emphasis></term>
<listitem>
<simpara>
is the same, provided there are fewer than 40
previous capturing subpatterns
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term><emphasis>\7</emphasis></term>
<listitem><simpara>is always a back reference</simpara></listitem>
</varlistentry>
<varlistentry>
<term><emphasis>\11</emphasis></term>
<listitem>
<simpara>
might be a back reference, or another way of
writing a tab
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term><emphasis>\011</emphasis></term>
<listitem><simpara>is always a tab</simpara></listitem>
</varlistentry>
<varlistentry>
<term><emphasis>\0113</emphasis></term>
<listitem><simpara>is a tab followed by the character "3"</simpara></listitem>
</varlistentry>
<varlistentry>
<term><emphasis>\113</emphasis></term>
<listitem>
<simpara>
is the character with octal code 113 (since there
can be no more than 99 back references)
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term><emphasis>\377</emphasis></term>
<listitem><simpara>is a byte consisting entirely of 1 bits</simpara></listitem>
</varlistentry>
<varlistentry>
<term><emphasis>\81</emphasis></term>
<listitem>
<simpara>
is either a back reference, or a binary zero
followed by the two characters "8" and "1"
</simpara>
</listitem>
</varlistentry>
</variablelist>
</para>
<para>
Note that octal values of 100 or greater must not be
introduced by a leading zero, because no more than three octal
digits are ever read.
</para>
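<para>
Some of the cases listed above can be checked with a short sketch. The
patterns are single quoted so that PHP passes the backslash sequences to PCRE
unchanged; the subjects are illustrative.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
$space = preg_match('/^a\040b$/', 'a b');     // \040 is a space
$tab = preg_match('/^\011$/', "\t");          // \011 is a tab
$tabThen3 = preg_match('/^\0113$/', "\t3");   // a tab followed by "3"
$octalK = preg_match('/^\113$/', 'K');        // octal 113 is "K"
$backref = preg_match('/^(a)\1$/', 'aa');     // \1 refers to group 1

var_dump($space, $tab, $tabThen3, $octalK, $backref);
?>
]]>
</programlisting>
</informalexample>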
<para>
All the sequences that define a single byte value can be
used both inside and outside character classes. In addition,
inside a character class, the sequence "<literal>\b</literal>"
is interpreted as the backspace character (hex 08). Outside a character
class it has a different meaning (see below).
</para>
<para>
The third use of backslash is for specifying generic
character types:
</para>
<para>
<variablelist>
<varlistentry>
<term><emphasis>\d</emphasis></term>
<listitem><simpara>any decimal digit</simpara></listitem>
</varlistentry>
<varlistentry>
<term><emphasis>\D</emphasis></term>
<listitem><simpara>any character that is not a decimal digit</simpara></listitem>
</varlistentry>
<varlistentry>
<term><emphasis>\h</emphasis></term>
<listitem><simpara>any horizontal whitespace character</simpara></listitem>
</varlistentry>
<varlistentry>
<term><emphasis>\H</emphasis></term>
<listitem><simpara>any character that is not a horizontal whitespace character</simpara></listitem>
</varlistentry>
<varlistentry>
<term><emphasis>\s</emphasis></term>
<listitem><simpara>any whitespace character</simpara></listitem>
</varlistentry>
<varlistentry>
<term><emphasis>\S</emphasis></term>
<listitem><simpara>any character that is not a whitespace character</simpara></listitem>
</varlistentry>
<varlistentry>
<term><emphasis>\v</emphasis></term>
<listitem><simpara>any vertical whitespace character</simpara></listitem>
</varlistentry>
<varlistentry>
<term><emphasis>\V</emphasis></term>
<listitem><simpara>any character that is not a vertical whitespace character</simpara></listitem>
</varlistentry>
<varlistentry>
<term><emphasis>\w</emphasis></term>
<listitem><simpara>any "word" character</simpara></listitem>
</varlistentry>
<varlistentry>
<term><emphasis>\W</emphasis></term>
<listitem><simpara>any "non-word" character</simpara></listitem>
</varlistentry>
</variablelist>
</para>
<para>
Each pair of escape sequences partitions the complete set of
characters into two disjoint sets. Any given character
matches one, and only one, of each pair.
</para>
<para>
The "whitespace" characters are HT (9), LF (10), FF (12), CR (13),
and space (32). However, if locale-specific matching is happening,
characters with code points in the range 128-255 may also be considered
as whitespace characters, for instance, NBSP (A0).
</para>
<para>
A "word" character is any letter or digit or the underscore
character, that is, any character which can be part of a
Perl "<emphasis>word</emphasis>". The definition of letters and digits is
controlled by PCRE's character tables, and may vary if locale-specific
matching is taking place. For example, in the "fr" (French) locale, some
character codes greater than 128 are used for accented letters,
and these are matched by <literal>\w</literal>.
</para>
<para>
These character type sequences can appear both inside and
outside character classes. They each match one character of
the appropriate type. If the current matching point is at
the end of the subject string, all of them fail, since there
is no character to match.
</para>
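<para>
A brief sketch of the generic character types in use, with made-up subject
strings:
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// \d+ collects runs of decimal digits.
preg_match_all('/\d+/', 'Order 66, item 7', $digits);

// \s+ splits on any run of whitespace.
$words = preg_split('/\s+/', "a \t b\nc");

// \w covers letters, digits and the underscore.
$isWord = preg_match('/^\w+$/', 'snake_case1');

// At the end of the subject there is no character left to match.
$emptySubject = preg_match('/\d/', '');

var_dump($digits[0], $words, $isWord, $emptySubject);
?>
]]>
</programlisting>
</informalexample>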
<para>
The fourth use of backslash is for certain simple
assertions. An assertion specifies a condition that has to be met
at a particular point in a match, without consuming any
characters from the subject string. The use of subpatterns
for more complicated assertions is described below. The
backslashed assertions are
</para>
<para>
<variablelist>
<varlistentry>
<term><emphasis>\b</emphasis></term>
<listitem><simpara>word boundary</simpara></listitem>
</varlistentry>
<varlistentry>
<term><emphasis>\B</emphasis></term>
<listitem><simpara>not a word boundary</simpara></listitem>
</varlistentry>
<varlistentry>
<term><emphasis>\A</emphasis></term>
<listitem><simpara>start of subject (independent of multiline mode)</simpara></listitem>
</varlistentry>
<varlistentry>
<term><emphasis>\Z</emphasis></term>
<listitem>
<simpara>
end of subject or newline at end (independent of
multiline mode)
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term><emphasis>\z</emphasis></term>
<listitem><simpara>end of subject (independent of multiline mode)</simpara></listitem>
</varlistentry>
<varlistentry>
<term><emphasis>\G</emphasis></term>
<listitem><simpara>first matching position in subject</simpara></listitem>
</varlistentry>
</variablelist>
</para>
<para>
These assertions may not appear in character classes (but
note that "<literal>\b</literal>" has a different meaning, namely the backspace
character, inside a character class).
</para>
<para>
A word boundary is a position in the subject string where
the current character and the previous character do not both
match <literal>\w</literal> or <literal>\W</literal> (i.e. one matches
<literal>\w</literal> and the other matches
<literal>\W</literal>), or the start or end of the string if the first
or last character matches <literal>\w</literal>, respectively.
</para>
<para>
The <literal>\A</literal>, <literal>\Z</literal>, and
<literal>\z</literal> assertions differ from the traditional
circumflex and dollar (described in <link linkend="regexp.reference.anchors">anchors</link>)
in that they only ever match at the very start and end of the subject string,
whatever options are set. They are not affected by the
<link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link> or
<link linkend="reference.pcre.pattern.modifiers">PCRE_DOLLAR_ENDONLY</link>
options. The difference between <literal>\Z</literal> and
<literal>\z</literal> is that <literal>\Z</literal> matches before a
newline that is the last character of the string as well as at the end of
the string, whereas <literal>\z</literal> matches only at the end.
</para>
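<para>
The difference between these assertions and the dollar can be sketched with
a subject that ends in a newline. The subjects are illustrative.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
$subject = "foo\n";

$upperZ = preg_match('/foo\Z/', $subject);      // before the final newline
$lowerZ = preg_match('/foo\z/', $subject);      // only at the very end
$dollar = preg_match('/foo$/', $subject);       // like \Z by default
$dollarOnly = preg_match('/foo$/D', $subject);  // D: PCRE_DOLLAR_ENDONLY

// In multiline mode $ matches at each line end, but \Z still does not.
$multiDollar = preg_match('/foo$/m', "foo\nbar");
$multiUpperZ = preg_match('/foo\Z/m', "foo\nbar");

var_dump($upperZ, $lowerZ, $dollar, $dollarOnly, $multiDollar, $multiUpperZ);
?>
]]>
</programlisting>
</informalexample>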
<para>
The <literal>\G</literal> assertion is true only when the current
matching position is at the start point of the match, as specified by
the <parameter>offset</parameter> argument of
<function>preg_match</function>. It differs from <literal>\A</literal>
when the value of <parameter>offset</parameter> is non-zero.
</para>
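<para>
A minimal sketch of <literal>\G</literal> compared with
<literal>\A</literal> when a non-zero <parameter>offset</parameter> is
passed to <function>preg_match</function>:
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
$subject = 'foobar';

// Matching starts at offset 3, which is where \G is anchored.
$atOffset = preg_match('/\Gbar/', $subject, $m, 0, 3);

// \A still refers to the real start of the subject.
$atStart = preg_match('/\Abar/', $subject, $m, 0, 3);

// With the default offset of 0, \G behaves like \A here.
$noOffset = preg_match('/\Gbar/', $subject);

var_dump($atOffset, $atStart, $noOffset);
?>
]]>
</programlisting>
</informalexample>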
<para>
<literal>\Q</literal> and <literal>\E</literal> can be used to ignore
regexp metacharacters in the pattern. For example:
<literal>\w+\Q.$.\E$</literal> will match one or more word characters,
followed by literals <literal>.$.</literal> and anchored at the end of
the string. Note that this does not change the behavior of
delimiters; for instance the pattern <literal>#\Q#\E#$</literal>
is not valid, because the second <literal>#</literal> marks the end
of the pattern, and the remaining <literal>\E#$</literal> is interpreted as
invalid modifiers.
</para>
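<para>
The quoting behaviour can be sketched as follows, using a delimiter that does
not occur in the quoted text. The subjects are illustrative.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
$pattern = '/\w+\Q.$.\E$/';

// Between \Q and \E the dots and the dollar are literal characters.
$literal = preg_match($pattern, 'abc.$.');

// Other characters in those positions therefore do not match.
$other = preg_match($pattern, 'abcx$y');

var_dump($literal, $other);
?>
]]>
</programlisting>
</informalexample>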
<para>
<literal>\K</literal> can be used to reset the match start.
For example, the pattern <literal>foo\Kbar</literal> matches
"foobar", but reports that it has matched "bar". The use of
<literal>\K</literal> does not interfere with the setting of captured
substrings. For example, when the pattern <literal>(foo)\Kbar</literal>
matches "foobar", the first substring is still set to "foo".
</para>
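<para>
A short sketch of <literal>\K</literal>, including its effect on a
replacement. The replacement text is an arbitrary placeholder.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// The reported match starts after \K.
preg_match('/foo\Kbar/', 'foobar', $plain);

// Captured substrings are not affected by \K.
preg_match('/(foo)\Kbar/', 'foobar', $captured);

// Only the part after \K is replaced.
$replaced = preg_replace('/foo\Kbar/', 'BAZ', 'foobar');

var_dump($plain, $captured, $replaced);
?>
]]>
</programlisting>
</informalexample>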
</section>
<section xml:id="regexp.reference.unicode">
<title>Unicode character properties</title>
<para>
Since 5.1.0, three
additional escape sequences to match generic character types are available
when <emphasis>UTF-8 mode</emphasis> is selected. They are:
</para>
<variablelist>
<varlistentry>
<term><emphasis>\p{xx}</emphasis></term>
<listitem><simpara>a character with the xx property</simpara></listitem>
</varlistentry>
<varlistentry>
<term><emphasis>\P{xx}</emphasis></term>
<listitem><simpara>a character without the xx property</simpara></listitem>
</varlistentry>
<varlistentry>
<term><emphasis>\X</emphasis></term>
<listitem><simpara>an extended Unicode sequence</simpara></listitem>
</varlistentry>
</variablelist>
<para>
The property names represented by <literal>xx</literal> above are limited
to the Unicode general category properties. Each character has exactly one
such property, specified by a two-letter abbreviation. For compatibility with
Perl, negation can be specified by including a circumflex between the
opening brace and the property name. For example, <literal>\p{^Lu}</literal>
is the same as <literal>\P{Lu}</literal>.
</para>
<para>
If only one letter is specified with <literal>\p</literal> or
<literal>\P</literal>, it includes all the properties that start with that
letter. In this case, in the absence of negation, the curly brackets in the
escape sequence are optional; these two examples have the same effect:
</para>
<informalexample>
<programlisting>
<![CDATA[
\p{L}
\pL
]]>
</programlisting>
</informalexample>
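<para>
The equivalences described above can be checked with a small sketch. It
assumes the <literal>u</literal> modifier is used to select UTF-8 mode; the
subject string is illustrative and is written with
<literal>\u{...}</literal> string escapes.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// The subject is "Eric et Zoe" with an acute accent on the first
// letter and a diaeresis on the last.
$subject = "\u{00C9}ric et Zo\u{00EB}";

// Upper case letters only.
preg_match_all('/\p{Lu}/u', $subject, $upper);

// \pL and \p{L} are equivalent, as are \p{^Lu} and \P{Lu}.
$short = preg_match_all('/\pL/u', $subject);
$long = preg_match_all('/\p{L}/u', $subject);
$caret = preg_match_all('/\p{^Lu}/u', $subject);
$negated = preg_match_all('/\P{Lu}/u', $subject);

var_dump($upper[0], $short, $long, $caret, $negated);
?>
]]>
</programlisting>
</informalexample>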
<table>
<title>Supported property codes</title>
<tgroup cols="3">
<thead>
<row>
<entry>Property</entry>
<entry>Matches</entry>
<entry>Notes</entry>
</row>
</thead>
<tbody>
<row>
<entry><literal>C</literal></entry>
<entry>Other</entry>
<entry></entry>
</row>
<row>
<entry><literal>Cc</literal></entry>
<entry>Control</entry>
<entry></entry>
</row>
<row>
<entry><literal>Cf</literal></entry>
<entry>Format</entry>
<entry></entry>
</row>
<row>
<entry><literal>Cn</literal></entry>
<entry>Unassigned</entry>
<entry></entry>
</row>
<row>
<entry><literal>Co</literal></entry>
<entry>Private use</entry>
<entry></entry>
</row>
<row rowsep="1">
<entry><literal>Cs</literal></entry>
<entry>Surrogate</entry>
<entry></entry>
</row>
<row>
<entry><literal>L</literal></entry>
<entry>Letter</entry>
<entry>
Includes the following properties: <literal>Ll</literal>,
<literal>Lm</literal>, <literal>Lo</literal>, <literal>Lt</literal> and
<literal>Lu</literal>.
</entry>
</row>
<row>
<entry><literal>Ll</literal></entry>
<entry>Lower case letter</entry>
<entry></entry>
</row>
<row>
<entry><literal>Lm</literal></entry>
<entry>Modifier letter</entry>
<entry></entry>
</row>
<row>
<entry><literal>Lo</literal></entry>
<entry>Other letter</entry>
<entry></entry>
</row>
<row>
<entry><literal>Lt</literal></entry>
<entry>Title case letter</entry>
<entry></entry>
</row>
<row rowsep="1">
<entry><literal>Lu</literal></entry>
<entry>Upper case letter</entry>
<entry></entry>
</row>
<row>
<entry><literal>M</literal></entry>
<entry>Mark</entry>
<entry></entry>
</row>
<row>
<entry><literal>Mc</literal></entry>
<entry>Spacing mark</entry>
<entry></entry>
</row>
<row>
<entry><literal>Me</literal></entry>
<entry>Enclosing mark</entry>
<entry></entry>
</row>
<row rowsep="1">
<entry><literal>Mn</literal></entry>
<entry>Non-spacing mark</entry>
<entry></entry>
</row>
<row>
<entry><literal>N</literal></entry>
<entry>Number</entry>
<entry></entry>
</row>
<row>
<entry><literal>Nd</literal></entry>
<entry>Decimal number</entry>
<entry></entry>
</row>
<row>
<entry><literal>Nl</literal></entry>
<entry>Letter number</entry>
<entry></entry>
</row>
<row rowsep="1">
<entry><literal>No</literal></entry>
<entry>Other number</entry>
<entry></entry>
</row>
<row>
<entry><literal>P</literal></entry>
<entry>Punctuation</entry>
<entry></entry>
</row>
<row>
<entry><literal>Pc</literal></entry>
<entry>Connector punctuation</entry>
<entry></entry>
</row>
<row>
<entry><literal>Pd</literal></entry>
<entry>Dash punctuation</entry>
<entry></entry>
</row>
<row>
<entry><literal>Pe</literal></entry>
<entry>Close punctuation</entry>
<entry></entry>
</row>
<row>
<entry><literal>Pf</literal></entry>
<entry>Final punctuation</entry>
<entry></entry>
</row>
<row>
<entry><literal>Pi</literal></entry>
<entry>Initial punctuation</entry>
<entry></entry>
</row>
<row>
<entry><literal>Po</literal></entry>
<entry>Other punctuation</entry>
<entry></entry>
</row>
<row rowsep="1">
<entry><literal>Ps</literal></entry>
<entry>Open punctuation</entry>
<entry></entry>
</row>
<row>
<entry><literal>S</literal></entry>
<entry>Symbol</entry>
<entry></entry>
</row>
<row>
<entry><literal>Sc</literal></entry>
<entry>Currency symbol</entry>
<entry></entry>
</row>
<row>
<entry><literal>Sk</literal></entry>
<entry>Modifier symbol</entry>
<entry></entry>
</row>
<row>
<entry><literal>Sm</literal></entry>
<entry>Mathematical symbol</entry>
<entry></entry>
</row>
<row rowsep="1">
<entry><literal>So</literal></entry>
<entry>Other symbol</entry>
<entry>Includes emojis</entry>
</row>
<row>
<entry><literal>Z</literal></entry>
<entry>Separator</entry>
<entry></entry>
</row>
<row>
<entry><literal>Zl</literal></entry>
<entry>Line separator</entry>
<entry></entry>
</row>
<row>
<entry><literal>Zp</literal></entry>
<entry>Paragraph separator</entry>
<entry></entry>
</row>
<row>
<entry><literal>Zs</literal></entry>
<entry>Space separator</entry>
<entry></entry>
</row>
</tbody>
</tgroup>
</table>
<para>
Extended properties such as <literal>InMusicalSymbols</literal> are not
supported by PCRE.
</para>
<para>
Specifying case-insensitive (caseless) matching does not affect these escape sequences.
For example, <literal>\p{Lu}</literal> always matches only upper case letters.
</para>
<para>
Sets of Unicode characters are defined as belonging to certain scripts. A
character from one of these sets can be matched using a script name. For
example:
</para>
<itemizedlist>
<listitem>
<simpara><literal>\p{Greek}</literal></simpara>
</listitem>
<listitem>
<simpara><literal>\P{Han}</literal></simpara>
</listitem>
</itemizedlist>
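<para>
As a minimal sketch, again assuming the <literal>u</literal> modifier for
UTF-8 mode, script names can be used like this. The subject strings are
illustrative.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// Three Greek letters (alpha, beta, gamma), written with escapes.
$greek = preg_match('/^\p{Greek}+$/u', "\u{03B1}\u{03B2}\u{03B3}");

// Latin letters do not belong to the Greek script.
$latin = preg_match('/\p{Greek}/u', 'abc');

// \P{Han} matches any character that is not in the Han script.
$noHan = preg_match('/^\P{Han}+$/u', 'abc');

var_dump($greek, $latin, $noHan);
?>
]]>
</programlisting>
</informalexample>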
<para>
Those that are not part of an identified script are lumped together as
<literal>Common</literal>. The current list of scripts is:
</para>
<table>
<title>Supported scripts</title>
<tgroup cols="5">
<tbody>
<row>
<entry><literal>Arabic</literal></entry>
<entry><literal>Armenian</literal></entry>
<entry><literal>Avestan</literal></entry>
<entry><literal>Balinese</literal></entry>
<entry><literal>Bamum</literal></entry>
</row>
<row>
<entry><literal>Batak</literal></entry>
<entry><literal>Bengali</literal></entry>
<entry><literal>Bopomofo</literal></entry>
<entry><literal>Brahmi</literal></entry>
<entry><literal>Braille</literal></entry>
</row>
<row>
<entry><literal>Buginese</literal></entry>
<entry><literal>Buhid</literal></entry>
<entry><literal>Canadian_Aboriginal</literal></entry>
<entry><literal>Carian</literal></entry>
<entry><literal>Chakma</literal></entry>
</row>
<row>
<entry><literal>Cham</literal></entry>
<entry><literal>Cherokee</literal></entry>
<entry><literal>Common</literal></entry>
<entry><literal>Coptic</literal></entry>
<entry><literal>Cuneiform</literal></entry>
</row>
<row>
<entry><literal>Cypriot</literal></entry>
<entry><literal>Cyrillic</literal></entry>
<entry><literal>Deseret</literal></entry>
<entry><literal>Devanagari</literal></entry>
<entry><literal>Egyptian_Hieroglyphs</literal></entry>
</row>
<row>
<entry><literal>Ethiopic</literal></entry>
<entry><literal>Georgian</literal></entry>
<entry><literal>Glagolitic</literal></entry>
<entry><literal>Gothic</literal></entry>
<entry><literal>Greek</literal></entry>
</row>
<row>
<entry><literal>Gujarati</literal></entry>
<entry><literal>Gurmukhi</literal></entry>
<entry><literal>Han</literal></entry>
<entry><literal>Hangul</literal></entry>
<entry><literal>Hanunoo</literal></entry>
</row>
<row>
<entry><literal>Hebrew</literal></entry>
<entry><literal>Hiragana</literal></entry>
<entry><literal>Imperial_Aramaic</literal></entry>
<entry><literal>Inherited</literal></entry>
<entry><literal>Inscriptional_Pahlavi</literal></entry>
</row>
<row>
<entry><literal>Inscriptional_Parthian</literal></entry>
<entry><literal>Javanese</literal></entry>
<entry><literal>Kaithi</literal></entry>
<entry><literal>Kannada</literal></entry>
<entry><literal>Katakana</literal></entry>
</row>
<row>
<entry><literal>Kayah_Li</literal></entry>
<entry><literal>Kharoshthi</literal></entry>
<entry><literal>Khmer</literal></entry>
<entry><literal>Lao</literal></entry>
<entry><literal>Latin</literal></entry>
</row>
<row>
<entry><literal>Lepcha</literal></entry>
<entry><literal>Limbu</literal></entry>
<entry><literal>Linear_B</literal></entry>
<entry><literal>Lisu</literal></entry>
<entry><literal>Lycian</literal></entry>
</row>
<row>
<entry><literal>Lydian</literal></entry>
<entry><literal>Malayalam</literal></entry>
<entry><literal>Mandaic</literal></entry>
<entry><literal>Meetei_Mayek</literal></entry>
<entry><literal>Meroitic_Cursive</literal></entry>
</row>
<row>
<entry><literal>Meroitic_Hieroglyphs</literal></entry>
<entry><literal>Miao</literal></entry>
<entry><literal>Mongolian</literal></entry>
<entry><literal>Myanmar</literal></entry>
<entry><literal>New_Tai_Lue</literal></entry>
</row>
<row>
<entry><literal>Nko</literal></entry>
<entry><literal>Ogham</literal></entry>
<entry><literal>Old_Italic</literal></entry>
<entry><literal>Old_Persian</literal></entry>
<entry><literal>Old_South_Arabian</literal></entry>
</row>
<row>
<entry><literal>Old_Turkic</literal></entry>
<entry><literal>Ol_Chiki</literal></entry>
<entry><literal>Oriya</literal></entry>
<entry><literal>Osmanya</literal></entry>
<entry><literal>Phags_Pa</literal></entry>