If you’ve ever talked about regular expressions with another programmer, you’ve invariably heard:
Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.
The quote is from Jamie Zawinski, a world class hacker. [But in that same discussion he also said]: “Perl’s nature encourages the use of regular expressions almost to the exclusion of all other techniques; they are far and away the most ‘obvious’ (at least, to people who don’t know any better) way to get from point A to point B.
Here’s the point Jamie was trying to make: not that regular expressions are evil, per se, but that overuse of regular expressions is evil… Regular expressions are like a particularly spicy hot sauce, to be used in moderation and with restraint only when appropriate. If you drench your plate in hot sauce, you’re going to be very, very sorry later. In the same way that I can’t imagine food without a dash of hot sauce now and then, I can’t imagine programming without an occasional regular expression. It’d be a bland, unsatisfying experience.
- Jeff Atwood
Regular expressions were originally defined for theoretical reasons, but they have turned out to be quite handy in day-to-day programming, working on strings of ASCII or Unicode characters.
Working exclusively with concatenation, union, and star can sometimes be a bit awkward, so there a number of conventional abbreviations used when programming with regular expressions:
| Abbreviation | Meaning | |
|---|---|---|
R? |
(R\|ε) |
|
R+ |
RR* |
|
[Aabcdfghiou] |
(A\|a\|b\|c\|d\|f\|g\|h\|i\|o\|u) |
|
[Aa-df-iou] |
(same as above) | |
[^Aabcdfghiou] |
Any single character other than A,a,b,c,d,f,g,h,i,o,u. | |
[^Aa-df-iou] |
(same as above) | |
. |
Any single character (except newline, usually). | |
R{3,5} |
RRR\|RRRR\|RRRRR |
|
\d |
Any digit character | |
\s |
Any whitespace character (space, tab, newline, etc.) | |
^ |
Start of input (or start of line) | |
$ |
End of input (or end of line) |
Example
The RX library used this regular expression in one of its unit tests:
>
M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]
> That is, there must be M, then either o
or u, then optionally an apostrophe, then one or more
m’s, and so on. Examples of strings matchine this regular
expression include: > Muammar Qaddafi
Mo'ammar Gadhafi
Muammar Kaddafi
Muammar Qadhafi
Moammar El Kadhafi
Muammar Gadafi
Mu'ammar al-Qadafi
Moamer El Kazzafi
Moamar al-Gaddafi
Mu'ammar Al Qathafi
Muammar Al Qathafi
Mo'ammar el-Gadhafi
Moamar El Kadhafi
Muammar al-Qadhafi
Mu'ammar al-Qadhdhafi
Mu'ammar Qadafi
Moamar Gaddafi
Mu'ammar Qadhdhafi
Muammar al-Khaddafi
Mu'amar al-Kadafi
Muammar Ghaddafy
Muammar Ghadafi
Muammar Ghaddafi
Muamar Kaddafi
Muammar Quathafi
Muammar Gheddafi
Muamar Al-Kaddafi
Moammar Khadafy
Moammar Qudhafi
Mu'ammar al-Qaddafi
Mu'ammar Muhammad Abu Minyar al-Qadhafi
The above example is a regular expression that matches all plausible
transliterations of the name of a former Libyan dictator from Arabic
into the English (Roman) alphabet. It also matches some very implausible
transliterations (e.g., where the m+ part of the regular
expression corresponds to 17 m’s). But if we are searching
a big database of newspaper and magazine articles, it’s fine to use an
over-inclusive regular expression because that ensures we’ll catch all
relevant articles.
On the other hand, there are many situations where we want a regular expression that precisely matches the strings we are interested in.
Example
Suppose we want a regular expression that matches dates in November.
> >1. What’s wrong with [November|Nov.] [1-30]?
>2. What should we write instead? > >— > >1. Recall that
characters in square brackets mean “any single character among those
listed.” So [November|Nov.] [1-30] is just another way of
writing (N|o|v|e|m|b|e|r|\||N|o|v|\.) (1|2|3|0) and matches
input strings like v 2 or m 1.
> Inside brackets, all characters except ^ and - always stand for
themselves. Outside brackets, we often need to say \| to
mean the literal bar character, or \. to mean a literal
period. > >2. What should we write instead? > >
(November|Nov\.) ([1-9]|[12][0-9]|30)
Example
Here is a regular expression for legal C identifiers (variable
names), which must begin with a letter or underscore, and can contain
further letters, digits, and underscores (e.g., main or
__Z6rotateiii) >
>[a-zA-Z_][a-zA-Z0-9_]*
Example
Here is a regular expression for legal Ada identifiers (variable
names), which must begin with a letter, and can contain further letters,
digits, and underscores, but cannot have two underscores in a row and
cannot end in an underscore (e.g., count2 or
first_nonzero_entry): >
>[a-zA-Z]([a-zA-Z0-9]|_[a-zA-Z0-9])* > Another
possible solution would be: >
>[a-zA-Z](_?[a-zA-Z0-9]+)*
Example
The first standard for the C language (ANSI C) described the legal
integer constants as follows: > > An integer constant consisting
of a sequence of digits is taken to be octal if it begins with 0 (digit
zero), decimal otherwise. Octal constants do not contain the digits 8 or
9. A sequence of digits preceded by 0x or 0X (digit zero) is taken to be
a hexadecimal > integer. The hexadecimal digits include a or A
through £ or F with values 10 through 15.
>
> An integer constant may be suffixed by the letter u or U, to
specify that it is unsigned. It may also be suffixed by the letter l or
L to specify that it is long. > This prose description is arguably
ambiguous. Can the L suffix come before the U suffix? When it refers to
a sequence of digits preceded by 0X, can this sequence be an empty
sequence? > >We can make the description unambiguous by giving a
regular expression, e.g., >
>([1-9][0-9]+ | 0[0-7]* | 0[Xx][0-9a-fA-F]+)
>([Ll][Uu]?|[Uu][Ll]?)? > (There are many other
equivalent regular expressions as well.) ## Warning 1: Syntax
Variations
Different applications may use different syntaxes, e.g., which of
| and \| means the literal bar character, and
which means the regular-expression alternative operation.
For example, if we want a regular expression that matches the words
bear, bead, bar, and
bad, we must write:
b(ea|a)(r|d) in Perl or VS Codeb\(ea|a\)\(r|d\) in the tool sedb\(ea\|a\)\(r\|d\) in the editor EmacsThe moral is that when you want to use regular expressions in a new context, you have to learn what syntax it expects when writing regular expressions.
The command line in most operating systems uses syntax that looks somewhat like regular expressions, but is slightly different (and more restrictive). The common name for this command-line syntax is “globbing”.
Example
Suppose we have a directory with the following files: >
>> > ls >bad bag bar bead beg bear beer bug ear rag rear rug >
> >Then: >
>>> ls b*r >bar bear beer > >(A star in a
glob corresponds to .* in a regular expression.) >
>>> ls b?ar >bear > >(A question mark in
a glob corresponds to .? in a regular expression.) >
>>> ls [br]ear` >bear rear > >(Bracket
notation has the same meaning.) >
>>> ls b{ea,a}{r,d} >bad bar bead bear >
>(Globs use a list of items in curly braces instead of
|.) >
>>> `ls {?,r?}ar` >bar ear rear > >(We
can nest glob notation.) >
>>> ls b.g >ls: b.g: No such file or directory: >
>(In globs, a period stands for the period character.)
Previous: 6.6 Closure Properties