If you’ve ever talked about regular expressions with another programmer, you’ve invariably heard:
Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.
The quote is from Jamie Zawinski, a world class hacker. [But in that same discussion he also said]: “Perl’s nature encourages the use of regular expressions almost to the exclusion of all other techniques; they are far and away the most ‘obvious’ (at least, to people who don’t know any better) way to get from point A to point B.
Here’s the point Jamie was trying to make: not that regular expressions are evil, per se, but that overuse of regular expressions is evil… Regular expressions are like a particularly spicy hot sauce, to be used in moderation and with restraint only when appropriate. If you drench your plate in hot sauce, you’re going to be very, very sorry later. In the same way that I can’t imagine food without a dash of hot sauce now and then, I can’t imagine programming without an occasional regular expression. It’d be a bland, unsatisfying experience.—Jeff Atwood
Regular expressions were originally defined for theoretical reasons, but they have turned out to be quite handy in day-to-day programming, working on strings of ASCII or Unicode characters.
Working exclusively with concatenation, union, and star can sometimes be a bit awkward, so there a number of conventional abbreviations used when programming with regular expressions:
| Abbreviation | Meaning |
|---|---|
R? | (R|ε) |
R+ | RR* |
[Aabcdfghiou] | (A|a|b|c|d|f|g|h|i|o|u) |
[Aa-df-iou] | (same as above) |
[^Aabcdfghiou] | Any single character other than A,a,b,c,d,f,g,h,i,o,u. |
[^Aa-df-iou] | (same as above) |
. | Any single character (except newline, usually). |
R{3,5} | RRR|RRRR|RRRRR |
\d | Any digit character |
\s | Any whitespace character (space, tab, newline, etc.) |
^ | Start of input (or start of line) |
$ | End of input (or end of line) |
The RX library used this regular expression in one of its unit tests:
M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]
That is, there must be M, then either o or u, then
optionally an apostrophe, then one or more m’s, and so on.
Examples of strings matchine this regular expression include:
Muammar Qaddafi
Mo'ammar Gadhafi
Muammar Kaddafi
Muammar Qadhafi
Moammar El Kadhafi
Muammar Gadafi
Mu'ammar al-Qadafi
Moamer El Kazzafi
Moamar al-Gaddafi
Mu'ammar Al Qathafi
Muammar Al Qathafi
Mo'ammar el-Gadhafi
Moamar El Kadhafi
Muammar al-Qadhafi
Mu'ammar al-Qadhdhafi
Mu'ammar Qadafi
Moamar Gaddafi
Mu'ammar Qadhdhafi
Muammar al-Khaddafi
Mu'amar al-Kadafi
Muammar Ghaddafy
Muammar Ghadafi
Muammar Ghaddafi
Muamar Kaddafi
Muammar Quathafi
Muammar Gheddafi
Muamar Al-Kaddafi
Moammar Khadafy
Moammar Qudhafi
Mu'ammar al-Qaddafi
Mu'ammar Muhammad Abu Minyar al-Qadhafi
The above example is a regular expression that matches all plausible
transliterations of the name of a former Libyan dictator from Arabic
into the English (Roman) alphabet. It also matches some very implausible
transliterations (e.g., where the m+ part of the regular expression
corresponds to 17 m’s). But if we are searching a big database of
newspaper and magazine articles, it’s fine to use an overinclusive
regular expression because that ensures we’ll catch all relevant
articles.
On the other hand, there are many situations where we want a regular expression that precisely matches the strings we are interested in.
Suppose we want a regular expression that matches dates in November.
[November|Nov.] [1-30]?Recall that characters in square brackets mean “any single character
among those listed.” So [November|Nov.] [1-30] is just
another way of writing
(N|o|v|e|m|b|e|r|\||N|o|v|\.) (1|2|3|0) and matches input strings
like v 2 or m 1.
Inside brackets, all characters except ^ and - always stand for
themselves. Outside brackets, we often need to say \| to mean the
literal bar character, or \. to mean a literal period.
What should we write instead?
(November|Nov\.) ([1-9]|[12][0-9]|30)
Here is a regular expression for legal C identifiers (variable names),
which must begin with a letter or underscore, and can contain further
letters, digits, and underscores (e.g., main or
__Z6rotateiii)
[a-zA-Z_][a-zA-Z0-9_]*
Here is a regular expression for legal Ada identifiers (variable names),
which must begin with a letter, and can contain further letters, digits,
and underscores, but cannot have two underscores in a row and cannot end
in an underscore (e.g., count2 or first_nonzero_entry):
[a-zA-Z]([a-zA-Z0-9]|_[a-zA-Z0-9])*
Another possible solution would be:
[a-zA-Z](_?[a-zA-Z0-9]+)*
The first standard for the C language (ANSI C) described the legal integer constants as follows:
An integer constant consisting of a sequence of digits is taken to be octal if it begins with 0 (digit zero), decimal otherwise. Octal constants do not contain the digits 8 or 9. A sequence of digits preceded by 0x or 0X (digit zero) is taken to be a hexadecimal integer. The hexadecimal digits include a or A through £ or F with values 10 through 15.
An integer constant may be suffixed by the letter u or U, to specify that it is unsigned. It may also be suffixed by the letter l or L to specify that it is long.
This prose description is arguably ambiguous. Can the L suffix come before the U suffix? When it refers to a sequence of digits preceded by 0X, can this sequence be an empty sequence?
We can make the description unambiguous by giving a regular expression, e.g.,
([1-9][0-9]+ | 0[0-7]* | 0[Xx][0-9a-fA-F]+)
([Ll][Uu]?|[Uu][Ll]?)?
(There are many other equivalent regular expressions as well.)
Different applications may use different syntaxes, e.g., which of |
and \| means the literal bar character, and which means the regular-expression
alternative operation.
For example, if we want a regular expression that matches the words
bear, bead, bar, and bad, we must write:
b(ea|a)(r|d) in Perl or VS Codeb\(ea|a\)\(r|d\) in the tool sedb\(ea\|a\)\(r\|d\) in the editor EmacsThe moral is that when you want to use regular expressions in a new context, you have to learn what syntax it expects when writing regular expressions.
The command line in most operating systems uses syntax that looks somewhat like regular expressions, but is slightly different (and more restrictive). The common name for this command-line syntax is “globbing”.
Suppose we have a directory with the following files:
> ls
bad bag bar bead beg bear beer bug ear rag rear rugThen:
> ls b*r
bar bear beer(A star in a glob corresponds to .* in a regular expression.)
> ls b?ar
bear(A question mark in a glob corresponds to .? in a regular expression.)
> ls [br]ear`
bear rear(Bracket notation has the same meaning.)
> {`ls b{ea,a}{r,d}
bad bar bead bear(Globs use a list of items in curly braces instead of |.)
> `ls {?,r?}ar`
bar ear rear(We can nest glob notation.)
> ls b.g
ls: b.g: No such file or directory:(In globs, a period stands for the period character.)