Extended Regular Expressions

If you’ve ever talked about regular expressions with another programmer, you’ve invariably heard:

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

The quote is from Jamie Zawinski, a world class hacker. [But in that same discussion he also said]: “Perl’s nature encourages the use of regular expressions almost to the exclusion of all other techniques; they are far and away the most ‘obvious’ (at least, to people who don’t know any better) way to get from point A to point B.

Here’s the point Jamie was trying to make: not that regular expressions are evil, per se, but that overuse of regular expressions is evil… Regular expressions are like a particularly spicy hot sauce, to be used in moderation and with restraint only when appropriate. If you drench your plate in hot sauce, you’re going to be very, very sorry later. In the same way that I can’t imagine food without a dash of hot sauce now and then, I can’t imagine programming without an occasional regular expression. It’d be a bland, unsatisfying experience.

—Jeff Atwood

Regular expressions were originally defined for theoretical reasons, but they have turned out to be quite handy in day-to-day programming, working on strings of ASCII or Unicode characters.

Working exclusively with concatenation, union, and star can sometimes be a bit awkward, so there a number of conventional abbreviations used when programming with regular expressions:

Abbreviation	Meaning
`R?`	`(R\|ε)`
`R+`	`RR*`
`[Aabcdfghiou]`	`(A\|a\|b\|c\|d\|f\|g\|h\|i\|o\|u)`
`[Aa-df-iou]`	(same as above)
`[^Aabcdfghiou]`	Any single character other than A,a,b,c,d,f,g,h,i,o,u.
`[^Aa-df-iou]`	(same as above)
`.`	Any single character (except newline, usually).
`R{3,5}`	`RRR\|RRRR\|RRRRR`
`\d`	Any digit character
`\s`	Any whitespace character (space, tab, newline, etc.)
`^`	Start of input (or start of line)
`$`	End of input (or end of line)

Example

The RX library used this regular expression in one of its unit tests:

M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]

That is, there must be M, then either o or u, then optionally an apostrophe, then one or more m’s, and so on. Examples of strings matchine this regular expression include:

Muammar Qaddafi
Mo'ammar Gadhafi
Muammar Kaddafi
Muammar Qadhafi
Moammar El Kadhafi
Muammar Gadafi
Mu'ammar al-Qadafi
Moamer El Kazzafi
Moamar al-Gaddafi
Mu'ammar Al Qathafi
Muammar Al Qathafi
Mo'ammar el-Gadhafi
Moamar El Kadhafi
Muammar al-Qadhafi
Mu'ammar al-Qadhdhafi
Mu'ammar Qadafi
Moamar Gaddafi
Mu'ammar Qadhdhafi
Muammar al-Khaddafi
Mu'amar al-Kadafi
Muammar Ghaddafy
Muammar Ghadafi
Muammar Ghaddafi
Muamar Kaddafi
Muammar Quathafi
Muammar Gheddafi
Muamar Al-Kaddafi
Moammar Khadafy
Moammar Qudhafi
Mu'ammar al-Qaddafi
Mu'ammar Muhammad Abu Minyar al-Qadhafi

The above example is a regular expression that matches all plausible transliterations of the name of a former Libyan dictator from Arabic into the English (Roman) alphabet. It also matches some very implausible transliterations (e.g., where the m+ part of the regular expression corresponds to 17 m’s). But if we are searching a big database of newspaper and magazine articles, it’s fine to use an overinclusive regular expression because that ensures we’ll catch all relevant articles.

On the other hand, there are many situations where we want a regular expression that precisely matches the strings we are interested in.

Example

Suppose we want a regular expression that matches dates in November.

What’s wrong with [November|Nov.] [1-30]?
What should we write instead?

Recall that characters in square brackets mean “any single character among those listed.” So [November|Nov.] [1-30] is just another way of writing (N|o|v|e|m|b|e|r|\||N|o|v|\.) (1|2|3|0) and matches input strings like v 2 or m 1.
Inside brackets, all characters except ^ and - always stand for themselves. Outside brackets, we often need to say \| to mean the literal bar character, or \. to mean a literal period.
What should we write instead?

(November|Nov\.) ([1-9]|[12][0-9]|30)

Example

Here is a regular expression for legal C identifiers (variable names), which must begin with a letter or underscore, and can contain further letters, digits, and underscores (e.g., main or __Z6rotateiii)

[a-zA-Z_][a-zA-Z0-9_]*

Example

Here is a regular expression for legal Ada identifiers (variable names), which must begin with a letter, and can contain further letters, digits, and underscores, but cannot have two underscores in a row and cannot end in an underscore (e.g., count2 or first_nonzero_entry):

[a-zA-Z]([a-zA-Z0-9]|_[a-zA-Z0-9])*

Another possible solution would be:

[a-zA-Z](_?[a-zA-Z0-9]+)*

Example

The first standard for the C language (ANSI C) described the legal integer constants as follows:

An integer constant consisting of a sequence of digits is taken to be octal if it begins with 0 (digit zero), decimal otherwise. Octal constants do not contain the digits 8 or 9. A sequence of digits preceded by 0x or 0X (digit zero) is taken to be a hexadecimal integer. The hexadecimal digits include a or A through £ or F with values 10 through 15.

An integer constant may be suffixed by the letter u or U, to specify that it is unsigned. It may also be suffixed by the letter l or L to specify that it is long.

This prose description is arguably ambiguous. Can the L suffix come before the U suffix? When it refers to a sequence of digits preceded by 0X, can this sequence be an empty sequence?

We can make the description unambiguous by giving a regular expression, e.g.,

([1-9][0-9]+ | 0[0-7]* | 0[Xx][0-9a-fA-F]+)
([Ll][Uu]?|[Uu][Ll]?)?

(There are many other equivalent regular expressions as well.)

Warning 1: Syntax Variations

Different applications may use different syntaxes, e.g., which of | and \| means the literal bar character, and which means the regular-expression alternative operation.

For example, if we want a regular expression that matches the words bear, bead, bar, and bad, we must write:

b(ea|a)(r|d) in Perl or VS Code
b$ea|a$$r|d$ in the tool sed
b$ea\|a$$r\|d$ in the editor Emacs

The moral is that when you want to use regular expressions in a new context, you have to learn what syntax it expects when writing regular expressions.

Warning 2: Globs

The command line in most operating systems uses syntax that looks somewhat like regular expressions, but is slightly different (and more restrictive). The common name for this command-line syntax is “globbing”.

Example

Suppose we have a directory with the following files:

> ls
bad bag bar bead beg bear beer bug ear rag rear rug

Then:

> ls b*r
bar bear beer

(A star in a glob corresponds to .* in a regular expression.)

> ls b?ar
bear

(A question mark in a glob corresponds to .? in a regular expression.)

> ls [br]ear`
bear rear

(Bracket notation has the same meaning.)

> {`ls b{ea,a}{r,d}
bad bar bead bear

(Globs use a list of items in curly braces instead of |.)

> `ls {?,r?}ar`
bar ear rear

(We can nest glob notation.)

> ls b.g
ls: b.g: No such file or directory:

(In globs, a period stands for the period character.)