Strings
“There is geometry in the humming of the strings”
—Pythagoras
Alphabets
Definition
An alphabet is a nonempty finite set, usually denoted Σ.
Elements of the set Σ are called letters or
symbols or characters.
Here Σ is just a Greek letter being used for the name of a
particular set; it has nothing to do with summation.
Example
Here are some common examples of alphabets:
- {0,1} (the set of binary digits)
- {a,…,z} (the
set of lowercase ASCII letters)
- All ASCII characters
- All Unicode characters
Strings
Definition
A string over Σ is a finite, but possibly empty, sequence
of characters from Σ.
If the alphabet is obvious from context, we just say
string.
Notation: Dropping the Quotes
In Python, to write a string we need to use quotes, e.g., '37', or
"37". This lets us tell the difference between, e.g., the string
'37' and the integer 37.
In Theory of Computability, though, we are always working with strings,
and using quotation marks all the time quickly becomes tedious.
Therefore, when studying Computability we use the convention that we
omit all the quotation marks. For example, we would write
abc instead of "abc".
Is x a symbol or a string of length 1?
It’s true that if we’ve dropped the quotes, the notation x is
ambiguous; it could be a character from our alphabet Σ, or it
could be a string of length 1 containing that character.
In practice, the difference rarely matters. And in those very rare cases
where it does matter, we can tell from context which interpretation was
intended.
What about the empty string?
If we omit the quotes, then the empty string "" (a sequence of
zero characters from our alphabet) turns invisible, and we can’t see
whether there’s a string there at all! This is awkward, so for the
empty string we will use different notation.
We could continue to write "" as is done in many programming
languages, but the traditition on computability is to use a greek
letter. In this class, we will use the symbol ϵ to
stand for the empty string.
(Some books use λ instead of ϵ but there are better
uses for λ in theoretical computer science, as you will see in
the Programming Languages course.)
Note that ϵ is notation for a string (a sequence of
characters), rather than a character. Consequently, we can be sure that
ϵ∈Σ.
But what if my alphabet is {α,β,γ,δ,ϵ,…ω} ?
In the rare case where you want to use the Greek letter epsilon as a
character in your strings, you just need to choose a different symbol
(such as "") to represent the empty string.
Operations on Strings
Definition
If w and v represent strings, then
- wv is the result of concatenating w and v together.
- ∣w∣ is the length of w, the number of characters.
- wR is the reverse of w, the same characters in the opposite
order.
Example
Various properties of this operation immediately follow, e.g.,
- wϵ=w=ϵw
- ∣w∣=∣wR∣
- ∣wv∣=∣w∣+∣v∣
- (wv)R=vRwR
Definition
A formal language (over Σ) is a set of strings (over
Σ)
Formal languages can be empty, finite or infinite, but (by definition of
“string”) they only contain finite strings.
Example
- The set Σ∗ of all strings over Σ is a language.
- The empty set ∅ is a language.
- The set {ϵ} is a different
language (i.e., the nonempty set of strings whose only member is the
empty string).
- We can think of “all grammatically valid English sentences” as a
language, but unless we can unambiguously specify which sentences
are valid English and which are not, English is not a formal
language.
Since all the languages we’ll discuss in Computability Theory are
formal (i.e., mathematically precise), we will often just say
language to mean a set of strings.
Operations on Languages
Because languages are sets, we can union two languages (L∪M)
intersect two languages (L∩M), or take the set-difference
(L∖M). But since they are more specifically sets of strings,
we can define additional operations.
Definition
If L and M are languages, then
- LM:={ wv ∣ w∈L, v∈M }
- L1:=L
L2:=LL={ wv ∣ w∈L,v∈L }
L3:=LLL={ wvu ∣ w∈L,v∈L,u∈L }
In general, Ln+1:=LLn where we
define L0:={ϵ}.
- L∗:=L0∪L1∪L2∪L3∪⋯
I.e., L∗ is all the strings we can construct by picking strings
from L zero or more times (but finitely many) and concatenating
them.
- L+:=L1∪L2∪L3∪⋯
I.e., L+ is all the strings we can construct by picking strings
from L one or more times (but finitely many) and concatenating
them.
Example
If L={ba,da} and M={da,rk,ϵ}, then
- L∪M={ba,da,rk,ϵ}
- L∩M={da}
- L∖M={ba}
- LM={bada,bark,ba,dada,dark,da}
- L3={bababa,babada,badaba,…,dadaba,dadada}
- L∗={ϵ,ba,da,baba,…,dada,bababa,…,badadaba,…}
- L+={ba,da,baba,…,dada,bababa,…,badadaba,…}
- M∗={ϵ,da,rk,dada,…,rkrk,dadada,…}
- M+={ϵ,da,rk,dada,…,rkrk,dadada,…}
Note that ϵ∈L∗ and ϵ∈M∗ by definition of the
star operator. On the other hand, ϵ∈L+ (because
there’s no way to construct an empty string by concatenating one or
more elements of L), but ϵ∈M+ (because we can
construct an empty string by concatenating one or more strings from M,
as long as those strings are all ϵ).
Example
Given the definitions of the operations, the following equivalences all
hold.
-
L∅=∅
There is no way to concatenate something from L with something
from the empty set ∅.
-
L{ϵ}=L
-
{ϵ}∗={ϵ}
Appending zero or more empty strings only yields the empty string.
-
{ϵ}+={ϵ}
Appending one or more empty strings only yields the empty string.
-
∅∗={ϵ}
Appending zero strings from ∅ yields an empty string by
definition.
-
∅+=∅
There are no ways to concatenate one or more strings taken from the
empty set.
-
(L∪M)N=LN∪MN
-
(L∗)∗=L∗