Strings

“There is geometry in the humming of the strings”
—Pythagoras

Alphabets

Definition

An alphabet is a nonempty finite set, usually denoted Σ\Sigma.

Elements of the set Σ\Sigma are called letters or symbols or characters.

Here Σ\Sigma is just a Greek letter being used for the name of a particular set; it has nothing to do with summation.

Example

Here are some common examples of alphabets:

  • {0,1}\{0,1\} (the set of binary digits)
  • {a,,z}\{\mathtt{a},\ldots,\mathtt{z}\} (the set of lowercase ASCII letters)
  • All ASCII characters
  • All Unicode characters

Strings

Definition

A string over Σ\Sigma is a finite, but possibly empty, sequence of characters from Σ\Sigma.

If the alphabet is obvious from context, we just say string.

Notation: Dropping the Quotes

In Python, to write a string we need to use quotes, e.g., '37', or "37". This lets us tell the difference between, e.g., the string '37' and the integer 37.

In Theory of Computability, though, we are always working with strings, and using quotation marks all the time quickly becomes tedious. Therefore, when studying Computability we use the convention that we omit all the quotation marks. For example, we would write abc\mathtt{abc} instead of "abc".

Is x a symbol or a string of length 1?

It’s true that if we’ve dropped the quotes, the notation x is ambiguous; it could be a character from our alphabet Σ\Sigma, or it could be a string of length 1 containing that character.

In practice, the difference rarely matters. And in those very rare cases where it does matter, we can tell from context which interpretation was intended.

What about the empty string?

If we omit the quotes, then the empty string "" (a sequence of zero characters from our alphabet) turns invisible, and we can’t see whether there’s a string there at all! This is awkward, so for the empty string we will use different notation.

We could continue to write "" as is done in many programming languages, but the traditition on computability is to use a greek letter. In this class, we will use the symbol ϵ\epsilon to stand for the empty string.

(Some books use λ\lambda instead of ϵ\epsilon but there are better uses for λ\lambda in theoretical computer science, as you will see in the Programming Languages course.)

Note that ϵ\epsilon is notation for a string (a sequence of characters), rather than a character. Consequently, we can be sure that ϵ∉Σ\epsilon\not\in\Sigma.

But what if my alphabet is {α,β,γ,δ,ϵ,ω}\{\alpha, \beta, \gamma, \delta, \epsilon, \ldots \omega\} ?

In the rare case where you want to use the Greek letter epsilon as a character in your strings, you just need to choose a different symbol (such as "") to represent the empty string.

Operations on Strings

Definition

If ww and vv represent strings, then

  • wvwv is the result of concatenating ww and vv together.
  • w|w| is the length of ww, the number of characters.
  • wRw^R is the reverse of ww, the same characters in the opposite order.

Example

Various properties of this operation immediately follow, e.g.,

  • wϵ=w=ϵww\epsilon = w = \epsilon w
  • w=wR|w| = |w^R|
  • wv=w+v|wv| = |w| + |v|
  • (wv)R=vRwR(wv)^R= v^R w^R

Formal Languages

Definition

A formal language (over Σ\Sigma) is a set of strings (over Σ\Sigma)

Formal languages can be empty, finite or infinite, but (by definition of “string”) they only contain finite strings.

Example

  • The set Σ\Sigma^* of all strings over Σ\Sigma is a language.
  • The empty set \emptyset is a language.
  • The set {ϵ}\{\epsilon\} is a different language (i.e., the nonempty set of strings whose only member is the empty string).
  • We can think of “all grammatically valid English sentences” as a language, but unless we can unambiguously specify which sentences are valid English and which are not, English is not a formal language.

Since all the languages we’ll discuss in Computability Theory are formal (i.e., mathematically precise), we will often just say language to mean a set of strings.

Operations on Languages

Because languages are sets, we can union two languages (LML\cup M) intersect two languages (LML\cap M), or take the set-difference (LML\setminus M). But since they are more specifically sets of strings, we can define additional operations.

Definition

If LL and MM are languages, then

  • LM:={ wv  wL, vM  }LM := \{\ wv\ |\ w\in L,\ v\in M\ \ \}
  • L1:=LL^1 := L
    L2:=LL={ wv  wL,vL }L^2 := LL = \{\ wv\ |\ w\in L, v\in L\ \}
    L3:=LLL={ wvu  wL,vL,uL }L^3 := LLL = \{\ wvu\ |\ w\in L, v\in L, u\in L\ \}
    In general, Ln+1:=LLnL^{n+1} := L\,L^n where we define L0:={ϵ}L^0 := \{\epsilon\}.
  • L:=L0L1L2L3L^* := L^0 \cup L^1 \cup L^2 \cup L^3 \cup \cdots
    I.e., LL^* is all the strings we can construct by picking strings from LL zero or more times (but finitely many) and concatenating them.
  • L+:=L1L2L3L^+ := L^1 \cup L^2 \cup L^3 \cup \cdots
    I.e., L+L^+ is all the strings we can construct by picking strings from LL one or more times (but finitely many) and concatenating them.

Example

If L={ba,da}L = \{\mathtt{ba}, \mathtt{da}\} and M={da,rk,ϵ}M = \{\mathtt{da}, \mathtt{rk}, \epsilon\}, then

  • LM={ba,da,rk,ϵ}L\cup M = \{\mathtt{ba}, \mathtt{da}, \mathtt{rk}, \epsilon\}
  • LM={da}L\cap M = \{\mathtt{da}\}
  • LM={ba}L\setminus M = \{\mathtt{ba}\}
  • LM={bada,bark,ba,dada,dark,da}LM = \{\mathtt{bada}, \mathtt{bark}, \mathtt{ba}, \mathtt{dada}, \mathtt{dark}, \mathtt{da}\}
  • L3={bababa,babada,badaba,,dadaba,dadada}L^3 = \{\mathtt{bababa}, \mathtt{babada}, \mathtt{badaba}, \ldots, \mathtt{dadaba}, \mathtt{dadada}\}
  • L={ϵ,ba,da,baba,,dada,bababa,,badadaba,}L^{*} = \{\epsilon, \mathtt{ba}, \mathtt{da}, \mathtt{baba}, \ldots, \mathtt{dada}, \mathtt{bababa}, \ldots, \mathtt{badadaba}, \ldots\}
  • L+={ba,da,baba,,dada,bababa,,badadaba,}L^+ = \{\mathtt{ba}, \mathtt{da}, \mathtt{baba}, \ldots, \mathtt{dada}, \mathtt{bababa}, \ldots, \mathtt{badadaba}, \ldots\}
  • M={ϵ,da,rk,dada,,rkrk,dadada,}M^{*} = \{\epsilon, \mathtt{da}, \mathtt{rk}, \mathtt{dada}, \ldots, \mathtt{rkrk}, \mathtt{dadada}, \ldots\}
  • M+={ϵ,da,rk,dada,,rkrk,dadada,}M^+ = \{\epsilon, \mathtt{da}, \mathtt{rk}, \mathtt{dada}, \ldots, \mathtt{rkrk}, \mathtt{dadada}, \ldots\}

Note that ϵL\epsilon\in L^* and ϵM\epsilon\in M^* by definition of the star operator. On the other hand, ϵ∉L+\epsilon\not\in L^+ (because there’s no way to construct an empty string by concatenating one or more elements of LL), but ϵM+\epsilon\in M^+ (because we can construct an empty string by concatenating one or more strings from MM, as long as those strings are all ϵ\epsilon).

Example

Given the definitions of the operations, the following equivalences all hold.

  • L=L\emptyset = \emptyset

    There is no way to concatenate something from LL with something from the empty set \emptyset.

  • L{ϵ}=LL\{\epsilon\} = L

  • {ϵ}={ϵ}\{\epsilon\}^{*} = \{\epsilon\}

    Appending zero or more empty strings only yields the empty string.

  • {ϵ}+={ϵ}\{\epsilon\}^+ = \{\epsilon\}

    Appending one or more empty strings only yields the empty string.

  • ={ϵ}\emptyset^{*} = \{\epsilon\}

    Appending zero strings from \emptyset yields an empty string by definition.

  • +=\emptyset^+ = \emptyset

    There are no ways to concatenate one or more strings taken from the empty set.

  • (LM)N=LNMN(L\cup M)N = LN \cup MN

  • (L)=L(L^*)^* = L^*