6.2 String Theory

An alphabet is a nonempty finite set, usually denoted \(\Sigma\). > >Elements of the set \(\Sigma\) are called letters or symbols or characters.

Here \(\Sigma\) is just a Greek letter being used for the name of a particular set; it has nothing to do with summation.

Here are some common examples of alphabets: > >- \(\{0,1\}\) (the set of binary digits) >- \(\{\mathtt{a},\ldots,\mathtt{z}\}\) (the set of lowercase ASCII letters) >- All ASCII characters >- All Unicode characters

Strings

A string over \(\Sigma\) is a finite, but possibly empty, sequence of characters from \(\Sigma\). > If the alphabet is obvious from context, we just say string.

Notation: Dropping the Quotes

In Python, to write a string we need to use quotes, e.g., '37', or "37". This lets us tell the difference between, e.g., the string '37' and the integer 37.

In Theory of Computability, though, we are always working with strings, and using quotation marks all the time quickly becomes tedious. Therefore, when studying Computability we use the convention that we omit all the quotation marks. For example, we would write \(\mathtt{abc}\) instead of "abc". #### Is x a symbol or a string of length 1?

It’s true that if we’ve dropped the quotes, the notation x is ambiguous; it could be a character from our alphabet \(\Sigma\), or it could be a string of length 1 containing that character.

In practice, the difference rarely matters. And in those very rare cases where it does matter, we can tell from context which interpretation was intended.

What about the empty string?

If we omit the quotes, then the empty string "" (a sequence of zero characters from our alphabet) turns invisible, and we can’t see whether there’s a string there at all! This is awkward, so for the empty string we will use different notation.

We could continue to write "" as is done in many programming languages, but the tradition in computability is to use a greek letter. In this class, we will use the symbol \(\epsilon\) to stand for the empty string.

(Some books use \(\lambda\) instead of \(\epsilon\) but there are better uses for \(\lambda\) in theoretical computer science, as you will see in the Programming Languages course.)

Note that \(\epsilon\) is notation for a string (a sequence of characters), rather than a character. Consequently, we can be sure that \(\epsilon\not\in\Sigma\). #### But what if my alphabet is \(\{\alpha, \beta, \gamma, \delta, \epsilon, \ldots \omega\}\) ?

In the rare case where you want to use the Greek letter epsilon as a character in your strings, you just need to choose a different symbol (such as "") to represent the empty string. ### Operations on Strings

If \(w\) and \(v\) represent strings, then > >- \(wv\) is the result of concatenating \(w\) and \(v\) together. >- \(|w|\) is the length of \(w\), the number of characters. >- \(w^R\) is the reverse of \(w\), the same characters in the opposite order.

Various properties of this operation immediately follow, e.g., > >- \(w\epsilon = w = \epsilon w\) >- \(|w| = |w^R|\) >- \(|wv| = |w| + |v|\) >- \((wv)^R= v^R w^R\)

Formal Languages

A formal language (over \(\Sigma\)) is a set of strings (over \(\Sigma\)) > Formal languages can be empty, finite or infinite, but (by definition of “string”) they only contain finite strings.

Example

The set \(\Sigma^*\) of all strings over \(\Sigma\) is a language.
The empty set \(\emptyset\) is a language.
The set \(\{\epsilon\}\) is a different language (i.e., the nonempty set of strings whose only member is the empty string).
We can think of “all grammatically valid English sentences” as a language, but unless we can unambiguously specify which sentences are valid English and which are not, English is not a formal language.

Since all the languages we’ll discuss in Computability Theory are formal (i.e., mathematically precise), we will often just say language to mean a set of strings.

Operations on Languages

Because languages are sets, we can union two languages (\(L\cup M\)) intersect two languages (\(L\cap M\)), or take the set-difference (\(L\setminus M\)). But since they are more specifically sets of strings, we can define additional operations.

If \(L\) and \(M\) are languages, then > >- \(LM := \{\ wv\ |\ w\in L,\ v\in M\ \ \}\) >- \(L^1 := L\)
>\(L^2 := LL = \{\ wv\ |\ w\in L, v\in L\ \}\)
>\(L^3 := LLL = \{\ wvu\ |\ w\in L, v\in L, u\in L\ \}\)
>In general, \(L^{n+1} := L\,L^n\) where we define \(L^0 := \{\epsilon\}\). >- \(L^* := L^0 \cup L^1 \cup L^2 \cup L^3 \cup \cdots\)
> i.e., \(L^*\) is all the strings we can construct by picking strings from \(L\) zero or more times (but finitely many) and concatenating them. >- \(L^+ := L^1 \cup L^2 \cup L^3 \cup \cdots\)
> i.e., \(L^+\) is all the strings we can construct by picking strings from \(L\) one or more times (but finitely many) and concatenating them.

If \(L = \{\mathtt{ba}, \mathtt{da}\}\) and \(M = \{\mathtt{da}, \mathtt{rk}, \epsilon\}\), then > >- \(L\cup M = \{\mathtt{ba}, \mathtt{da}, \mathtt{rk}, \epsilon\}\) >- \(L\cap M = \{\mathtt{da}\}\) >- \(L\setminus M = \{\mathtt{ba}\}\) >- \(LM = \{\mathtt{bada}, \mathtt{bark}, \mathtt{ba}, \mathtt{dada}, \mathtt{dark}, \mathtt{da}\}\) >- \(L^3 = \{\mathtt{bababa}, \mathtt{babada}, \mathtt{badaba}, \ldots, \mathtt{dadaba},\mathtt{dadada}\}\) >- \(L^{*} = \{\epsilon, \mathtt{ba}, \mathtt{da}, \mathtt{baba}, \ldots, \mathtt{dada}, \mathtt{bababa}, \ldots\}\) >- \(L^+ = \{\mathtt{ba}, \mathtt{da}, \mathtt{baba}, \ldots, \mathtt{dada}, \mathtt{bababa},\ldots\}\) >- \(M^{*} = \{\epsilon, \mathtt{da}, \mathtt{rk}, \mathtt{dada}, \ldots, \mathtt{rkrk}, \mathtt{dadada}, \ldots\}\) >- \(M^+ = \{\epsilon, \mathtt{da}, \mathtt{rk}, \mathtt{dada}, \ldots, \mathtt{rkrk}, \mathtt{dadada}, \ldots\}\) > Note that \(\epsilon\in L^*\) and \(\epsilon\in M^*\) by definition of the star operator. On the other hand, \(\epsilon\not\in L^+\) (because there’s no way to construct an empty string by concatenating one or more elements of \(L\)), but \(\epsilon\in M^+\) (because we can construct an empty string by concatenating one or more elements of \(M\), as long as those elements are \(\epsilon\)).

Given the definitions of the operations, the following equivalences all hold. > >- \(L\emptyset = \emptyset\) > > There is no way to concatenate something from \(L\) with something from \(\emptyset\). > >- \(L\{\epsilon\} = L\) >- \(\{\epsilon\}^{*} = \{\epsilon\}\) > > Appending zero or more empty strings only yields the empty string. > >- \(\{\epsilon\}^+ = \{\epsilon\}\) > > Appending one or more empty strings only yields the empty string. > >- \(\emptyset^{*} = \{\epsilon\}\) > >Appending zero strings from \(\emptyset\) yields an empty string by definition. > >- \(\emptyset^+ = \emptyset\) > > There are no ways to concatenate one or more strings taken from the empty set. > >- \((L\cup M)N = LN \cup MN\) >- \((L^*)^* = L^*\)