Phase 1: Exploring the Interactive AI Demo
Machine Learning Basics (any% speedrun)
This isn't a machine learning class (take CS 158 for that!), so we won't go into excruciating detail here. But it's still useful to establish some fundamental terminology, so we can talk about the machine learning parts of the homework more easily. We will bold key terms that will come up throughout the assignment.
The machine learning system you're building in this homework isn't a full-on chatbot / language model like ChatGPT or Gemini. Instead, it's a simpler type of model known as a classifier. Given some input, a classifier returns what class (from a fixed set of "classes") that input most likely belongs to.
Caution! Here, "class" doesn't refer to C++ classes. Instead, think of it as a synonym for "category". In our case, the classes (categories) are Pokemon names and software/app names.
For that reason, on my planet we refer to these models as "categorizers".
The C++ and machine learning parts of the homework are kept pretty separate, so it should always be clear what the word "class" is referring to in context.
But if you get lost, don't be afraid to ask an instructor or grutor!
Most classifiers do not read the raw input directly. Instead, they simplify the input by breaking it down into a vector of features. A "feature" is just some property of the input; exactly what properties are important depends on the specific task and is often decided by humans. For example, when you are working with text (like we're doing in this homework), you could use individual letters as features. So the string "pikachu" would become the vector ['p', 'i', 'k', 'a', 'c', 'h', 'u']. Or, if you want to be more sophisticated, you could use \( n \)-letter sequences as features. For instance, with \( n = 3 \), "pikachu" would become the vector ["pik", "ika", "kac", "chu"] (this is the strategy we'll use for this homework).
The key to machine learning is that, instead of manually writing rules that say "this feature corresponds to this class", a classifier instead learns the relationship between features and classes based on data. This process is known as training: we take the classifier, show it a training set of examples where each example has a label saying what class it belongs to, and the classifier learns the underlying relationship by computing probabilities.
Finally, once we have a trained model, we usually want to know how good it is—that is, we want to evaluate it. One way to do this is by showing the model a new test set of examples it hasn't previously seen, asking it to guess the label for each example, and then computing what fraction of examples it guesses correctly. This metric is known as accuracy.
My head is spinning! Do we need to implement all of this???
Don't worry; like we said, CS 70 isn't a machine learning class. We've provided code that implements the core machine learning process—feature extraction, training, and evaluation—for you!
Building the Code
After cloning the homework repository and opening it in VS Code, open the terminal pane and run make (or cs70-make -H) to build the provided code.
Running the Demo
Run the provided machine learning demo:
./nbclassifier-demo -t dataset-train.txt -e dataset-test.txt
This command trains a classifier using the provided file dataset-train.txt, and evaluates it on dataset-test.txt. You should see it print the following:
Test set accuracy: 50.4%
Hay! 50.4% accuracy seems...kind of terrible? We only have two classes (Pokemon vs software), so 50.4% is essentially a coin flip...
That's right—because we haven't actually implemented the "learning" part of machine learning yet!
To get a clearer sense of what's happening here, you can run the program in interactive mode by removing the -e dataset-test.txt part of the command:
./nbclassifier-demo -t dataset-train.txt
This still trains the classifier on dataset-train.txt, but instead of running evaluation, it drops you into an interactive environment where you can "talk to" the classifier. In this environment, you can enter any text you like and the classifier will (supposedly) tell you whether the text sounds more like a Pokemon or more like software. You can try playing around with it...but what you'll quickly find is that it just always guesses "software" no matter what you type in! This is because we have not yet implemented any actual machine learning algorithm, and the demo is currently instead using a "dummy classifier" that just parrots the first label it saw in the training set (which happened to be "software"). Your job in this homework is to fix this problem!
Motivation: How Hash Tables Can Help
How can our classifier learn the relationship between features and classes? There are several approaches, but the most straightforward is by computing probabilities.
Probability is scary! I haven't taken Math 62!
Don't worry, we'll try to keep things fairly high level here.
Intuitively, given an input that contains the feature "chu" (for example), we want to know: is "chu" more likely to appear in a Pokemon name, or in a software name? In other words, we need the probability of "chu" appearing in Pokemon names and in software names.
In mathematical notation, we could write these as the conditional probabilities \( P(\mathrm{chu}|\mathrm{pokemon}) \) and \( P(\mathrm{chu}|\mathrm{software}) \).
How can we compute these probabilities? The same way we would compute any probability: by counting! For instance, we can estimate the probability of "chu" appearing in Pokemon names by taking all the Pokemon names in the training set and counting how often the feature "chu" appears. To do this efficiently, we need a data structure that lets us look up a specific key, like "chu", and get an associated count.
In Python, I would use a dictionary for this. Like, if
pokeCountswas adict, I could store the count for"chu"inpokeCounts["chu"].
Indeed, the kind of data structure we're describing is a dictionary...and as you know from the lessons, in Python and some other languages, dictionaries are implemented using hash tables.
...Actually, to make things a bit easier, we don't need a full dictionary implementation, A Python-style dictionary can map from a key to any type of value, but we're only interested in keeping counts, so we only need the ability to map from a key to an integer (representing the count of that key). This more limited type of dictionary is sometimes referred to as a multiset—and this is exactly what you're going to be implementing!
The name "multiset" comes from the fact that this data structure mostly acts like a set, except that you can insert a key multiple times (which is equivalent to incrementing its count).
But in Python, we call this data structure a
Counter, which maybe some of you have seen before.
(When logged in, completion status appears here.)