CS 70

Phase 1: Exploring the Interactive AI Demo

Machine Learning Basics (any% speedrun)

This isn't a machine learning class (take CS 158 for that!), so we won't go into excruciating detail here. But it's still useful to establish some fundamental terminology, so we can talk about the machine learning parts of the homework more easily. We will bold key terms that will come up throughout the assignment.

The machine learning system you're building in this homework isn't a full-on chatbot / language model like ChatGPT or Gemini. Instead, it's a simpler type of model known as a classifier. Given some input, a classifier returns what class (from a fixed set of "classes") that input most likely belongs to.

  • LHS Cow speaking

    Caution! Here, "class" doesn't refer to C++ classes. Instead, think of it as a synonym for "category". In our case, the classes (categories) are Pokemon names and software/app names.

  • Alien speaking

    For that reason, on my planet we refer to these models as "categorizers".

  • LHS Cow speaking

    The C++ and machine learning parts of the homework are kept pretty separate, so it should always be clear what the word "class" is referring to in context.

  • RHS Cow speaking

    But if you get lost, don't be afraid to ask an instructor or grutor!

Most classifiers do not read the raw input directly. Instead, they simplify the input by breaking it down into a vector of features. A "feature" is just some property of the input; exactly what properties are important depends on the specific task and is often decided by humans. For example, when you are working with text (like we're doing in this homework), you could use individual letters as features. So the string "pikachu" would become the vector ['p', 'i', 'k', 'a', 'c', 'h', 'u']. Or, if you want to be more sophisticated, you could use \( n \)-letter sequences as features. For instance, with \( n = 3 \), "pikachu" would become the vector ["pik", "ika", "kac", "chu"] (this is the strategy we'll use for this homework).

The key to machine learning is that, instead of manually writing rules that say "this feature corresponds to this class", a classifier instead learns the relationship between features and classes based on data. This process is known as training: we take the classifier, show it a training set of examples where each example has a label saying what class it belongs to, and the classifier learns the underlying relationship by computing probabilities.

Finally, once we have a trained model, we usually want to know how good it is—that is, we want to evaluate it. One way to do this is by showing the model a new test set of examples it hasn't previously seen, asking it to guess the label for each example, and then computing what fraction of examples it guesses correctly. This metric is known as accuracy.

  • Hedgehog speaking

    My head is spinning! Do we need to implement all of this???

  • LHS Cow speaking

    Don't worry; like we said, CS 70 isn't a machine learning class. We've provided code that implements the core machine learning process—feature extraction, training, and evaluation—for you!

Building the Code

After cloning the homework repository and opening it in VS Code, open the terminal pane and run make (or cs70-make -H) to build the provided code.

Running the Demo

Run the provided machine learning demo:

./nbclassifier-demo -t dataset-train.txt -e dataset-test.txt

This command trains a classifier using the provided file dataset-train.txt, and evaluates it on dataset-test.txt. You should see it print the following:

Test set accuracy: 50.4%
  • Horse speaking

    Hay! 50.4% accuracy seems...kind of terrible? We only have two classes (Pokemon vs software), so 50.4% is essentially a coin flip...

  • LHS Cow speaking

    That's right—because we haven't actually implemented the "learning" part of machine learning yet!

To get a clearer sense of what's happening here, you can run the program in interactive mode by removing the -e dataset-test.txt part of the command:

./nbclassifier-demo -t dataset-train.txt

This still trains the classifier on dataset-train.txt, but instead of running evaluation, it drops you into an interactive environment where you can "talk to" the classifier. In this environment, you can enter any text you like and the classifier will (supposedly) tell you whether the text sounds more like a Pokemon or more like software. You can try playing around with it...but what you'll quickly find is that it just always guesses "software" no matter what you type in! This is because we have not yet implemented any actual machine learning algorithm, and the demo is currently instead using a "dummy classifier" that just parrots the first label it saw in the training set (which happened to be "software"). Your job in this homework is to fix this problem!

Motivation: How Hash Tables Can Help

How can our classifier learn the relationship between features and classes? There are several approaches, but the most straightforward is by computing probabilities.

  • Hedgehog speaking

    Probability is scary! I haven't taken Math 62!

  • LHS Cow speaking

    Don't worry, we'll try to keep things fairly high level here.

Intuitively, given an input that contains the feature "chu" (for example), we want to know: is "chu" more likely to appear in a Pokemon name, or in a software name? In other words, we need the probability of "chu" appearing in Pokemon names and in software names.

  • Rabbit speaking

    In mathematical notation, we could write these as the conditional probabilities \( P(\mathrm{chu}|\mathrm{pokemon}) \) and \( P(\mathrm{chu}|\mathrm{software}) \).

How can we compute these probabilities? The same way we would compute any probability: by counting! For instance, we can estimate the probability of "chu" appearing in Pokemon names by taking all the Pokemon names in the training set and counting how often the feature "chu" appears. To do this efficiently, we need a data structure that lets us look up a specific key, like "chu", and get an associated count.

  • Python speaking

    In Python, I would use a dictionary for this. Like, if pokeCounts was a dict, I could store the count for "chu" in pokeCounts["chu"].

Indeed, the kind of data structure we're describing is a dictionary...and as you know from the lessons, in Python and some other languages, dictionaries are implemented using hash tables.

...Actually, to make things a bit easier, we don't need a full dictionary implementation, A Python-style dictionary can map from a key to any type of value, but we're only interested in keeping counts, so we only need the ability to map from a key to an integer (representing the count of that key). This more limited type of dictionary is sometimes referred to as a multiset—and this is exactly what you're going to be implementing!

  • LHS Cow speaking

    The name "multiset" comes from the fact that this data structure mostly acts like a set, except that you can insert a key multiple times (which is equivalent to incrementing its count).

  • Python speaking

    But in Python, we call this data structure a Counter, which maybe some of you have seen before.

To Complete This Part of the Assignment…

You'll know you're done with this part of the assignment when you've done all of the following:

(When logged in, completion status appears here.)