CS 70

Phase 8: Using Your HashMultiset<T> to Finish the Machine Learning Code

At this point, you should have a fully working HashMultiset<T> implementation. So now, at long last, we can return to the machine learning demo code and use your HashMultiset<T> to implement the actual, well, learning part of the code!

In this phase, we'll mainly be working with the files nbclassifier.hpp and nbclassifier.cpp, which together define an NBClassifier class. You will see that the vast majority of the code is already provided for you. There are only two TODOs left for you to fill in: one in the fit function (which implements training) and one in the transform function (which implements running the trained model and is called in both evaluation mode and interactive mode).

  • LHS Cow speaking

    You don't need to spend too much time trying to understand the machine learning code (again, CS 70 isn't a machine learning course).

  • RHS Cow speaking

    But you should at least understand the NBClassifier public interface so you know what you're working with.

Both TODOs relate to the part of the algorithm that counts features and computes probabilities, because that's where a multiset comes in handy. Specifically, NBClassifier contains a private member variable classFeatureCounts_ whose type is HashMultiset<std::string>. For the TODOs, we basically just stripped out all the code that uses classFeatureCounts_ and we're having you fill it back in.

Background: What Probabilities are we Computing?

  • LHS Cow speaking

    To avoid overwhelming you with math, here we'll only explain what probabilities are needed for the machine learning algorithm to work.

  • RHS Cow speaking

    But for students who are curious about the algorithm itself, which we don't cover here, we've provided a Deeper Dive section that explains it in more detail.

Recall from Phase 1 that our machine learning approach is based on the probability of each feature in a given class. For example, the algorithm needs to know: "how likely is the feature "eon" to appear in a Pokemon name?" and also "how likely is the feature "eon" to appear in a software/app/tech-startup name?"

Suppose you've counted how many times "eon" appears in all Pokemon names. How do you then turn that into a probability? Well, you need to know how that count compares to the count of all other features in Pokemon names. In other words, you should divide this count by the sum of all the counts:

$$ P(\mathrm{eon}|\mathrm{Pokemon}) = \frac{\text{# of times "eon" appears in all Pokemon names}}{\text{Sum of the counts of all features in all Pokemon names}} $$

More generally, for any feature \( f \in F \) (where \( F \) is the set of all features) and class \( c \):

$$ P(f|c) = \frac{\text{# of times } f \text{ appears in examples labeled } c}{\sum_{f' \in F}\text{# of times } f' \text{ appears in examples labeled } c} $$
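For a concrete illustration (with made-up numbers): if "eon" appeared 40 times across all Pokemon names, and the counts of all features in Pokemon names summed to 2000, then

$$ P(\mathrm{eon}|\mathrm{Pokemon}) = \frac{40}{2000} = 0.02 $$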

So, your job in this phase is twofold: store the necessary counts (as seen in the equation above) in classFeatureCounts_, and then implement a function to turn those counts into probabilities (using the above equation).

  • Horse speaking

    Hay, I see a problem! You said we only have one multiset, classFeatureCounts_. In that case, how do we know whether a count is for the "Pokemon" class or for the "software" class?

  • LHS Cow speaking

    Great question!

Perhaps the most "proper" way to keep track of the counts is by keeping a vector of multisets, one for each class. But we didn't want to overcomplicate the code, since that would distract from the learning goal of getting familiar with using hash tables.

  • Hedgehog speaking

    Phew! I was getting scared even just imagining a "vector of multisets"!

Instead, we'll use a bit of a trick: modifying the key itself before we insert it. One way to distinguish between "'eon' in Pokemon" and "'eon' in software" is by adding the class label as a suffix. So, we'll use the key "eon|Pokemon" to store the count of the feature "eon" in the "Pokemon" class (and likewise "eon|software" for the "software" class). We've provided a helper function formatFeature(feature, class) that does this suffixing for you. For example, calling formatFeature("eon", "software") returns "eon|software".

We'll use a similar trick to streamline computing the denominator in the above equation. Naively, taking the sum of the counts of all features requires looping over all features and summing their counts (that is, after all, how the equation is written). But that's a lot of work and again makes the code more complicated. The trick is that we can iteratively update the total while we're counting features! Every time we increment the count of a feature f|c, we can ALSO increment the count of a "fake" feature "ANY"|c. By doing this, when we're done counting features, the key "ANY|Pokemon" will contain the sum total of feature counts in the "Pokemon" class, and likewise for "software" with the key "ANY|software".

  • LHS Cow speaking

    You may want to take a moment to convince yourself that this is true.

Now that you know what to count and what to compute, let's actually do it!

Your Task, Part A: Count Features in fit

The fit function already contains a loop over features feat and class labels label. In the body of the loop there is a TODO where you need to use these variables to increment the necessary counts in classFeatureCounts_. Specifically, as described above, you should perform the following insertions:

  • Increment the count of the feature feat in the class label
  • Increment the count of the "fake feature" "ANY" in the class label

You should use the helper function formatFeature to properly format the keys for insertion.

In addition, you will also need to perform two other insertions not related to the previously discussed equation (if you want to know why these are needed, we give a detailed explanation in the Deeper Dive below).

  • Increment the count of the class label. The algorithm will use this to determine how much bias there should be towards a particular class.
  • Increment the count of the feature feat without any class suffix (i.e., just insert feat directly as a key). This is actually only for bookkeeping purposes (knowing what features are present makes some other parts of the code easier).

In total, you should only need to write four lines of code to finish this TODO.

Your Task, Part B: Turn the Counts into Probabilities

The function logProb, which is currently unimplemented, is supposed to implement the equation we showed you earlier.

The slight catch is that in machine learning we typically do all calculations in log space. This is because the probabilities involved are often very small, and we want to avoid floating-point underflow. Going to log space solves this because the log of a very small number is a very big negative number.

In log space, division becomes subtraction. So we get:

$$ \log P(f|c) = \log(\text{# of times } f \text{ appears in examples labeled } c) - \log(\sum_{f' \in F}\text{# of times } f' \text{ appears in examples labeled } c) $$

But there's a problem here: what if \( f \) never appears in \( c \) (i.e., it has a count of 0)? In that case, you would take the log of 0, which is against the rules! We fix this through a method called smoothing (specifically, add-one or Laplace smoothing): we pretend every feature has been seen once more than it actually was, so anything with a count of 0 gets treated as 1. We could just add 1 to the numerator, but if we don't modify the denominator as well, the result will no longer be a valid probability. We'll skip the proof here, but it turns out the correct thing to add to the denominator is the total number of features. So, letting \( N \) represent the total number of features, the thing logProb actually needs to return is:

$$ \log(1 + \text{ # of times } f \text{ appears in examples labeled } c) - \log(N + \sum_{f' \in F}\text{# of times } f' \text{ appears in examples labeled } c) $$

  • Duck speaking

    But how do I know what \( N \) is?

  • LHS Cow speaking

    Don't worry, the given code calculates it for you. There's a private variable numFeatures_ that the given code correctly updates, and you can use it in your logProb implementation.

  • RHS Cow speaking

    As for the other terms in the equation, the neat thing is you can directly look those up in classFeatureCounts_.

  • LHS Cow speaking

    In particular, remember that for the denominator, you don't need to actually loop to calculate the sum; instead, take advantage of the "fake feature" "ANY".

Your Task, Part C: Change a Setting in nbclassifier-demo.cpp

Now that NBClassifier is fully implemented, we can tell the demo code to use it instead of the basic dummy classifier. Near the top of nbclassifier-demo.cpp, find the following line:

#define DEMO_USE_NBCLASSIFIER 0

And change the 0 to a 1, so the line now reads:

#define DEMO_USE_NBCLASSIFIER 1

Now Run the Code!

Once you've made all your changes, run make to recompile the code. Now try evaluation mode again:

./nbclassifier-demo -t dataset-train.txt -e dataset-test.txt

If everything went right, this should print:

Test set accuracy: 69%

CAUTION! If the demo prints something other than 69%, you should double-check your code. Although we will not be grading the correctness of the machine learning code (it's mainly only used for the written questions), a result that is extremely far off could indicate problems with your HashMultiset.

Finally, have some fun with the demo! You can try interactive mode again:

./nbclassifier-demo -t dataset-train.txt

And this time you should find that the model sometimes guesses "Pokemon" and sometimes guesses "software". It even gives some explanation for its "reasoning"! We'll also do some guided exploration in the written questions.

Deeper Dive: What Machine Learning Algorithm are we Using?

  • RHS Cow speaking

    This deeper dive is optional and is only meant to give more background info on the parts of the machine learning code that you didn't touch.

  • LHS Cow speaking

    You may enjoy reading this if you're curious about machine learning and want a preview of courses like CS 158. But if not, or if you're short on time, you may skip to the bottom of the page to finish this part of the assignment.

The algorithm that NBClassifier implements is known as Naive Bayes (hence the NB part of the name). To understand the intuition behind Naive Bayes, let's start with a minimal example: suppose we only had one feature, the letter sequence "eon", and any given input could either contain this feature or not contain it. Given some specific input, how do we decide the most likely label for that input: Pokemon or software? Well, as noted in Phase 1, we could count all the times "eon" appears in Pokemon names and all the times it appears in software names. If you did this in real life, you would find that "eon" is much more common in Pokemon names.

  • RHS Cow speaking

    For those in the know, that's thanks to all the Eevee evolutions whose names end in "-eon".

So, if the specific input contains the feature "eon", we would intuitively think that it's more likely to belong to the "Pokemon" class.

Of course, this was an overly simplistic example. In reality, there are two complicating factors.

First, the above logic didn't account for the base probability of a given text being a Pokemon name. After all, maybe there are a lot more apps / tech startups in the world than there are Pokemon! In that case, intuition tells us we should bias our decision more towards the "software" class, and only tip our decision towards "Pokemon" given extremely strong evidence. Mathematically, we can account for this by factoring in the probability \( P(c) \) of each class \( c \), which we can again estimate by counting (e.g., by counting what fraction of samples in the training set are Pokemon).

Second, the above example involved only one feature, but real data involves many features, each with their own probabilities. How do we combine all the probabilities? There's no easy answer to this, and arguably the entire field of machine learning is all about coming up with theories about what the right approach is. But the most naive approach is to pretend that all the features are independent—that is, the presence of one feature tells us nothing about the presence of any other features.

  • LHS Cow speaking

    Hay, that assumption is clearly false! Like, the letters "scr" are often followed by "ipt" because of the suffix "-script" in programming language names!

  • RHS Cow speaking

    Well yes, that's why we call it a naive assumption. It's by no means a realistic one, but in many cases it results in estimates that are surprisingly "good enough".

If probabilities are independent, the laws of probability say that we're allowed to combine them by simply multiplying them together. Taking advantage of this means that Naive Bayes is able to implement a shockingly simple scoring function for computing how likely it is that an input \( i \), with features \( f \in i \), belongs to a class \( c \):

$$ \mathrm{score}(i,c) = P(c) \prod_{f \in i} P(f|c) $$

(Of course, we would actually do this math in log space).

So, in pseudocode, the Naive Bayes algorithm for classifying a given input is:

INITIALIZE bestScore = -inf
INITIALIZE bestClass = ""
foreach possible class c:
    COMPUTE score(i, c) using the above equation
        (in log space: start with log P(c), then add log P(f|c) for each feature f in the input)
    if score > bestScore:
        UPDATE bestScore = score
        UPDATE bestClass = c
RETURN bestClass

To Complete This Part of the Assignment…

You'll know you're done with this part of the assignment when you've done all of the following:

(When logged in, completion status appears here.)