CS 70

Feature Engineering

  • Goat speaking

    69% accuracy is extremely meh. Can't we do better?

  • LHS Cow speaking

    Well, to some extent we're limited by the simplicity of the algorithm we're using...

  • RHS Cow speaking

    ...But we could possibly squeeze out a few more percentage points of accuracy by changing the features!

As we described in Phase 1, our machine learning algorithm never reads the input directly; instead, each input gets represented as a vector of features. For simplicity, we chose to use a relatively straightforward set of features: three-letter sequences found in the input. But features can be any property of the input. Things like the number of characters in the input, whether or not there is punctuation, or what the most common single character is could all count as features. Choosing the right set of features can make a big difference to the accuracy of the classifier, as it is possible that the "wrong" choice of features could obscure or omit critical information about the input that is relevant to making the right prediction. The process of coming up with features to use is called feature engineering, and it is a deeply human process that relies just as much on intuition and "vibes" as it does on mathematical theory.

If you want to change what features are used by our classifier, you will need to edit the extractFeatures function defined in nbclassifier-demo.cpp. This aptly named function takes a single input and returns a vector of features representing that input. You should see that, right now, it just extracts consecutive length-3 substrings. You may, of course, find it useful to write extra helper functions, but any new features you implement must eventually make their way into the vector that is returned by extractFeatures.
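For reference, the current behavior can be sketched as follows. This is an illustrative reconstruction, not the exact code in nbclassifier-demo.cpp (the real signature and types there may differ):

```cpp
#include <string>
#include <vector>

// Sketch of the current extractFeatures: every consecutive
// length-3 substring of the input becomes one feature.
std::vector<std::string> extractFeatures(const std::string& input) {
    std::vector<std::string> features;
    for (size_t i = 0; i + 3 <= input.size(); ++i) {
        features.push_back(input.substr(i, 3));
    }
    return features;
}
```

For example, the input "aoitu" would yield the features "aoi", "oit", and "itu"; inputs shorter than three characters yield no features at all.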

If you choose to go down this route, your Fun.md should explicitly state the final accuracy you got after making your changes (which will hopefully be higher than 69%).

Adding Features

  • Pig speaking

    I know what we can do! MORE features must be better, right?

It is true that adding new features can improve accuracy, but you have to be clever about what features you add. Adding features that provide redundant information probably won't help very much. Instead, you want to think about information that is not directly encoded by the existing features. Here's one example: right now, our classifier has no way of knowing the number of vowels in an input. Suppose an input had the features "aoi" and "itu". How many vowels are in the original input? You might say five: a, o, i, i, and u. But we don't know whether the aoi and itu were overlapping in this case. If they were (e.g., the input originally looked something like "aoitu"), then our count would have been off by one. So, writing code to separately identify all the individual vowels in the input, and adding them as features, would provide additional information beyond the existing features. And that information could be relevant if, for example, you have reason to think that Pokemon names and software names tend to have different frequencies of vowels.
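One way to add vowel information could look like the sketch below. The "vowel:" naming scheme is an illustrative choice of ours, not anything prescribed by nbclassifier-demo.cpp, and the real extractFeatures there may be structured differently:

```cpp
#include <string>
#include <vector>

// Sketch: keep the existing length-3 substring features, and also
// append one feature per vowel occurrence in the input.
std::vector<std::string> extractFeatures(const std::string& input) {
    std::vector<std::string> features;
    // Existing features: consecutive length-3 substrings.
    for (size_t i = 0; i + 3 <= input.size(); ++i) {
        features.push_back(input.substr(i, 3));
    }
    // New features: one "vowel:x" feature for each vowel in the input,
    // so the model sees every vowel even when substrings overlap.
    const std::string vowels = "aeiouAEIOU";
    for (char c : input) {
        if (vowels.find(c) != std::string::npos) {
            features.push_back(std::string("vowel:") + c);
        }
    }
    return features;
}
```

On the input "aoitu" this would produce the three substring features plus "vowel:a", "vowel:o", "vowel:i", and "vowel:u", correctly counting four distinct vowel occurrences even though the substrings overlap.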

Removing Features

While this may be counterintuitive, removing features can also help improve accuracy. This is because if a specific feature is very rare, it may actually just introduce noise that confuses the model. For instance, if the Pokemon "Jangmo-o" is in the training data, the model might learn to associate "ngm" with Pokemon, as that feature is unlikely to be seen anywhere else. But this intuitively feels like it's circumstantial evidence at best; if we then saw a new input "opengm" we would probably want the model to guess this is software, without being distracted by the presence of "ngm".

So, what you could do is check each individual feature and see how rare it is. If a feature is rare (i.e., its count is lower than some arbitrary threshold) you can just discard it (i.e., don't insert it into the vector). Note that to make this work, you will probably need to set up a bit of extra infrastructure involving a second HashMultiset that gets populated in a "first pass" over the training data before the main training loop starts (since the logic requires that you know the counts of all the features during training).
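The two-pass idea could be sketched like this. We stand in for the course's HashMultiset with a std::unordered_map from feature to count, and the threshold of 2 is an arbitrary illustrative choice; the names countFeatures and filterRare are ours, not part of the starter code:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Arbitrary illustrative cutoff: features seen fewer than this many
// times across the training data are discarded as noise.
const size_t RARE_THRESHOLD = 2;

// First pass: count every feature across the whole training set.
// (In the actual assignment, a second HashMultiset would play this role.)
std::unordered_map<std::string, size_t> countFeatures(
    const std::vector<std::vector<std::string>>& allFeatures) {
    std::unordered_map<std::string, size_t> counts;
    for (const auto& featureList : allFeatures) {
        for (const auto& f : featureList) {
            ++counts[f];
        }
    }
    return counts;
}

// Second pass: keep only features whose global count meets the threshold.
std::vector<std::string> filterRare(
    const std::vector<std::string>& features,
    const std::unordered_map<std::string, size_t>& counts) {
    std::vector<std::string> kept;
    for (const auto& f : features) {
        auto it = counts.find(f);
        if (it != counts.end() && it->second >= RARE_THRESHOLD) {
            kept.push_back(f);
        }
    }
    return kept;
}
```

With this setup, a feature like "ngm" that appears only once in the training data would be dropped before the main training loop ever sees it.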
