CS 151: Programming Assignment 5 [PAIR OPTION]
Hidden Markov Models for Sketch Recognition [90 points +10 extra credit]
due Thursday, April 9, 11:55pm
Provided files for this assignment:
- code.zip -- The code to start with for this assignment
- data.zip -- Sketch training/testing data
- domain.txt -- the domain file for the labeler
- labeler.zip -- A handy utility for viewing the sketches (only runs on Windows; see instructions below)
The purpose of this assignment is twofold:
- To implement a core Hidden Markov Model algorithm: the Viterbi algorithm
- To explore how to use an HMM to solve the sketch recognition problem
In this assignment you will be implementing a Hidden Markov Model (HMM)
to solve a core problem in sketch recognition: how to determine whether
a stroke is part of a drawing or part of some text.
You will need to submit two files:
- strokeHMM.py: The modified version of the HMM file I provide
- results.txt: A text file with an analysis of your results, description of your features, etc.
The motivation behind this assignment comes from a recent paper in the sketch recognition/machine learning literature:
C. M. Bishop, M. Svensén, and G. E. Hinton. "Distinguishing Text from Graphics in On-line Handwritten Ink." In F. Kimura and H. Fujisawa, eds., Proceedings of the 9th International Workshop on Frontiers in Handwriting Recognition (IWFHR-9), 2004, pp. 142-147.
The authors of this paper propose to use an HMM to distinguish between
text and graphics in a sketch. Their model is slightly more
complex than ours in that they use the output of a neural network as
input to the observation layer to the HMM. However, the spirit is
almost identical. I recommend that you read this paper (though it
is in no way required).
Background
Our model is a standard HMM, with one node at the hidden layer (the state node, either drawing or text) and one node at the observation layer (the stroke node, whose value represents one or more stroke features). Our goal is, given a sequence of values for the stroke nodes from t=1...n, to estimate the most likely sequence of state
values for the same time frame. This is a most-likely-explanation
problem, as we discussed in class, and you can solve it using the
Viterbi algorithm. Most of the code to solve this problem is
provided for you (and described below). You will need to fill in
only a small part (also described below).
Defining the Model
First, let's look at the probability models that define the HMM.
The first model to consider is P(state_0), the prior probability over
the first stroke's state (either text or drawing). This
probability distribution is represented as a simple table, e.g.:
'drawing': 0.6
'text': 0.4
Meaning that there is a 0.6 probability that the first stroke is a
drawing stroke, and a 0.4 probability that the first stroke is a text
stroke.
The next model is our transition model, P(state_t+1 | state_t ).
Because our state values are simple and discrete, this
conditional probability distribution (CPD) can also be
represented as a table, e.g.:
'drawing': {'drawing': 0.8, 'text': 0.2 }
'text': {'drawing': 0.5, 'text':0.5 }
Meaning that there is a 0.8 probability that a drawing stroke will be
followed by another drawing stroke, and a 0.5 probability that a text
stroke will be followed by a text stroke.
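Concretely, both models above can be stored as (nested) dictionaries. The sketch below uses the example numbers from the tables; the variable names are illustrative and do not necessarily match the provided code:

```python
# Illustrative only: names here need not match the provided code.
prior = {'drawing': 0.6, 'text': 0.4}

transition = {
    'drawing': {'drawing': 0.8, 'text': 0.2},
    'text':    {'drawing': 0.5, 'text': 0.5},
}

# Probability that stroke 1 is text AND stroke 2 is drawing:
p = prior['text'] * transition['text']['drawing']  # 0.4 * 0.5 = 0.2
```

Note that each row of the transition table (and the prior) sums to 1, since each is a probability distribution.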
Finally, we must consider the evidence model. The evidence model
is a bit more complicated. A "stroke" is not a nice, discrete
value. In fact, it's a list of points, represented as (x, y,
time) tuples. Therefore, to reason about strokes in our
HMM, we must convert each stroke into one or more features.
For example, let's consider only one feature, the stroke's length.
In this case, the value of stroke_t is simply the length of the
stroke drawn at time t, i.e., just a number. P(stroke_t |
state_t) is then a distribution over the possible stroke lengths,
given the classification (state) of that stroke.
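For instance, a length feature could be computed by summing the point-to-point distances along the stroke. This is just a sketch (the provided Stroke class may compute length differently):

```python
import math

def stroke_length(points):
    """Total length of a stroke given as a list of (x, y, time) tuples."""
    return sum(math.dist(p[:2], q[:2]) for p, q in zip(points, points[1:]))

# A simple L-shaped stroke: right 3 units, then up 4 units.
stroke = [(0, 0, 0), (3, 0, 10), (3, 4, 20)]
stroke_length(stroke)  # 7.0
```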
Problem #1: Continuous features
Unfortunately, however, length is not a discrete feature, so our CPD
cannot be represented with a simple table. There are two
solutions to this problem, both of which you may choose to explore.
The first solution is to simply turn length into a discrete
feature by "binning" it. For example, we might define all strokes
less than 300 units in length to be "short" and all other strokes to be
"long". Given this discretization, we can now represent this CPD
as a table, e.g.
'drawing': {stroke=short: 0.2, stroke=long: 0.8}
'text': {stroke=short: 0.6, stroke=long: 0.4}
The second solution is to leave the feature as a continuous valued
feature, and fit a Gaussian distribution over the values of the
features, so we use a Gaussian to represent the CPD rather than a table.
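Both options are easy to express in code. In this sketch the 300-unit threshold comes from the example above, while the Gaussian parameters would be fit from training data (the function names are illustrative, not from the provided code):

```python
import math

def bin_length(length, threshold=300):
    """Discretize: strokes shorter than the threshold are 'short'."""
    return 'short' if length < threshold else 'long'

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian fit to the observed feature values."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

bin_length(120)   # 'short'
bin_length(450)   # 'long'
```

With the discrete version, the CPD is a table indexed by 'short'/'long'; with the continuous version, the CPD for each class is just a (mean, std) pair.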
Problem #2: Multiple features
The second problem arises when we decide to use more than one feature
to represent our strokes. If we consider all values of all
features in combination, the feature space becomes very large and very
complicated, and it will be hard to model in our CPD. To simplify
the feature space, we will make the feature independence assumption.
That is, we assume that given the state, all features are
independent from one another. This assumption allows us to
represent the CPD for multiple features as:
P(stroke | class ) = P(f1, f2, ..., fn | class ) = P(f1 | class) * P(f2 | class) *...* P(fn | class)
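With that assumption, the observation probability for a multi-feature stroke is a product of per-feature terms. In practice it is safer to sum logs rather than multiply raw probabilities, to avoid underflow on long sequences. A sketch with made-up tables (the feature names and numbers are toy examples):

```python
import math

# Per-feature CPDs, P(feature value | class) -- toy numbers for illustration.
cpds = {
    'length':    {'drawing': {'short': 0.2, 'long': 0.8},
                  'text':    {'short': 0.6, 'long': 0.4}},
    'curvature': {'drawing': {'low': 0.3, 'high': 0.7},
                  'text':    {'low': 0.9, 'high': 0.1}},
}

def log_obs_prob(features, label):
    """log P(f1, ..., fn | label) under the feature independence assumption."""
    return sum(math.log(cpds[name][label][value])
               for name, value in features.items())

features = {'length': 'short', 'curvature': 'low'}
# P(short, low | text) = 0.6 * 0.9 = 0.54
p_text = math.exp(log_obs_prob(features, 'text'))
```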
Now that you understand the basic model, you can rest assured that all
of the code to implement this model is already provided to you.
So, let's take a look at the code...
Provided Code
I have provided code that implements most of the functionality of this
assignment, to make your job of playing with different features a
little easier and more fun. Here's an overview of the provided
code. Please see the comments in the code for more details.
class HMM
Defines the basic functionality of an HMM. Algorithms for
learning the CPDs for discrete and continuous features are already
implemented. You only need to write the label
function, which applies the Viterbi algorithm to return the most likely
sequence of states, given a sequence of stroke data (features).
class strokeLabeler
Defines a stroke labeler class. There are utility functions to
load and save files in an XML format that can be opened in the labeler
(see below). There are also functions to train the HMM provided
for you. You need to know how to use these functions.
The functions in strokeLabeler that you will need to modify are the following:
- The constructor: you need to modify self.featureNames, self.contOrDisc, and self.numFVals as you play with different features
- featurefy: This
function converts a stroke into a dictionary of features. You'll
need to add to it as you play with different features.
- featureTest: This
function is handy for testing and debugging your feature code and you
may want to modify it as you play with different features
You may modify other functions if you want. You will also need to add more functions to compute recognition statistics.
class Stroke
Defines a stroke class. A stroke is a list of points each with an
x, y and time value. It is in this class that you should add your
function(s) to complete new feature(s).
The Assignment
Part 1: Implement the Viterbi algorithm [40 points]
Your first task is to complete the label function in the HMM
class. This function should simply implement the Viterbi
algorithm that we discussed in class to find the most likely sequence
of labels, given a sequence of strokes (represented as features).
CAUTION: Python code for the
Viterbi algorithm is available on Wikipedia. Therefore, referring
to the Wikipedia Viterbi site is a violation of the honor code (as is
looking at code anywhere else on the web).
IMPORTANT: To debug your algorithm, I suggest you DO NOT run it on
stroke data. It will be impossible to tell if you are correct.
You should instead add the necessary code to hand-construct an HMM that
we have looked at in class examples (or one from the book) and verify
that your algorithm gives you the same output. If you do not
implement this algorithm correctly the rest of the assignment will
simply be frustrating.
Part 2: Build and test your best classifier [50 points + 10 E.C.]
Once you have Viterbi running, take a minute to train and test your
classifier using the feature provided in the code (stroke length,
binned into two discrete values). I suggest you use only a subset
of the provided files for training. Start small (5 or so), and
you'll play around with the number of training files below.
This classifier does OK on some files, and not so hot on others.
But how can you tell exactly how well it does? Two ways:
visually (optional) and quantitatively (required).
Optional: Visualizing your classification.
If you want to look at the sketches you are classifying, and the
classification your algorithm produced, you can use the labeler
software you downloaded above. Unfortunately, this software only
runs on Windows, and you need to be able to install some libraries to
make it run, so you won't be able to run it on lab computers. If
you do not have access to a Windows machine (yours or a friend's), you
can come borrow a tablet computer from me.
To run the labeler, unzip the file you downloaded, and download the domain.txt file from above.
If you are running on standard Windows (not a tablet), you will also need to install the following: http://www.microsoft.com/downloads/details.aspx?familyid=B46D4B83-A821-40BC-AA85-C9EE3D6E9699&displaylang=en
Once you have unzipped the software, go into the
2008-07-30->Programs->Labeler folder and run Labeler.exe (the
shortcut from the outer directory does not work).
Once inside the labeler, click "Load Domain" and choose the domain.txt
file you downloaded above (probably actually called text_drawing.txt or
something similar). Then click "Open Sketch" and load any xml
sketch file. If you load a file that has been labeled by your
HMM, you should see text strokes in blue and drawing strokes in red.
Required: Quantifying the classification accuracy
To quantify the classification accuracy in this case, we will use a
structure called a "confusion matrix". This is simply a matrix
that lists how different strokes were classified, given their true
labels. For example:
                  Classified As
True Label        Drawing    Text    Percent Correct
  Drawing            30       10          75%
  Text                5       20          80%
That is, out of 40 drawing strokes total, 30 were correctly classified
and 10 were not. Out of 25 text strokes, 20 were classified
correctly, and 5 were not.
In the strokeLabeler class, write a function called
confusion(trueLabels, classifications) that takes a list of true labels
and a list of labels output by the classifier for the same set of
strokes (in order). It should return a dictionary with the
following structure (based on the table above):
{'drawing': {'drawing': 30, 'text': 10}, 'text': {'drawing': 5, 'text': 20}}
That is, the entries in the main dictionary should be the true labels,
and each true label should map to another dictionary giving
classification numbers for those strokes.
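From a dictionary in that format, statistics like the per-class accuracy in the table's last column fall out directly. A sketch, using the toy counts from the table above (the function name is illustrative):

```python
# Toy confusion dictionary in the format described above.
confusion = {'drawing': {'drawing': 30, 'text': 10},
             'text':    {'drawing': 5,  'text': 20}}

def percent_correct(confusion, label):
    """Fraction of strokes with the given true label that were classified correctly."""
    row = confusion[label]
    return row[label] / sum(row.values())

percent_correct(confusion, 'drawing')  # 0.75
percent_correct(confusion, 'text')     # 0.8
```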
Exploring More Features
Finally you should build your classifier by exploring and including at least one more feature.
WARNING: You are going to have to compare the basic classifier to your
best classifier. So make a COPY of your strokeHMM.py file and
call it strokeHMMbasic.py so you can always run this original
classifier. Put your modifications in your strokeHMM.py file.
Note that I have already provided code to calculate the curvature of a
stroke, so you might choose to try to use curvature to improve your
classifier. However, you are encouraged to try adding another (or
more than one) feature. Possible features you might want to
implement include:
- Distance from side of sketch (or top/bottom of sketch)
- Bounding box area
- Bounding box height/width ratio
- Drawing speed (max, min, average)
- Proximity to nearest neighbor
This list is just a suggestion. Feel free to get creative.
You also might want to experiment with making features discrete or
continuous, and you will certainly want to be methodical about how to
discretize your features. If you get REALLY ambitious you might
try modeling continuous features with a more complex distribution than
a simple Gaussian or relaxing the independence assumption between
features (this will require modifying a substantial amount of the
provided code).
You will submit your best classifier as your strokeHMM.py file.
Once again, I will hold a tournament (time permitting) to
determine the best classifier, but the outcome of this tournament will
not affect your grade. Particularly ambitious approaches (as
described below in the writeup) will receive up to 10 points of extra
credit.
Writeup
Finally, address each of the following in your results.txt file:
- Discuss the process you use to build your best classifier. Be sure to address each of the following:
- Why did you choose the feature(s) you did?
- How did you decide continuous vs. discrete?
- How did you determine thresholds for discrete features?
- Choose a set of training files and a set of test files (for this
assignment it's OK if the sets overlap). Give the confusion
matrix that results from running the naive (basic) classifier and the
confusion matrix that results from running your best classifier.