CS 151: Programming Assignment 5 [PAIR OPTION]
Hidden Markov Models for Sketch Recognition [90 points +10 extra credit]
due Thursday, April 9, 11:55pm
Provided files for this assignment:
- code.zip -- The code to start with for this assignment
- data.zip -- Sketch training/testing data
- domain.txt -- the domain file for the labeler
- labeler.zip -- A handy utility for viewing the sketches (only runs on Windows; see instructions below)
The purpose of this assignment is twofold:
- To implement a core Hidden Markov Model algorithm: the Viterbi algorithm
- To explore how to use an HMM to solve the sketch recognition problem
In this assignment you will be implementing a Hidden Markov Model (HMM)
to solve a core problem in sketch recognition: how to determine whether
a stroke is part of a drawing or part of some text.
You will need to submit two files:
- strokeHMM.py: The modified version of the HMM file I provide
- results.txt: A text file with an analysis of your results, description of your features, etc.
The motivation behind this assignment comes from a recent paper in the sketch recognition/machine learning literature:
C. M. Bishop, M. Svensén, and G. E. Hinton. "Distinguishing Text from Graphics in On-line Handwritten Ink." In F. Kimura and H. Fujisawa, eds., Proceedings of the 9th International Workshop on Frontiers in Handwriting Recognition (IWFHR-9), 2004, pp. 142-147.
The authors of this paper propose to use an HMM to distinguish between
text and graphics in a sketch. Their model is slightly more
complex than ours in that they use the output of a neural network as
input to the observation layer to the HMM. However, the spirit is
almost identical. I recommend that you read this paper (though it
is in no way required).
Background
Our model is a standard HMM, with one node at the hidden layer (the state node, either drawing or text) and one node at the observation layer (the stroke node, whose value represents one or more stroke features). Our goal is, given a sequence of values for the stroke nodes from t=1...n, to estimate the most likely sequence of state
values for the same time frame. This is a most-likely-explanation
problem, as we discussed in class, and you can solve it using the
Viterbi algorithm. Most of the code to solve this problem is
provided for you (and described below). You will need to fill in
only a small part (also described below).
Defining the Model
First, let's look at the probability models that define the HMM.
The first model to consider is P(state_0), the prior probability over
the first stroke's state (either text or drawing). This
probability distribution is represented as a simple table, e.g.:
'drawing': 0.6
'text': 0.4
Meaning that there is a 0.6 probability that the first stroke is a
drawing stroke, and a 0.4 probability that the first stroke is a text
stroke.
The next model is our transition model, P(state_t+1 | state_t ).
Because our state values are simple and discrete, this
conditional probability distribution (CPD) can also be
represented as a table, e.g.:
'drawing': {'drawing': 0.8, 'text': 0.2 }
'text': {'drawing': 0.5, 'text':0.5 }
Meaning that there is a 0.8 probability that a drawing stroke will be
followed by another drawing stroke, and a 0.5 probability that a text
stroke will be followed by a text stroke.
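Concretely, both models above can be stored as (nested) dictionaries. The sketch below uses the example numbers from the tables; the variable names are illustrative and do not necessarily match the provided code:

```python
# Illustrative only: names here need not match the provided code.
prior = {'drawing': 0.6, 'text': 0.4}

transition = {
    'drawing': {'drawing': 0.8, 'text': 0.2},
    'text':    {'drawing': 0.5, 'text': 0.5},
}

# Probability that stroke 1 is text AND stroke 2 is drawing:
p = prior['text'] * transition['text']['drawing']  # 0.4 * 0.5 = 0.2
```

Note that each row of the transition table (and the prior) sums to 1, since each is a probability distribution.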
Finally, we must consider the evidence model. The evidence model
is a bit more complicated. A "stroke" is not a nice, discrete
value. In fact, it's a list of points, represented as (x, y,
time) tuples. Therefore, to reason about strokes in our
HMM, we must convert each stroke into one or more features.
For example, let's consider only one feature, the stroke's length.
In this case, the value of stroke_t is simply the length of the
stroke drawn at time t, i.e., just a number. P(stroke_t |
state_t) is then a distribution over the possible stroke lengths,
given the classification (state) of that stroke.
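For instance, a length feature could be computed by summing the point-to-point distances along the stroke. This is just a sketch (the provided Stroke class may compute length differently):

```python
import math

def stroke_length(points):
    """Total length of a stroke given as a list of (x, y, time) tuples."""
    return sum(math.dist(p[:2], q[:2]) for p, q in zip(points, points[1:]))

# A simple L-shaped stroke: right 3 units, then up 4 units.
stroke = [(0, 0, 0), (3, 0, 10), (3, 4, 20)]
stroke_length(stroke)  # 7.0
```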
Problem #1: Continuous features
Unfortunately, however, length is not a discrete feature, so our CPD
cannot be represented with a simple table. There are two
solutions to this problem, both of which you may choose to explore.
The first solution is to simply turn length into a discrete
feature by "binning" it. For example, we might define all strokes
less than 300 units in length to be "short" and all other strokes to be
"long". Given this discretization, we can now represent this CPD
as a table, e.g.
'drawing': {stroke=short: 0.2, stroke=long: 0.8}
'text': {stroke=short: 0.6, stroke=long: 0.4}
The second solution is to leave the feature as a continuous valued
feature, and fit a Gaussian distribution over the values of the
features, so we use a Gaussian to represent the CPD rather than a table.
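Both options are easy to express in code. In this sketch the 300-unit threshold comes from the example above, while the Gaussian parameters would be fit from training data (the function names are illustrative, not from the provided code):

```python
import math

def bin_length(length, threshold=300):
    """Discretize: strokes shorter than the threshold are 'short'."""
    return 'short' if length < threshold else 'long'

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian fit to the observed feature values."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

bin_length(120)   # 'short'
bin_length(450)   # 'long'
```

With the discrete version, the CPD is a table indexed by 'short'/'long'; with the continuous version, the CPD for each class is just a (mean, std) pair.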
Problem #2: Multiple features
The second problem arises when we decide to use more than one feature
to represent our strokes. If we consider all values of all
features in combination, the feature space becomes very large and very
complicated, and it will be hard to model in our CPD. To simplify
the feature space, we will make the feature independence assumption.
That is, we assume that given the state, all features are
independent from one another. This assumption allows us to
represent the CPD for multiple features as:
P(stroke | class ) = P(f1, f2, ..., fn | class ) = P(f1 | class) * P(f2 | class) *...* P(fn | class)
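With that assumption, the observation probability for a multi-feature stroke is a product of per-feature terms. In practice it is safer to sum logs rather than multiply raw probabilities, to avoid underflow on long sequences. A sketch with made-up tables (the feature names and numbers are toy examples):

```python
import math

# Per-feature CPDs, P(feature value | class) -- toy numbers for illustration.
cpds = {
    'length':    {'drawing': {'short': 0.2, 'long': 0.8},
                  'text':    {'short': 0.6, 'long': 0.4}},
    'curvature': {'drawing': {'low': 0.3, 'high': 0.7},
                  'text':    {'low': 0.9, 'high': 0.1}},
}

def log_obs_prob(features, label):
    """log P(f1, ..., fn | label) under the feature independence assumption."""
    return sum(math.log(cpds[name][label][value])
               for name, value in features.items())

features = {'length': 'short', 'curvature': 'low'}
# P(short, low | text) = 0.6 * 0.9 = 0.54
p_text = math.exp(log_obs_prob(features, 'text'))
```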
Now that you understand the basic model, you can rest assured that all
of the code to implement this model is already provided to you.
So, let's take a look at the code...
Provided Code
I have provided code that implements most of the functionality of this
assignment, to make your job of playing with different features a
little easier and more fun. Here's an overview of the provided
code. Please see the comments in the code for more details.
class HMM
Defines the basic functionality of an HMM. Algorithms for
learning the CPDs for discrete and continuous features are already
implemented. You only need to write the label
function, which applies the Viterbi algorithm to return the most likely
sequence of states, given a sequence of stroke data (features).
class strokeLabeler
Defines a stroke labeler class. There are utility functions to
load and save files in an XML format that can be opened in the labeler
(see below). There are also functions to train the HMM provided
for you. You need to know how to use these functions.
The functions in strokeLabeler that you will need to modify are the following:
- The constructor: you need to modify self.featureNames, self.contOrDisc, and self.numFVals as you play with different features
- featurefy: This
function converts a stroke into a dictionary of features. You'll
need to add to it as you play with different features.
- featureTest: This
function is handy for testing and debugging your feature code and you
may want to modify it as you play with different features
You may modify other functions if you want. You will also need to add more functions to compute recognition statistics.
class Stroke
Defines a stroke class. A stroke is a list of points each with an
x, y and time value. It is in this class that you should add your
function(s) to complete new feature(s).
The Assignment
Part 1: Implement the Viterbi algorithm [40 points]
Your first task is to complete the label function in the HMM
class. This function should simply implement the Viterbi
algorithm that we discussed in class to find the most likely sequence
of labels, given a sequence of strokes (represented as features).
CAUTION: Python code for the
Viterbi algorithm is available on Wikipedia. Therefore, referring
to the Wikipedia Viterbi site is a violation of the honor code (as is
looking at code anywhere else on the web).
IMPORTANT: To debug your algorithm, I suggest you DO NOT run it on
stroke data. It will be impossible to tell if you are correct.
You should instead add the necessary code to hand-construct an HMM that
we have looked at in class examples (or one from the book) and verify
that your algorithm gives you the same output. If you do not
implement this algorithm correctly the rest of the assignment will
simply be frustrating.
Part 2: Build and test your best classifier [50 points + 10 E.C.]
Once you have Viterbi running, take a minute to train and test your
classifier using the feature provided in the code (stroke length,
binned into two discrete values). I suggest you use only a subset
of the provided files for training. Start small (5 or so), and
you'll play around with the number of training files below.
This classifier does OK on some files, and not so hot on others.
But how can you tell exactly how well it does? Two ways:
visually (optional) and quantitatively (required).
Optional: Visualizing your classification.
If you want to look at the sketches you are classifying, and the
classification your algorithm produced, you can use the labeler
software you downloaded above. Unfortunately, this software only
runs on Windows, and you need to be able to install some libraries to
make it run, so you won't be able to run it on lab computers. If
you do not have access to a Windows machine (yours or a friend's), you
can come borrow a tablet computer from me.
To run the labeler, unzip the file you downloaded, and download the domain.txt file from above.
If you are running on standard Windows (not a tablet), you will also need to install the following: http://www.microsoft.com/downloads/details.aspx?familyid=B46D4B83-A821-40BC-AA85-C9EE3D6E9699&displaylang=en
Once you have unzipped the software, go into the
2008-07-30->Programs->Labeler folder and run Labeler.exe (the
shortcut from the outer directory does not work).
Once inside the labeler, click "Load Domain" and choose the domain.txt
file you downloaded above (probably actually called text_drawing.txt or
something similar). Then click "Open Sketch" and load any xml
sketch file. If you load a file that has been labeled by your
HMM, you should see text strokes in blue and drawing strokes in red.
Required: Quantifying the classification accuracy
To quantify the classification accuracy in this case, we will use a
structure called a "confusion matrix". This is simply a matrix
that lists how different strokes were classified, given their true
labels. For example:
                  Classified As
True Label        Drawing    Text    Percent Correct
  Drawing            30       10          75%
  Text                5       20          80%
That is, out of 40 drawing strokes total, 30 were correctly classified
and 10 were not. Out of 25 text strokes, 20 were classified
correctly, and 5 were not.
In the strokeLabeler class, write a function called
confusion(trueLabels, classifications) that takes a list of true labels
and a list of labels output by the classifier for the same set of
strokes (in order). It should return a dictionary with the
following structure (based on the table above):
{'drawing': {'drawing': 30, 'text': 10}, 'text': {'drawing': 5, 'text': 20}}
That is, the entries in the main dictionary should be the true labels,
and each true label should map to another dictionary giving
classification numbers for those strokes.
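From a dictionary in that format, statistics like the per-class accuracy in the table's last column fall out directly. A sketch, using the toy counts from the table above (the function name is illustrative):

```python
# Toy confusion dictionary in the format described above.
confusion = {'drawing': {'drawing': 30, 'text': 10},
             'text':    {'drawing': 5,  'text': 20}}

def percent_correct(confusion, label):
    """Fraction of strokes with the given true label that were classified correctly."""
    row = confusion[label]
    return row[label] / sum(row.values())

percent_correct(confusion, 'drawing')  # 0.75
percent_correct(confusion, 'text')     # 0.8
```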
Exploring More Features
Finally you should build your classifier by exploring and including at least one more feature.
WARNING: You are going to have to compare the basic classifier to your
best classifier. So make a COPY of your strokeHMM.py file and
call it strokeHMMbasic.py so you can always run this original
classifier. Put your modifications in your strokeHMM.py file.
Note that I have already provided code to calculate the curvature of a
stroke, so you might choose to try to use curvature to improve your
classifier. However, you are encouraged to try adding another (or
more than one) feature. Possible features you might want to
implement include:
- Distance from side of sketch (or top/bottom of sketch)
- Bounding box area
- Bounding box height/width ratio
- Drawing speed (max, min, average)
- Proximity to nearest neighbor
This list is just a suggestion. Feel free to get creative.
You also might want to experiment with making features discrete or
continuous, and you will certainly want to be methodical about how to
discretize your features. If you get REALLY ambitious you might
try modeling continuous features with a more complex distribution than
a simple Gaussian or relaxing the independence assumption between
features (this will require modifying a substantial amount of the
provided code).
You will submit your best classifier as your strokeHMM.py file.
Once again, I will hold a tournament (time permitting) to
determine the best classifier, but the outcome of this tournament will
not affect your grade. Particularly ambitious approaches (as
described below in the writeup) will receive up to 10 points of extra
credit.
Writeup
Finally, address each of the following in your results.txt file:
- Discuss the process you use to build your best classifier. Be sure to address each of the following:
- Why did you choose the feature(s) you did?
- How did you decide continuous vs. discrete?
- How did you determine thresholds for discrete features?
- Choose a set of training files and a set of test files (for this
assignment it's OK if the sets overlap). Give the confusion
matrix that results from running the naive (basic) classifier and the
confusion matrix that results from running your best classifier.