This section asks you to explore PCA for optical character recognition. The data and scripts you will need for this section of the assignment can be found in the directory:

/cs/cs153/digits

In that directory you will find the following files:
R = mean + sum(p_i * E_i * sqrt(lambda_i)) for i = 1 to 50,

where p_i is the ith coefficient of the projection, lambda_i is the ith eigenvalue, and E_i is the ith eigenvector, or principal component.
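The reconstruction formula above can be sketched as follows. This is an illustrative translation into Python/NumPy rather than the assignment's MATLAB; the dimensions, eigenvectors, and coefficients here are small made-up stand-ins, not the digit data.

```python
import numpy as np

# Hypothetical small example: D-dimensional data, k principal components.
rng = np.random.default_rng(0)
D, k = 4, 3
mean = rng.standard_normal(D)                      # mean vector of the data
E, _ = np.linalg.qr(rng.standard_normal((D, k)))   # orthonormal eigenvectors E_i (columns)
lam = np.array([3.0, 2.0, 1.0])                    # eigenvalues lambda_i
p = np.array([0.5, -1.2, 0.8])                     # projection coefficients p_i

# R = mean + sum_i p_i * E_i * sqrt(lambda_i), computed without a loop:
R = mean + E @ (p * np.sqrt(lam))
```

The matrix product `E @ (p * np.sqrt(lam))` is exactly the sum over i of `p[i] * sqrt(lam[i]) * E[:, i]`, just written in vectorized form.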
What do you notice?
function class = nearest(input, data, target);

to classify input examples using a nearest-neighbor classifier on the given data. input is a DxM matrix in which each column represents an example in D dimensions. data is a DxN matrix containing N training examples, arranged in columns, of D dimensions each. target is a 1xN matrix containing the correct classification for each of the N training examples. class is a 1xM matrix of classifications for the input points.
For each input example, your function needs to compute the (squared) Euclidean distance between that example and every training data point, then locate the closest training point and classify the input point according to that training point's classification.
NOTE: You definitely want to use matrix manipulation rather than loops. The whole function can be written in about 6 lines without using any loops. However, in this case you probably do NOT want to simply copy every input point N times, as you may have done in k-means. Instead, note that the squared Euclidean distance between two points a and b can be written as sum((a_i - b_i)(a_i - b_i)) over all i, which expands to sum(a_i^2) - 2*sum(a_i*b_i) + sum(b_i^2). This expression translates nicely into MATLAB code that does not produce enormous intermediate data structures.
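One way the expansion above plays out is sketched below, in Python/NumPy rather than the assignment's MATLAB (the function name and argument order mirror the handout's `nearest`, but this is an illustration, not the reference solution):

```python
import numpy as np

def nearest(inputs, data, target):
    """Nearest-neighbor classifier using the expanded squared-distance trick.

    inputs: D x M array of points to classify (columns are examples).
    data:   D x N array of training examples (columns are examples).
    target: length-N array of labels for the training examples.
    Returns a length-M array of predicted labels.
    """
    # ||a - b||^2 = sum(a_i^2) - 2*sum(a_i*b_i) + sum(b_i^2), for all pairs at once:
    a2 = np.sum(inputs ** 2, axis=0)[:, None]   # M x 1: squared norms of inputs
    b2 = np.sum(data ** 2, axis=0)[None, :]     # 1 x N: squared norms of training points
    cross = inputs.T @ data                     # M x N: dot products a . b
    dist2 = a2 - 2 * cross + b2                 # M x N squared distances
    return np.asarray(target)[np.argmin(dist2, axis=1)]
```

The largest intermediate here is the M x N distance matrix itself; nothing of size D x M x N is ever materialized, which is the point of the expansion.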
When writing your classifier, test it on the image data. (If you use all the test images, it will likely run quite slowly; you might want to use a subset.)
Once your classifier works, use it to classify reduced versions of the data, obtained by projecting both the training and test data down onto the eigenvectors of the training data only. To compress the test images using the principal components found from the training set, just pass those eigenvalues and eigenvectors to the projectpca function as arguments.
Each time you run the classifier, keep track of the total error (the number of misclassified examples). Plot the error versus the number of principal components used, to illustrate how reducing the data affects classification accuracy. Also comment (qualitatively) on any timing differences you noticed. What is a reasonable number of principal components to preserve while still achieving reasonable classification?
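The experiment loop just described might be sketched as below. This is an illustration in Python/NumPy rather than the assignment's MATLAB, using random stand-in data; in your own code, the digit images would replace `train`/`test`, and the projectpca function would replace the explicit eigendecomposition and projection steps shown here.

```python
import numpy as np

# Stand-in data (random, for illustration only).
rng = np.random.default_rng(1)
D, Ntrain, Ntest = 20, 100, 30
train = rng.standard_normal((D, Ntrain))
test = rng.standard_normal((D, Ntest))
train_labels = rng.integers(0, 10, Ntrain)
test_labels = rng.integers(0, 10, Ntest)

# Eigenvectors of the training data only.
mean = train.mean(axis=1, keepdims=True)
lam, E = np.linalg.eigh(np.cov(train - mean))
E = E[:, np.argsort(lam)[::-1]]          # columns sorted by decreasing eigenvalue

ks = [1, 2, 5, 10, 20]                   # numbers of principal components to try
errors = []
for k in ks:
    P = E[:, :k]                         # top-k principal components
    train_k = P.T @ (train - mean)       # k x Ntrain reduced training data
    test_k = P.T @ (test - mean)         # k x Ntest reduced test data
    # Nearest-neighbor classification in the reduced space:
    d2 = ((test_k ** 2).sum(0)[:, None]
          - 2 * test_k.T @ train_k
          + (train_k ** 2).sum(0)[None, :])
    pred = train_labels[np.argmin(d2, axis=1)]
    errors.append(int((pred != test_labels).sum()))
```

In MATLAB, the final plot would then just be plot(ks, errors).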
Remember that you need to pursue only one extension in one part of the assignment, and it can be anything you'd like to investigate. These are only suggestions.