This section asks you to explore PCA for optical character recognition. The data and scripts you will need for this section of the assignment can be found in the directory:

/cs/cs153/digits

In that directory you will find the following files:
R = mean + sum(p_i * E_i * sqrt(lambda_i)) for i = 1 to 50,

where p_i is the ith coefficient of the projection, lambda_i is the ith eigenvalue, and E_i is the ith eigenvector, or principal component.
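The reconstruction formula above can be sketched as follows. This is an illustrative translation into Python/NumPy rather than the assignment's MATLAB; the dimensions, eigenvectors, and coefficients here are small made-up stand-ins, not the digit data.

```python
import numpy as np

# Hypothetical small example: D-dimensional data, k principal components.
rng = np.random.default_rng(0)
D, k = 4, 3
mean = rng.standard_normal(D)                      # mean vector of the data
E, _ = np.linalg.qr(rng.standard_normal((D, k)))   # orthonormal eigenvectors E_i (columns)
lam = np.array([3.0, 2.0, 1.0])                    # eigenvalues lambda_i
p = np.array([0.5, -1.2, 0.8])                     # projection coefficients p_i

# R = mean + sum_i p_i * E_i * sqrt(lambda_i), computed without a loop:
R = mean + E @ (p * np.sqrt(lam))
```

The matrix product `E @ (p * np.sqrt(lam))` is exactly the sum over i of `p[i] * sqrt(lam[i]) * E[:, i]`, just written in vectorized form.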
What do you notice?
function class = nearest(input, data, target);

to classify input examples using a nearest-neighbor classifier on the given data. input is a DxM matrix in which each column represents an example in D dimensions. data is a DxN matrix containing N training examples, arranged in columns, of D dimensions each. target is a 1xN matrix containing the correct classification for each of the N training examples. class is a 1xM matrix of classifications for the input points.
For each input example, your function needs to compute the (squared) Euclidean distance between that example and every training data point, then locate the closest training point and classify the input point according to that training point's classification.
NOTE: You definitely want to use matrix manipulation rather than loops. The whole function can be written in about 6 lines without using any loops. However, in this case you probably do NOT want to simply copy every input point N times, as you may have done in k-means. Instead, note that the squared Euclidean distance between two points a and b can be written as sum((a_i - b_i)(a_i - b_i)) over all i, which expands to sum(a_i^2) - 2*sum(a_i*b_i) + sum(b_i^2). This expression translates nicely into MATLAB code that does not produce enormous intermediate data structures.
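One way the expansion above plays out is sketched below, in Python/NumPy rather than the assignment's MATLAB (the function name and argument order mirror the handout's `nearest`, but this is an illustration, not the reference solution):

```python
import numpy as np

def nearest(inputs, data, target):
    """Nearest-neighbor classifier using the expanded squared-distance trick.

    inputs: D x M array of points to classify (columns are examples).
    data:   D x N array of training examples (columns are examples).
    target: length-N array of labels for the training examples.
    Returns a length-M array of predicted labels.
    """
    # ||a - b||^2 = sum(a_i^2) - 2*sum(a_i*b_i) + sum(b_i^2), for all pairs at once:
    a2 = np.sum(inputs ** 2, axis=0)[:, None]   # M x 1: squared norms of inputs
    b2 = np.sum(data ** 2, axis=0)[None, :]     # 1 x N: squared norms of training points
    cross = inputs.T @ data                     # M x N: dot products a . b
    dist2 = a2 - 2 * cross + b2                 # M x N squared distances
    return np.asarray(target)[np.argmin(dist2, axis=1)]
```

The largest intermediate here is the M x N distance matrix itself; nothing of size D x M x N is ever materialized, which is the point of the expansion.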
When writing your classifier, test it on the image data. (If you use all the test images, it will likely run quite slowly; you might want to use a subset.)
Once your classifier works, use it to classify reduced versions of the data, obtained by projecting both the training and test data down onto the eigenvectors of the training data only. To compress the test images using the principal components found from the training set, just pass those eigenvalues and eigenvectors to the projectpca function as arguments.
Each time you run the classifier, keep track of the total error (the number of misclassified examples). Plot the error versus the number of principal components used, to illustrate how reducing the data affects classification accuracy. Also comment (qualitatively) on any timing differences you noticed. What is a reasonable number of principal components to preserve while still achieving reasonable classification?
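The experiment loop just described might be sketched as below. This is an illustration in Python/NumPy rather than the assignment's MATLAB, using random stand-in data; in your own code, the digit images would replace `train`/`test`, and the projectpca function would replace the explicit eigendecomposition and projection steps shown here.

```python
import numpy as np

# Stand-in data (random, for illustration only).
rng = np.random.default_rng(1)
D, Ntrain, Ntest = 20, 100, 30
train = rng.standard_normal((D, Ntrain))
test = rng.standard_normal((D, Ntest))
train_labels = rng.integers(0, 10, Ntrain)
test_labels = rng.integers(0, 10, Ntest)

# Eigenvectors of the training data only.
mean = train.mean(axis=1, keepdims=True)
lam, E = np.linalg.eigh(np.cov(train - mean))
E = E[:, np.argsort(lam)[::-1]]          # columns sorted by decreasing eigenvalue

ks = [1, 2, 5, 10, 20]                   # numbers of principal components to try
errors = []
for k in ks:
    P = E[:, :k]                         # top-k principal components
    train_k = P.T @ (train - mean)       # k x Ntrain reduced training data
    test_k = P.T @ (test - mean)         # k x Ntest reduced test data
    # Nearest-neighbor classification in the reduced space:
    d2 = ((test_k ** 2).sum(0)[:, None]
          - 2 * test_k.T @ train_k
          + (train_k ** 2).sum(0)[None, :])
    pred = train_labels[np.argmin(d2, axis=1)]
    errors.append(int((pred != test_labels).sum()))
```

In MATLAB, the final plot would then just be plot(ks, errors).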
Remember that you need to pursue only one extension in one part of the assignment, and it can be anything you'd like to investigate. These are only suggestions.