IST 380 homework page

Claremont Graduate University
Data Science Programming (IST 380)
Fall 2013

Hw #7: Clusters!

HW 7 ~ due Saturday, Apr. 6, 2013
(linked!) Here are the in-class examples. In particular, that file includes some examples of the `kmeans` and `hclust` functions, as well as some special-purpose functions, that are part of this week's second problem.

As usual, for this assignment, please submit a zipped folder named `hw7.zip` that contains a few files. Here is a full list of the files, including some extra-credit options:

`pr1.txt` should be your history (or console) interactions for Chapter 14 of the Data Science book, entitled Word Perfect. This chapter introduces the ways in which R can interact with other large software resources, such as MySQL databases and Hadoop-based computational clusters, and it guides you through two examples in which you connect R to widely used external software.

For the second problem, there are two unlabeled datasets which you'll be analyzing using the unsupervised learning techniques (clustering) we looked at last week. Here are those two datasets:

- mystery_digits1.csv
- mystery_digits2.csv

For each dataset, your task is to

- Run k-means analysis with several different numbers of centers on the data and plot the amount of variance explained for each of those numbers.
- Run a hierarchical clustering with at least three different cluster-combining approaches or "linkages": single-linkage, called via `hclust( dists, method="single" )`, complete-linkage, which is `method="complete"`, and variance-minimizing linkage, which is `method="ward"`. Plot each of these.
- In a Word (or other) document, include these k-means and hierarchical plots. In addition, include an explanation of which linkage type (within hierarchical clustering) seems most useful for these datasets. Finally, explain your guesses for the number of different clusters in each of these two datasets.
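The two clustering steps described above could be sketched roughly as follows. This is only an illustration on synthetic two-dimensional data, not on the mystery datasets; the toy data, the range of centers tried, and all variable names are assumptions. "Variance explained" is taken here to mean `betweenss / totss` from the `kmeans` result, which is one common reading of the phrase. Recent versions of R name the Ward linkages `"ward.D"` and `"ward.D2"`, so the sketch uses `"ward.D2"` where the text says `"ward"`.

```r
set.seed(380)  # arbitrary seed so the sketch is repeatable

# Toy data (an assumption): three 2-D blobs of 50 points each.
centers <- rbind(c(0, 0), c(6, 0), c(0, 6))
toy <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(50, centers[i, 1]), rnorm(50, centers[i, 2]))
}))

# k-means for several numbers of centers; record the fraction of
# total variance captured between clusters for each choice.
ks <- 2:8
var_explained <- sapply(ks, function(k) {
  fit <- kmeans(toy, centers = k, nstart = 10)
  fit$betweenss / fit$totss
})
plot(ks, var_explained, type = "b",
     xlab = "number of centers",
     ylab = "fraction of variance explained")

# Hierarchical clustering with three linkages on the same distances.
dists <- dist(toy)
linkages <- c("single", "complete", "ward.D2")
trees <- lapply(linkages, function(m) hclust(dists, method = m))
names(trees) <- linkages
for (m in linkages) plot(trees[[m]], labels = FALSE, main = m)
```

On data like this, the variance-explained curve tends to flatten once the number of centers reaches the number of real groups, and the dendrograms make it possible to compare how cleanly each linkage separates those groups.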
Because the goal of this problem is to use `kmeans` and `hclust`, you should certainly feel free to look at the digits themselves (using the function `plot.data` provided last week).

Sometimes additional information is available that can help with clustering decisions. In this case the `k` clusters within each dataset are approximately the same size (though not the same size across the two datasets). This would occur, for example, if you knew that the data were from some process that sampled the different possibilities nearly uniformly.

Submit your `doc` file in the `hw7.zip` archive.

Extra! Extra credit of up to +5 points is available for finding one of the mislabeled digits for each class you determine is present in the two datasets above. For each one, include a sentence or two of explanation of why the clustering algorithms may have misclassified it. In order to know (for sure) what has been misclassified, you'd need to know the correct labels for each digit. Since the digits are clear (by looking at them), it's enough simply to say that in the datasets, the rows themselves are also clustered per digit, i.e., all of the first-digit rows appear, then all of the second-digit rows appear, and so on.
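The row-ordering hint suggests one way to check cluster assignments against the true classes. Here is a minimal sketch, again on synthetic data rather than the mystery datasets. It assumes the number of classes `k` has already been chosen and that each class occupies an equal-sized block of consecutive rows, which the text says holds only approximately. All names are illustrative.

```r
set.seed(380)  # arbitrary seed so the sketch is repeatable

# Toy data (an assumption): rows ordered by class, 40 rows per class.
k <- 3
n_per <- 40
centers <- rbind(c(0, 0), c(4, 0), c(0, 4))
toy <- do.call(rbind, lapply(1:k, function(i) {
  cbind(rnorm(n_per, centers[i, 1]), rnorm(n_per, centers[i, 2]))
}))

# Labels implied by row order: block 1, then block 2, and so on.
row_label <- rep(1:k, each = n_per)

fit <- kmeans(toy, centers = k, nstart = 10)

# Cross-tabulate row blocks against cluster ids.
tab <- table(block = row_label, cluster = fit$cluster)
print(tab)

# Majority cluster id for each block; rows assigned elsewhere are
# candidates for "misclassified" points worth plotting and explaining.
majority <- as.integer(colnames(tab))[apply(tab, 1, which.max)]
suspects <- which(fit$cluster != majority[row_label])
print(suspects)
```

Cluster ids from `kmeans` are arbitrary, so the sketch matches each row block to its majority cluster instead of comparing ids directly.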