Here are the in-class examples.
In particular, that file includes some examples of the kmeans
and hclust functions -- as well as some special-purpose functions
-- that are part of this week's second problem...
As usual, for this assignment, please submit a zipped folder named hw7.zip
that contains a few files: one named pr1.txt that includes
your interactive session (or history)
working through Chapter 14 of the Data Science book, entitled Word
Perfect. That chapter guides
you through two examples in which you connect R to
widely used external software.
Here is a full list of the files, including some extra-credit options:
- pr1.txt should be your history (or console)
interactions for the book's Chapter 14. This chapter
introduces the ways in which R can interact with other
large software resources, such as MySQL databases and Hadoop-based
For this problem, there are two unlabeled datasets which you'll be
analyzing using the unsupervised learning techniques (clustering)
we looked at last week. Here are those two datasets:
For each dataset, your task is to
- Run k-means analysis with several different numbers-of-centers on
the data and plot the amount of variance explained for each of those
- Run a hierarchical clustering with at least three different
cluster-combining approaches or "linkages": single-linkage, called
via hclust( dists, method="single" ), complete-linkage, which
is method="complete", and variance-minimizing linkage,
which is method="ward" (renamed method="ward.D" in R 3.1.0 and
later). Plot each of these.
- In a Word (or other) document, include these k-means and
hierarchical plots. In addition, include an explanation of which
linkage-type (within hierarchical clustering) seems most useful for these
- Finally, explain your guesses for the number of different clusters
in each of these two datasets. Because the goal of this problem is to use
kmeans and hclust, you should certainly feel free to
look at the digits themselves (using the function plot.data
provided last week).
- Sometimes additional information is available that can help with
clustering decisions. In this case the k clusters within each
dataset are approximately the same size (though not the same size across
the two datasets). This would occur, for example, if you knew that the
data were from some process that sampled the different possibilities
Submit your doc file in the hw7.zip archive.
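As a rough starting point for the k-means and hierarchical steps above, here is a minimal sketch. It runs on a small synthetic matrix X, not on the assignment's datasets, and assumes rows are observations and columns are numeric features. The names X, ks, var_explained, and km3 are made up for illustration. Variance explained is taken here to be betweenss / totss from the kmeans result, which is one common choice, and "ward.D" is used because recent versions of R treat method="ward" as "ward.D".

```r
# Synthetic stand-in for a dataset (assumption: three well-separated groups
# of 50 rows each; replace X with the real data matrix).
set.seed(42)
X <- rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 5),  ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)

# Run k-means for several numbers of centers and record the fraction of
# total variance explained by the clustering (between-cluster SS / total SS).
ks <- 2:8
var_explained <- sapply(ks, function(k) {
  km <- kmeans(X, centers = k, nstart = 10)
  km$betweenss / km$totss
})
plot(ks, var_explained, type = "b",
     xlab = "number of centers (k)",
     ylab = "fraction of variance explained")

# Hierarchical clustering with three linkages, one dendrogram per linkage.
dists <- dist(X)
for (m in c("single", "complete", "ward.D")) {
  hc <- hclust(dists, method = m)
  plot(hc, labels = FALSE, main = paste("hclust, method =", m))
}

# The hint about roughly equal-sized clusters can be checked by tabulating
# the cluster sizes for a candidate k.
km3 <- kmeans(X, centers = 3, nstart = 10)
print(table(km3$cluster))
```

A bend in the variance-explained curve, together with roughly balanced counts in the table, is one piece of evidence for a particular k. It is worth comparing against the dendrograms before settling on a guess.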
- Extra! Extra-credit of up to +5
points is available for finding one of the mislabeled digits for
each class you determine is present in the two datasets above. For each
one, include a sentence-or-two of explanation of why the
clustering algorithms may have misclassified it.
In order to know (for sure) what has been misclassified, you'd
need to know the correct labels for each digit. Since the digits are
clear (by looking at them), it's enough simply to say that in the
datasets, the rows themselves are also clustered per digit,
i.e., all of the first-digit rows appear, then all of the second-digit
rows appear, and so on.
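One way to use that row ordering is sketched below, again on synthetic data. It assumes the rows fall into k consecutive blocks of exactly equal size, whereas the real datasets are only approximately balanced, so the block boundaries would need adjusting. The names X, block, tab, majority, and suspects are hypothetical.

```r
# Synthetic stand-in: two groups of 50 rows each, stored in row order.
set.seed(1)
n_per <- 50
X <- rbind(
  matrix(rnorm(n_per * 2, mean = 0), ncol = 2),
  matrix(rnorm(n_per * 2, mean = 4), ncol = 2)
)
k <- 2
km <- kmeans(X, centers = k, nstart = 10)

# Provisional labels from row order: consecutive, equal-sized blocks
# (an assumption; adjust the boundaries for the real data).
block <- rep(seq_len(k), each = nrow(X) / k)

# Cross-tabulate row blocks against k-means cluster ids.
tab <- table(block = block, cluster = km$cluster)
print(tab)

# For each block, find the cluster id that holds most of its rows.
majority <- as.integer(colnames(tab))[apply(tab, 1, which.max)]

# Rows whose cluster differs from their block's majority cluster are
# candidates for "misclassified" points worth inspecting.
suspects <- which(km$cluster != majority[block])
print(suspects)
```

Any row indices in suspects are only candidates. Looking at those rows directly is still needed before explaining why the clustering may have placed them elsewhere.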