Claremont Graduate University
Data Science Programming (IST 380)
Fall 2013

Hw #6: Forests (and clouds!)


HW 6 ~ due Tuesday, Mar. 26, 2013


(linked!)    Here are the in-class examples to make it easy to follow along...

In particular, after each comment is a block of examples -- you'll be able to copy-and-paste those into R's console in order to run them. You may need to install packages as they appear, too.

As usual, for this assignment, please submit a zipped folder that contains a few files: one named pr1.txt that includes your interactive session (or history) working through Chapter 13 of the Data Science book, entitled Word Perfect. That chapter guides you through a first-pass text analysis of Tweet data and then builds a word cloud out of the terms.
Here is a full list of the files, including some extra-credit options:

  1. pr1.txt should be your history (or console) interactions for the book's Chapter 13. This introduces the tm (text mining) package to the R libraries we've used and shows some of its most important functions. For fun, it also introduces the wordcloud package!

    Also, include a file named pr1.R, which should contain your definition of an R function named make_cloud. That function should take an input search term and then, following the suggestions and functions in the chapter, create a wordcloud from Tweets gathered using that search term. This is the "challenge" posed at the end of the chapter.

    In addition, please include two (or more) image files with pictures of your word clouds from different search terms. [Optional] Up to +5 points are available for adding artistic touches, such as color, to your word clouds (you may have to investigate the wordcloud documentation to do this...).
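For reference, here is one possible sketch of make_cloud, assuming the chapter's packages (twitteR, tm, and wordcloud) are installed and that searchTwitter works as it did when the chapter was written; the exact cleanup steps and the brewer.pal color choice are suggestions, not the chapter's required solution:

```r
library(twitteR)    # searchTwitter
library(tm)         # Corpus, tm_map, TermDocumentMatrix
library(wordcloud)  # wordcloud (also loads RColorBrewer for brewer.pal)

make_cloud <- function(term, n = 100) {
  tweets <- searchTwitter(term, n = n)              # gather Tweets for the term
  texts  <- sapply(tweets, function(t) t$getText()) # pull out the raw text
  corpus <- Corpus(VectorSource(texts))             # build a tm corpus
  corpus <- tm_map(corpus, tolower)                 # normalize case
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  tdm    <- TermDocumentMatrix(corpus)              # term-by-document counts
  counts <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  # the colors argument is one way to earn the artistic-touch points
  wordcloud(names(counts), counts, min.freq = 2,
            colors = brewer.pal(8, "Dark2"))
}
```

Note that newer versions of tm may want tm_map(corpus, content_transformer(tolower)) instead of tm_map(corpus, tolower).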

  2. For problem 2, you will want to return to the Titanic-survivor dataset from a few weeks ago. It is linked here in train742.csv. For this week's assignment, however, you should use R's randomForest package to create a forest-based predictive model in pr2.R for whether or not a passenger would have survived the sinking of the Titanic. Your predictive model should
    • first remove from the dataset any columns that are not useful or usable (for example, the randomForest package will not handle categorical variables with more than 32 distinct levels, such as passenger names)
    • next, adjust the data formatting so that it's as useful as possible, e.g., making sure the survived column is treated as a factor variable! Other changes that are helpful to you are certainly welcome!
    • then, impute values for the NA entries using the rfImpute function, as we did with the iris data in class
    • finally, create a random forest based on the Titanic data; then (either together or in a separate script) run that forest against the original data and report what percentage of the passengers it classifies correctly.
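The steps above might be sketched as follows; the column names (survived, name) are assumptions about train742.csv, so adjust them to match the actual file:

```r
library(randomForest)   # randomForest, rfImpute

titanic <- read.csv("train742.csv")
titanic$name <- NULL    # drop columns with >32 levels (e.g., passenger names)
titanic$survived <- as.factor(titanic$survived)  # classification, not regression

# impute values for the NA entries, then grow the forest
imputed <- rfImpute(survived ~ ., data = titanic)
fit     <- randomForest(survived ~ ., data = imputed)

# run the forest against the (imputed) original data and report accuracy
preds <- predict(fit, imputed)
cat("correctly classified:", 100 * mean(preds == imputed$survived), "%\n")
```

Predicting on the training data, as asked here, will look optimistic; the forest's out-of-bag error (printed by print(fit)) gives a more honest estimate.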