Claremont Graduate University
Data Science Programming (IST 380)
Fall 2013

Hw #5: Strings and trees

HW 5 ~ due Tuesday, Mar. 12, 2013


Here are the in-class examples (linked), to make it easy to follow along.

In particular, after each comment is a block of examples that you can copy and paste into R's console to run. You may need to install packages as they appear, but this hasn't caused a problem so far.

As usual, for this assignment, please submit a zipped folder that contains a few files, including one named pr1.txt with your interactive session (or history) from working through Chapter 12 of the Data Science book (string processing to analyze Tweets).
Here is a full list of the files, including some extra-credit options:

  1. pr1.txt should be your history (or console) interactions for the book's Chapter 12. This adds string-processing to your modeling of Twitter feeds -- including extracting retweets, URLs and (optionally) hashtags.

    In addition, please include a file named pr1.R with your definition of an R function named retweeters. It should take a data frame of Tweets as input and output a list of the unique source names of retweeted messages. This is the "challenge" posed at the end of the chapter.

    [Optional] If you'd like to try the suggested extension at the end of the chapter, you're welcome to! It's totally optional and is worth up to +5 points of extra credit. The task is to write a function, call it hashes, that takes in a data frame of Tweets and returns all of the unique hashtags found in those Tweets, along with the number of times each one was found. The result can be a list of two vectors or, perhaps more naturally, a data frame.
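
As a possible starting point for retweeters and hashes, here is a minimal sketch using base R regular expressions. It assumes the Tweets are stored in a column named text and that a retweet follows the "RT @name" convention; both are assumptions, and the toy data frame demo is made up for illustration. The chapter's own approach may differ, so treat this as one option rather than the expected answer.

```r
# Sketch only. Assumptions: the tweet text lives in a column named `text`,
# and a retweet contains "RT @screenname".
retweeters <- function(tweetDF) {
  txt <- as.character(tweetDF$text)
  # regexpr() finds the first "RT @name" in each tweet; regmatches()
  # returns only the tweets that actually matched
  hits <- regmatches(txt, regexpr("RT @[A-Za-z0-9_]+", txt))
  # strip the leading "RT @" and keep each source name once
  # (returned as a character vector; wrap in as.list() if a list is wanted)
  unique(sub("^RT @", "", hits))
}

hashes <- function(tweetDF) {
  txt <- as.character(tweetDF$text)
  # gregexpr() finds every hashtag in every tweet
  tags <- unlist(regmatches(txt, gregexpr("#[A-Za-z0-9_]+", txt)))
  # count hashtags case-insensitively
  counts <- table(tolower(tags))
  data.frame(hashtag = names(counts), count = as.vector(counts),
             stringsAsFactors = FALSE)
}

# Made-up example data, not taken from the book
demo <- data.frame(text = c("RT @alice: R is fun #rstats",
                            "no retweet here #RStats #data",
                            "RT @bob: hello",
                            "RT @alice: again"),
                   stringsAsFactors = FALSE)
retweeters(demo)
hashes(demo)
```

Lower-casing the hashtags before counting is a design choice; drop the tolower() call if "#RStats" and "#rstats" should be counted separately.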

  2. For problem 2, you will want to return to the Titanic-survivor dataset of a few weeks ago. It is linked here in train742.csv. For this week's assignment, however, you should use R's tree package to create a tree-based predictive model in pr2.R for whether or not a passenger would have survived the sinking of the Titanic. You should include an R file that contains your model, along with a pr2.txt or pr2.doc file that describes how you arrived at your model.

    Your analysis should include some of the important facets we touched on in week 5's class, including
    • a tree that is based on a subset of the Titanic variables (you may use them all or you may cull them away or convert them to a more suitable form)
    • your tree should be checked by cross-validation in order to determine how many leaves it could usefully have in order to avoid overfitting the data
    • your tree should be pruned to the size you choose (based on the cross-validation analysis)
    • You should include a table of how many of the 742 observations in train742.csv are correctly (and incorrectly) classified by the tree
    • We will run your tree on both that data and some additional data that is not included in it.
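
The steps listed above (grow, cross-validate, prune, tabulate) might look roughly like the following sketch with the tree package. The data here is synthetic and the column names (sex, pclass, age, survived) are hypothetical; they may not match the columns in train742.csv, so substitute the real data frame and variables.

```r
library(tree)   # run install.packages("tree") first if needed

# Synthetic stand-in data with hypothetical column names.
# For the assignment, start from read.csv("train742.csv") instead.
set.seed(380)
n <- 300
toy <- data.frame(
  sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  pclass = factor(sample(1:3, n, replace = TRUE)),
  age    = round(runif(n, 1, 70))
)
p_survive <- ifelse(toy$sex == "female", 0.8, 0.2)
toy$survived <- factor(ifelse(runif(n) < p_survive, "yes", "no"))

# 1. Grow a classification tree (the response must be a factor)
fit <- tree(survived ~ sex + pclass + age, data = toy)

# 2. Cross-validate, scoring each candidate size by misclassifications
cv <- cv.tree(fit, FUN = prune.misclass)

# Choose the smallest size that achieves the lowest cross-validated error,
# but keep at least 2 leaves so the tree still makes a split
best_size <- max(2, min(cv$size[cv$dev == min(cv$dev)]))

# 3. Prune to the chosen number of leaves
pruned <- prune.misclass(fit, best = best_size)

# 4. Table of correct and incorrect classifications
pred <- predict(pruned, newdata = toy, type = "class")
confusion <- table(predicted = pred, actual = toy$survived)
confusion
```

Plotting cv$size against cv$dev is a useful way to see where additional leaves stop helping, which can support the explanation in pr2.txt.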

  3. [Optional] The third problem this week is also entirely optional (and worth up to +5 points of extra credit, similar to the hashes function). It is to build a logistic regression model based on the Titanic-survival dataset that results in another predictor for Titanic survival.

    My hunch is that the tree-based models will work better than the logistic regression, but we'll see if that's really true. Plus, there is certainly no requirement that your logistic model work better than the tree-based models (but it should work better than pure chance!)
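
For the optional logistic regression, a minimal sketch using base R's glm is shown below. As before, the data is synthetic and the column names are hypothetical, and the 0.5 cutoff is just a common default rather than a requirement. The last lines compare accuracy against always guessing the majority class, which is one way to check that the model beats chance.

```r
# Synthetic stand-in data with hypothetical column names
set.seed(380)
n <- 300
toy <- data.frame(
  sex = factor(sample(c("female", "male"), n, replace = TRUE)),
  age = round(runif(n, 1, 70))
)
toy$survived <- as.integer(runif(n) < ifelse(toy$sex == "female", 0.8, 0.2))

# Logistic regression: binomial family with the default logit link
fit <- glm(survived ~ sex + age, data = toy, family = binomial)

# Predicted survival probabilities, turned into 0/1 predictions at 0.5
prob <- predict(fit, newdata = toy, type = "response")
pred <- as.integer(prob > 0.5)

# Table of correct and incorrect classifications
confusion <- table(predicted = pred, actual = toy$survived)
confusion

# Accuracy versus the "always guess the majority class" baseline
accuracy <- mean(pred == toy$survived)
baseline <- max(mean(toy$survived), 1 - mean(toy$survived))
c(accuracy = accuracy, baseline = baseline)
```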