Here are the in-class examples
to make it easy to follow along...
In particular, after each comment is a block of examples -- you'll be able
to copy-and-paste those into R's console in order to run them. You
may need to install packages as they appear, but this hasn't caused a
problem so far.
As usual, for this assignment, please submit a zipped folder named hw5.zip
that contains a few files: one named pr1.txt that includes
your interactive session (or history)
working through Chapter 12 of the Data Science book
(string-processing to analyze Tweets).
Here is a full list of the files, including some extra-credit options:
- pr1.txt should be your history (or console)
interactions for the book's Chapter 12. This adds string-processing
to your modeling of Twitter feeds -- including extracting retweets,
URLs and (optionally) hashtags.
In addition, please include a file named pr1.R, which
should have your definition for an R function named retweeters,
which should take in a data frame of Tweets and output
a list of unique source names of retweeted messages. This is the "challenge"
posed at the end of the chapter.
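As a starting point, here is a minimal sketch of what retweeters might look like. It assumes the data frame stores the raw Tweet text in a column named text and that retweets begin with the conventional "RT @username" prefix -- both are assumptions, so adjust the column name and pattern to match the chapter's data.

```r
# Sketch of retweeters(): assumes a "text" column and the "RT @user" convention.
retweeters <- function(tweets) {
  # grab the first "RT @username" match in each Tweet (if any)
  matches <- regmatches(tweets$text,
                        regexpr("RT @\\w+", tweets$text))
  # strip the "RT @" prefix, leaving just the source name
  sources <- sub("^RT @", "", matches)
  unique(sources)
}

# example usage with a tiny made-up data frame:
df <- data.frame(text = c("RT @alice: hello", "just a tweet",
                          "RT @bob: hi", "RT @alice: again"),
                 stringsAsFactors = FALSE)
retweeters(df)   # "alice" "bob"
```

Note that regexpr finds only the first match per Tweet, which is all we need here since a retweet prefix appears at most once at the start of the message.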
[Optional] If you'd like to try the suggested extension
at the end of the chapter, you're welcome to! It's totally optional
and is worth up to +5 points of extra credit. It's to write a function, call
it hashes, that takes in a data frame of Tweets and returns a list
of all of the unique hashtags found in those Tweets, along with the
number of times each one was found. The result can be a list of two
vectors or, perhaps more naturally, a data frame.
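Here is one hedged sketch of the hashes idea, again assuming a text column holding the Tweet text. The key difference from retweeters is gregexpr, which finds every hashtag in each Tweet rather than just the first.

```r
# Sketch of hashes(): assumes a "text" column; gregexpr finds all matches.
hashes <- function(tweets) {
  found  <- regmatches(tweets$text, gregexpr("#\\w+", tweets$text))
  tags   <- unlist(found)          # flatten the per-Tweet lists of hashtags
  counts <- table(tags)            # tally each unique hashtag
  # return one row per unique hashtag, with its count
  data.frame(hashtag = names(counts),
             count = as.integer(counts),
             stringsAsFactors = FALSE)
}

# example usage:
df <- data.frame(text = c("loving #rstats and #data",
                          "more #rstats today"),
                 stringsAsFactors = FALSE)
hashes(df)
```

If you prefer the list-of-two-vectors form, you could instead return list(hashtag = names(counts), count = as.integer(counts)).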
- For problem 2, you will want to return to the Titanic-survivor dataset of
a few weeks ago. It is linked here in train742.csv.
For this week's assignment, however, you should use R's tree package
to create a tree-based predictive model in pr2.R for whether or not a passenger
would have survived the sinking of the Titanic. You should include an R file
that contains your model, along with a pr2.txt or pr2.doc file
that describes how you arrived at your model.
Your analysis should include some of the important facets we touched on in
week 5's class, including
- a tree that is based on a subset of the Titanic variables (you may use them all or
you may cull them away or convert them to a more suitable form)
- your tree should be checked by cross-validation in order to determine
how many leaves it could usefully have in order to avoid overfitting the data
- your tree should be pruned to the size you choose (based on the cross-validation)
- You should include a table of how many of the 742 training observations are correctly
(and incorrectly) classified by the tree
- We will run your tree on both that training data and some additional test data
that's not included in the training set...
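The whole workflow can be sketched as below. The column names (Survived, Pclass, Sex, Age) are assumptions about what train742.csv contains, and the best = 4 in the pruning step is a placeholder -- read the right size off your own cross-validation plot.

```r
# Sketch of the pr2.R workflow; column names are assumed, adjust to the file.
library(tree)

titanic <- read.csv("train742.csv")
titanic$Survived <- factor(titanic$Survived)  # classification, not regression
titanic$Sex      <- factor(titanic$Sex)

# fit a tree on a subset of the variables
fit <- tree(Survived ~ Pclass + Sex + Age, data = titanic)

# cross-validate to choose a tree size that avoids overfitting
cv <- cv.tree(fit, FUN = prune.misclass)
plot(cv$size, cv$dev, type = "b")   # look for the size with smallest deviance

# prune to the chosen size (4 is a placeholder -- use your plot's answer)
pruned <- prune.misclass(fit, best = 4)

# confusion table: rows are predictions, columns the true labels
preds <- predict(pruned, titanic, type = "class")
table(preds, titanic$Survived)
```

The call to prune.misclass (rather than the default prune.tree) tells cv.tree to score candidate subtrees by misclassification rate, which is the natural criterion for a survived/died classifier.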
The third problem this week is also entirely optional (and worth up to +5 points of
extra-credit, similar to the hashes function). It is to build a logistic
regression model based on the Titanic-survival dataset that results in another
predictor for Titanic survival.
My hunch is that the tree-based models will work better than the logistic regression, but
we'll see if that's really true. Plus, there is certainly no requirement that
your logistic model work better than the tree-based models (but it should work better
than pure chance!)
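For the optional logistic model, glm() with family = binomial is all base R, so no extra packages are needed. For the assignment you would of course fit on train742.csv; the sketch below instead expands R's built-in Titanic contingency table into one row per passenger, purely so the example is self-contained and runnable as-is.

```r
# Sketch of a logistic-regression survival model. Uses R's built-in
# Titanic table (expanded to one row per passenger) as stand-in data;
# for the assignment, read train742.csv instead.
titanic <- as.data.frame(Titanic)
titanic <- titanic[rep(seq_len(nrow(titanic)), titanic$Freq), 1:4]

fit <- glm(Survived ~ Class + Sex + Age, data = titanic,
           family = binomial)

# predicted probabilities above 0.5 count as predicted survivors
probs <- predict(fit, type = "response")
preds <- ifelse(probs > 0.5, "Yes", "No")
acc   <- mean(preds == titanic$Survived)
acc   # accuracy on the fitted data -- should beat pure chance
```

A useful sanity check: compare acc against the fraction of the majority class (always predicting "did not survive"); beating that baseline is a stronger claim than beating a coin flip.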