Here are the in-class examples, to make it easy to follow along.
In particular, after each comment is a block of examples -- you'll be able
to copy-and-paste those into R's console in order to run them. You
may need to install packages as they appear, too.
As usual, for this assignment, please submit a zipped folder named hw6.zip
that contains a few files: one named pr1.txt that includes
your interactive session (or history)
working through Chapter 13 of the Data Science book, entitled Word
Perfect. That chapter guides
you through a first-pass text analysis of Tweet data and then builds a word-cloud
out of the terms.
Here is a full list of the files, including some extra-credit options:
- pr1.txt should be your history (or console)
interactions for the book's Chapter 13. This introduces the tm (text
mining) package to the R libraries we've used and shows some of its
most important functions. For fun, it also introduces the wordcloud
package.
- Also include a file named pr1.R, which
should have your definition for an R function named make_cloud,
which should take an input search term and then, following the
suggestions and functions in the chapter,
create a wordcloud from Tweets gathered using that search term.
This is the "challenge" posed at the end of the chapter.
In addition, please include two (or more) image files showing
your word clouds from different search terms.
[Optional] Up to +5 points are available
for adding artistic touches, such as color, to your word clouds (you may
have to investigate the wordcloud documentation to do this...).
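To show the overall shape of make_cloud, here is a minimal sketch. It assumes the twitteR, tm, and wordcloud packages used in the chapter (the exact cleaning steps and function names should follow whatever the chapter walks you through), and it assumes your Twitter API authentication has already been set up:

```r
library(twitteR)    # tweet gathering (requires prior API setup)
library(tm)         # text mining: corpora, cleaning, term matrices
library(wordcloud)  # drawing the cloud

make_cloud <- function(search_term, n = 100) {
  # gather tweets matching the search term and pull out their text
  tweets <- searchTwitter(search_term, n = n)
  texts  <- sapply(tweets, function(t) t$getText())

  # build a corpus and apply the chapter-style cleanup
  corpus <- Corpus(VectorSource(texts))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))

  # tally term frequencies from a term-document matrix
  tdm   <- TermDocumentMatrix(corpus)
  freqs <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

  # draw the cloud from the most frequent terms
  wordcloud(names(freqs), freqs, min.freq = 2, random.order = FALSE)
}

# example call:
# make_cloud("datascience")
```

For the optional artistic touches, note that wordcloud accepts a colors argument, e.g. colors = brewer.pal(8, "Dark2") from the RColorBrewer package.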
- For problem 2, you will want to return to the Titanic-survivor dataset of
a few weeks ago. It is linked here in train742.csv.
For this week's assignment, however, you should use R's randomForest package
to create a forest-based
predictive model in pr2.R for whether or not a passenger
would have survived the sinking of the Titanic. Your pr2.R script
should do the following:
- first, remove from the dataset any columns that are not useful or
usable -- for example,
the randomForest package will not handle categorical variables with more than
32 distinct levels, such as passengers' names.
- change the data formatting so that it's as useful as possible, e.g.,
making sure the survived column is treated as a factor variable!
Other changes that are helpful to you are certainly welcome!
- it should impute values for the NA entries, using
the rfImpute function, as we did with the iris data in class
- finally, it should create a random forest based on
the Titanic data; then (either together or in a separate script) you
should run that forest against the original data and report what percentage
of the passengers it classifies correctly.
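The steps above can be sketched as a single script. This is only an outline: the column names (survived, name) are assumptions about what train742.csv contains, so adjust them to match your copy of the data:

```r
library(randomForest)

# read the Titanic training data (assumed filename from the assignment)
titanic <- read.csv("train742.csv")

# drop unusable columns, e.g., high-cardinality text fields:
# randomForest cannot handle factors with more than 32 levels
titanic$name <- NULL    # assumed column name

# make the outcome a factor so the forest does classification,
# not regression
titanic$survived <- as.factor(titanic$survived)

# impute NA entries (e.g., missing ages) with rfImpute,
# as with the iris data in class
titanic_imputed <- rfImpute(survived ~ ., data = titanic)

# build the random forest
fit <- randomForest(survived ~ ., data = titanic_imputed)

# run the forest back against the data and report the percentage
# of passengers classified correctly
preds    <- predict(fit, titanic_imputed)
accuracy <- mean(preds == titanic_imputed$survived)
cat(sprintf("Correctly classified: %.1f%%\n", 100 * accuracy))
```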