(will be linked!)
Here are the in-class examples,
to make it easy to follow along.
In particular, after each comment is a block of examples that you can
copy and paste into R's console in order to run them. You
may need to install packages as they appear, but this hasn't caused a
problem so far.
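If a package does turn out to be missing, a small helper along these lines can install it only when needed. This is just a sketch: ensure_package is a made-up name, not something from the class examples, and "stats" is used here only because it ships with R.

```r
# Hypothetical helper (not from the class examples): install a package
# only if it is not already available, then report whether it can be loaded.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" comes with R, so this call installs nothing.
ensure_package("stats")
```

Swap in the name of whichever package an example asks for.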
As usual, for this assignment, please submit a zipped folder named hw4.zip
that contains a few files. The first, pr1.txt, should include
your interactive session (or history)
working through Chapter 11 of the Data Science book
(Poisson-distribution modeling for Twitter).
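As a rough illustration of the kind of Poisson comparison involved, here is a minimal sketch. It uses simulated delays between tweets rather than real Twitter data, and the names delays_a, delays_b and count_within are made up for this example; the chapter's own functions may well look different.

```r
set.seed(11)

# Simulated stand-ins for delays (in seconds) between consecutive tweets.
# These are NOT real Twitter data; both hashtags are hypothetical.
delays_a <- rexp(2000, rate = 1 / 30)  # hashtag A: mean delay about 30 s
delays_b <- rexp(2000, rate = 1 / 45)  # hashtag B: mean delay about 45 s

# Count how many tweets arrived within `threshold` seconds of the previous one.
count_within <- function(delays, threshold = 31) {
  sum(delays < threshold)
}

hits_a <- count_within(delays_a)
hits_b <- count_within(delays_b)

# poisson.test() with two counts and two time bases compares the two rates.
result <- poisson.test(c(hits_a, hits_b),
                       c(length(delays_a), length(delays_b)))

print(result$estimate)  # estimated ratio of the two rates
print(result$conf.int)  # confidence interval for that ratio
```

A confidence interval that excludes 1 suggests that one hashtag's tweets really do arrive faster than the other's.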
Here is a full list:
- pr1.txt should be your history (or console)
interactions for the book's Chapter 11. This is a continuation of the
modeling of Twitter feeds, comparing hashtag popularity with Poisson
distributions. Also include pr1.R, the R script that contains
your functions from Chapter 11. You may want to start with your
solution from Chapter 10, since it builds from there.
- pr2.docx (or .pdf or any other ordinary document format). This
file should contain your exploratory graphs for the LendingClub data set. Here,
all of the columns that should be numeric have been converted already so that
this version is easier to use:
In addition, the week4.zip archive (which includes the
slides) now also contains this updated dataset, an example of the data
exploration, and an example of the data modeling/analysis. Those examples
are also linked below.
RStudio makes it easy to export graphics/plots as images. From there, you
can import them into any (readable) type of document you choose.
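If you'd rather save a plot from a script than from RStudio's export menu, something like the following sketch works. The file name and the simulated rates are placeholders, and the fallback to PDF is only there in case your R build lacks PNG support.

```r
set.seed(1)
rates <- rnorm(200, mean = 12, sd = 3)  # simulated values, not the loan data

# Use PNG when this R build supports it; otherwise fall back to PDF.
use_png <- capabilities("png")
out_file <- file.path(tempdir(),
                      if (use_png) "rate_histogram.png" else "rate_histogram.pdf")

if (use_png) {
  png(out_file, width = 800, height = 600)
} else {
  pdf(out_file)
}
hist(rates, main = "Example histogram", xlab = "Simulated interest rate (%)")
dev.off()  # closing the device is what actually writes the file

print(out_file)
```

Replace tempdir() with your homework folder to keep the image.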
You should use this week's lecture as a guide, but also use this
as a chance to determine which variables affect the interest rate for the 2400 loans
to the greatest extent.
All of your graphs should have a short (one or two sentence) caption that
indicates what the graph shows and what take-home message you deduce from the
plot. Negative take-home messages are completely OK (and by far the more common
type!). So, if the graph indicates that two variables are not correlated,
by all means say so.
Be sure to include at least 8 graphs - more would be OK, but are not needed.
You may want to try several and then choose the most interesting (in
which case you will have fewer negative results to show off...)
Within your group of graphs, be sure to have at least one of each of these:
- A comparative boxplot (with more than one plot)
- A barplot for one or more factor variables
- A histogram
- A layered density plot (with more than one density)
- A scatter plot using color
- A scatter plot with an overlaid line or other graphics
The other types of plots are optional, but certainly welcome!
Especially colorful or unusual graphs can certainly garner additional points.
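For reference, here is one way each of the six required plot types could be drawn with base R graphics. This is only a sketch on simulated data: the data frame loans and its columns (fico, term, purpose, rate) are made-up stand-ins, so substitute the real column names from the LendingClub file. Other plotting packages would do just as well.

```r
# Simulated stand-in for the loan data; all column names are hypothetical.
set.seed(4)
n <- 300
loans <- data.frame(
  fico    = round(runif(n, 640, 820)),
  term    = factor(sample(c("36 months", "60 months"), n, replace = TRUE)),
  purpose = factor(sample(c("car", "credit_card", "debt_consolidation"),
                          n, replace = TRUE))
)
loans$rate <- 30 - 0.025 * loans$fico +
  ifelse(loans$term == "60 months", 3, 0) + rnorm(n, sd = 1.5)

# All six plots go to one temporary PDF (one page each).
# Drop the pdf() and dev.off() lines to draw on screen instead.
plot_file <- file.path(tempdir(), "plot_types.pdf")
pdf(plot_file)

# 1. Comparative boxplot: one box per loan term
boxplot(rate ~ term, data = loans,
        main = "Rate by term", ylab = "Interest rate (%)")

# 2. Barplot for a factor variable
barplot(table(loans$purpose), main = "Loans by purpose", ylab = "Count")

# 3. Histogram
hist(loans$rate, main = "Interest rates", xlab = "Interest rate (%)")

# 4. Layered density plot: one density curve per term
d36 <- density(loans$rate[loans$term == "36 months"])
d60 <- density(loans$rate[loans$term == "60 months"])
plot(d36, xlim = range(d36$x, d60$x), ylim = c(0, max(d36$y, d60$y)),
     main = "Rate density by term", xlab = "Interest rate (%)")
lines(d60, col = "red")
legend("topright", legend = levels(loans$term),
       col = c("black", "red"), lty = 1)

# 5. Scatter plot using color: points colored by term
term_colors <- c("steelblue", "darkorange")
plot(loans$fico, loans$rate,
     col = term_colors[as.integer(loans$term)], pch = 19,
     main = "Rate vs. FICO, colored by term",
     xlab = "FICO score", ylab = "Interest rate (%)")

# 6. Scatter plot with an overlaid least-squares line
plot(rate ~ fico, data = loans,
     main = "Rate vs. FICO with fitted line",
     xlab = "FICO score", ylab = "Interest rate (%)")
abline(lm(rate ~ fico, data = loans), col = "blue", lwd = 2)

dev.off()
```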
Here is an example write-up of a plot in case you'd like to use it as a template:
Finally, include an R script pr3.R that uses both regression with
factor variables and ordinary (covariate) linear regression in order to
create a predictive model of the interest rate based on the other
characteristics of each loan. You should create 3 models (they
can be in separate scripts or all together):
- at least one model based on a factor variable (similar to our
class analysis of movie score vs. MPAA rating)
- at least one model based on a continuous numeric variable (a "covariate"),
similar to the linear models we've examined so far -- and to the example below
- and, one model of your choice -- in particular, a model based on what
you feel is the single most-important variable (column) in the dataset. This can
be a numeric or factor variable.
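To make the difference between the first two kinds of model concrete, here is a minimal sketch using lm on simulated data. The columns fico, term and rate are hypothetical stand-ins for whatever the real dataset calls them, and the simulated effect sizes are invented purely for illustration.

```r
# Simulated stand-in data; column names and effect sizes are made up.
set.seed(3)
n <- 200
loans <- data.frame(
  fico = round(runif(n, 640, 820)),
  term = factor(sample(c("36 months", "60 months"), n, replace = TRUE))
)
loans$rate <- 30 - 0.025 * loans$fico +
  ifelse(loans$term == "60 months", 3, 0) + rnorm(n, sd = 1.5)

# Model based on a factor variable: one mean rate per level of term.
factor_model <- lm(rate ~ term, data = loans)

# Model based on a continuous numeric variable (a covariate).
covariate_model <- lm(rate ~ fico, data = loans)

print(coef(factor_model))     # intercept = baseline level; other entry = offset
print(coef(covariate_model))  # intercept and slope per FICO point

# R-squared gives one rough way to compare how much each variable explains.
print(summary(factor_model)$r.squared)
print(summary(covariate_model)$r.squared)
```

Comparing R-squared values across single-variable models is one simple way to argue for your choice of "most important" variable, though it is not the only one.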
In an MS Word document (or some other type),
include a description of which
variables you considered important and how you created your predictive model.
Be sure to describe how to run your model (using your own function or using
predict or some other way) so that we can test it on the
additional loan data!
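One possible way to package a model so that it is easy to run on new loans is a small wrapper around predict. This is a sketch only: predict_rate, the fico column, and the simulated training data are all assumptions made for the example.

```r
# Simulated training data; the fico column and its effect are hypothetical.
set.seed(5)
train <- data.frame(fico = round(runif(150, 640, 820)))
train$rate <- 30 - 0.025 * train$fico + rnorm(150, sd = 1.5)

model <- lm(rate ~ fico, data = train)

# Hypothetical wrapper: takes a data frame of new loans (with the same
# column names used in the model) and returns predicted interest rates.
predict_rate <- function(new_loans, fitted_model = model) {
  predict(fitted_model, newdata = new_loans)
}

# Usage: the new data need only the predictor columns, not the rate itself.
new_loans <- data.frame(fico = c(660, 720, 800))
predicted <- predict_rate(new_loans)
print(predicted)
```

Whatever form your function takes, the write-up should state which columns the new data must contain.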
Here are two examples of a covariate linear model of interest rate against
loan number, which is not the most important variable in the dataset, unsurprisingly!