Claremont Graduate University
Data Science Programming (IST 380)
Fall 2013

Hw #4: Exploratory Data Analysis


HW 4 ~ due Tuesday, Mar. 5, 2013


(will be linked!) Here are the in-class examples, to make it easy to follow along.

In particular, each comment is followed by a block of examples that you can copy and paste into R's console to run. You may need to install packages as they appear, but this hasn't caused a problem so far.

As usual, for this assignment, please submit a zipped folder that contains a few files. The first, pr1.txt, includes your interactive session (or history) working through Chapter 11 of the Data Science book (Poisson-distribution modeling for Twitter).
Here is the full list:

  1. pr1.txt should be your history (or console) interactions for the book's Chapter 11. This is a continuation of the modeling of Twitter feeds, now comparing hashtag popularity using Poisson distributions. Also include pr1.R, the R script that contains your functions from Chapter 11. You may want to start with your solution from Chapter 10, since Chapter 11 builds on it.
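
    As a rough illustration of the kind of comparison this involves, here is a minimal sketch that uses simulated arrival delays in place of real Twitter data. The names delays_a, delays_b, and count_fast, the 30-second threshold, and the two arrival rates are all assumptions made up for this example; they are not taken from the book's code.

```r
# Simulated stand-ins for tweet arrival delays (in seconds); NOT real Twitter data.
set.seed(380)
n_tweets <- 500
delays_a <- rexp(n_tweets, rate = 1 / 30)  # hypothetical hashtag A: mean delay 30 s
delays_b <- rexp(n_tweets, rate = 1 / 60)  # hypothetical hashtag B: mean delay 60 s

# Count how many tweets arrived within `threshold` seconds of the previous one.
count_fast <- function(delays, threshold = 30) {
  sum(delays <= threshold)
}

fast_a <- count_fast(delays_a)
fast_b <- count_fast(delays_b)

# Confidence interval for each rate of "fast" arrivals (events per tweet observed).
ci_a <- poisson.test(fast_a, n_tweets)$conf.int
ci_b <- poisson.test(fast_b, n_tweets)$conf.int

# Two-sample comparison: does the ratio of the two rates differ from 1?
cmp <- poisson.test(c(fast_a, fast_b), c(n_tweets, n_tweets))

print(ci_a)
print(ci_b)
print(cmp$p.value)
```

    If the two confidence intervals do not overlap, that is some evidence that one hashtag really is "busier" than the other, which is the sort of statement your pr1.txt session should be able to support.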

  2. pr2.docx (or .pdf or any other ordinary document format): this file should contain your exploratory graphs for the LendingClub data set. Here, all of the columns that should be numeric have been converted already so that this version is easier to use:


    In addition, the archive including the slides also now contains this updated dataset and an example of the data exploration and an example of the data modeling/analysis. Those examples are also linked below.

    RStudio makes it easy to export graphics/plots as images. From there, you can import them into any (readable) type of document you choose.

    You should use this week's lecture as a guide, but also use this as a chance to determine which variables affect the interest rate of the 2400 loans most strongly.

    All of your graphs should have a short (one- or two-sentence) caption that indicates what the graph shows and what take-home message you deduce from the plot. Negative take-home messages are completely OK (and by far the more common type!). So, if the graph indicates that two variables are not correlated, by all means say so.

    Be sure to include at least 8 graphs. More would be OK, but are not needed. You may want to try several and then choose the most interesting (in which case you will have fewer negative results to show off). Within your group of graphs, be sure to have at least one of each of these types:
    • A comparative boxplot (with more than one plot)
    • A barplot for one or more factor variables
    • A histogram
    • A layered density plot (with more than one density)
    • A scatter plot using color
    • A scatter plot with an overlaid line or other graphics
    The other types of plots are optional, but certainly welcome! Especially colorful or unusual graphs can certainly garner additional points and/or awe!
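
    For orientation, here is one possible base-R sketch of the required plot types. It runs on a small synthetic data frame, not on the LendingClub file, and the column names (fico, purpose, term, rate) are assumptions invented for this example. The lecture examples may use a different plotting package.

```r
# Synthetic stand-in for the loan data; column names are hypothetical.
set.seed(380)
n <- 200
loans <- data.frame(
  fico    = round(runif(n, 640, 820)),
  purpose = factor(sample(c("car", "credit_card", "debt_consolidation"),
                          n, replace = TRUE)),
  term    = factor(sample(c("36 months", "60 months"), n, replace = TRUE))
)
loans$rate <- 40 - 0.04 * loans$fico +
  ifelse(loans$term == "60 months", 3, 0) + rnorm(n, sd = 1.5)

pdf(NULL)  # throwaway graphics device; delete this line to see the plots in RStudio

# Comparative boxplot: one box per loan term.
boxplot(rate ~ term, data = loans, main = "Rate by term", ylab = "Interest rate (%)")

# Barplot of a factor variable.
barplot(table(loans$purpose), main = "Loans by purpose")

# Histogram.
hist(loans$rate, main = "Interest rates", xlab = "Interest rate (%)")

# Layered density plot: one density curve per term.
d36 <- density(loans$rate[loans$term == "36 months"])
d60 <- density(loans$rate[loans$term == "60 months"])
plot(d36, xlim = range(d36$x, d60$x), ylim = c(0, max(d36$y, d60$y)),
     main = "Rate density by term", xlab = "Interest rate (%)")
lines(d60, col = "red")
legend("topright", legend = levels(loans$term), col = c("black", "red"), lty = 1)

# Scatter plot using color, with an overlaid regression line.
plot(loans$fico, loans$rate, col = as.integer(loans$term) + 1, pch = 19,
     xlab = "FICO score", ylab = "Interest rate (%)", main = "Rate vs. FICO")
fit <- lm(rate ~ fico, data = loans)
abline(fit, lwd = 2)

invisible(dev.off())
```

    Each call maps to one bullet above; the last plot covers both the color and the overlaid-line requirements, though separate plots are just as acceptable.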

    Here is an example write-up of a plot in case you'd like to use it as a template:   doc   pdf

  3. Finally, include an R script pr3.R that uses both regression with factor variables and ordinary (covariate) linear regression in order to create a predictive model of the interest rate based on the other characteristics of each loan. You should create 3 models (they can be in separate scripts or all together):
    • at least one model based on a factor variable (similar to our class analysis of movie score vs. MPAA rating)
    • at least one model based on a continuous numeric variable (a "covariate"), similar to the linear models we've examined so far -- and to the example below
    • and, one model of your choice -- in particular, a model based on what you feel is the single most-important variable (column) in the dataset. This can be a numeric or factor variable.
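
    A minimal sketch of how a factor model and a covariate model might be fit and compared, assuming a synthetic data frame with hypothetical columns term (factor), fico (numeric), and rate. The real dataset's column names will differ, and R-squared is only one of several reasonable ways to judge which variable matters most.

```r
# Synthetic stand-in for the loan data; column names are hypothetical.
set.seed(380)
n <- 200
loans <- data.frame(
  fico = round(runif(n, 640, 820)),
  term = factor(sample(c("36 months", "60 months"), n, replace = TRUE))
)
loans$rate <- 40 - 0.04 * loans$fico +
  ifelse(loans$term == "60 months", 3, 0) + rnorm(n, sd = 1.5)

# Model based on a factor variable: one mean rate per level of `term`.
model_factor <- lm(rate ~ term, data = loans)

# Model based on a continuous covariate: a straight line in `fico`.
model_covar <- lm(rate ~ fico, data = loans)

# Compare how much variance each single-variable model explains.
r2 <- c(term = summary(model_factor)$r.squared,
        fico = summary(model_covar)$r.squared)
print(round(r2, 3))

# One candidate for the "single most important variable" model.
best <- names(which.max(r2))
print(best)
```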

    In an MS Word document (or some other type), include a description of which variables you considered important and how you created your predictive model. Be sure to describe how to run your model (using your own function, using predict, or some other way) so that we can test it on the additional loan data!
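
    One way to make a model easy to run on new data is to wrap predict in a small function. The sketch below assumes a synthetic training set and a hypothetical numeric column named fico; predict_rate is a made-up name, not something the assignment requires.

```r
# Synthetic training data; the `fico` and `rate` columns are hypothetical.
set.seed(380)
n <- 200
train <- data.frame(fico = round(runif(n, 640, 820)))
train$rate <- 40 - 0.04 * train$fico + rnorm(n, sd = 1.5)

model <- lm(rate ~ fico, data = train)

# Return predicted interest rates for a data frame of new loans.
# `new_loans` must contain the same predictor column(s) the model was fit on.
predict_rate <- function(new_loans, fitted_model = model) {
  stopifnot("fico" %in% names(new_loans))
  predict(fitted_model, newdata = new_loans)
}

# Usage: three made-up loans that were not part of the training data.
new_loans <- data.frame(fico = c(660, 720, 800))
preds <- predict_rate(new_loans)
print(preds)
```

    In your write-up, a single line such as "source pr3.R, then call predict_rate on a data frame of loans" would be enough to tell us how to test the model.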

    Here are two examples of a covariate linear model of interest rate against loan number, which is not the most important variable in the dataset, unsurprisingly!   doc   pdf