Claremont Graduate University
Data Science Programming (IST 380)
Spring 2013

Hw #3: Linear regression

HW 3 ~ due Tuesday, Feb. 19, 2013


(Now linked!) Here are the regression examples from the slides and from class.

In particular, after each comment is a block of examples -- you'll be able to copy-and-paste those into R's console in order to run them. (You'll need to have the UsingR package installed.)

As usual, for this assignment, please submit a zipped folder that contains a few files: one named pr1.txt, which includes your interactive session working through chapter 10 of the Data Science book (Twitter!). The other files:

  • pr1.R should be your R script that contains the functions from chapter 10 - plus the chapter-challenge changes to TweetFrame.
  • pr2.R should be your R script that contains the envelope-challenge questions described below (with a comment about what happens when you run it many times...)
  • pr3.R should be your R script containing any functions you write for the temperature-regression problem. A comment in that file should also include a short, written summary of what you find for that problem.

  1. 30 pts.
    For problem 1, create the file pr1.txt with the history from your interactive session(s) working through Chapter 10 of the Data Science book. (Copying-and-pasting seems to be the only thing that works for me, so feel free to do that.)

    In addition, create the file pr1.R that includes the functions you write as you work through that chapter: the text will present a number of them.

    Finally, complete the chapter challenge, which asks you to improve the chapter's TweetFrame function so that it returns a data frame with its rows of data in order of tweet-creation time. In addition, make sure that as your TweetFrame function runs, it creates a histogram of the time-differences, as shown in the chapter.

    Optional: For up to +5 points of extra credit, enable your TweetFrame function to create both a histogram of time-differences and a plot of the percentage of the time-differences that are less than a particular time, side-by-side. The percentage plot is shown on page 87 (and the function to create the data appears at the end of the chapter).
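
    The ordering and side-by-side plotting ideas can be sketched on synthetic timestamps rather than real tweets. This is only an illustration under stated assumptions: the data frame toy, its column created, and the helper frac_below are hypothetical names, not the chapter's code.

```r
# Synthetic stand-in for tweet data: 50 arrival times in shuffled order.
set.seed(380)
toy <- data.frame(created = as.POSIXct("2013-02-01 12:00:00", tz = "UTC") +
                            sample(cumsum(rexp(50, rate = 1 / 30))))

# Put the rows in order of creation time.
toy <- toy[order(toy$created), , drop = FALSE]

# Time differences (in seconds) between consecutive rows.
gaps <- as.numeric(diff(toy$created), units = "secs")

# Hypothetical helper: fraction of gaps shorter than each threshold.
frac_below <- function(gaps, thresholds) {
  sapply(thresholds, function(t) mean(gaps < t))
}

thresholds <- seq(0, 120, by = 10)

# Two panels side by side: histogram on the left, percentage plot on the right.
old_par <- par(mfrow = c(1, 2))
hist(gaps, main = "Time differences", xlab = "seconds")
plot(thresholds, 100 * frac_below(gaps, thresholds), type = "b",
     main = "Gaps below threshold", xlab = "seconds", ylab = "percent")
par(old_par)
```

    Here par(mfrow = c(1, 2)) splits the plotting window into one row and two columns, and restoring the saved settings afterwards keeps later plots unaffected.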

  2. 30 pts.
    For problem 2, create the file pr2.R, an R script that will contain three functions:
    • First, write the function ME_once, which plays a single "round" of the mystery-envelope challenge from class. Here is the first line showing the inputs to ME_once:
            ME_once <- function( amount_found=1.0, sors="switch", verbose=TRUE)
      Here, the amount_found should be the amount found when opening the first envelope; the sors should be "stay" (with the original envelope) or "switch" to the other one; verbose indicates whether any printing and dialog occurs. Notice that if verbose is FALSE, then the function will have to use the provided value of sors - otherwise, when verbose is TRUE, it should ask for one or the other value.

    • The second function should be ME_ntimes, with signature
            ME_ntimes <- function( n=100 )
      where ME_ntimes should run the ME_once function n times and return the mean of all of the amounts won. For each run of ME_once, use the default value of amount_found (1.0). In addition, have the function ME_ntimes create a histogram of all of the results of the n runs of ME_once before returning the mean of all those results. In this case, you should see that the histogram is far from a bell curve (about as far as possible)!

    • The final function should be sample_ME, with signature
            sample_ME <- function( run_me=100 )
      Here, sample_ME should run ME_ntimes itself for a total of run_me times! Again, it should plot a histogram of all of those results from ME_ntimes before returning the mean of those results. In this case, you should see the histogram approximating a normal ("bell") curve around the expected value.
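
    The run-once, run-n-times, sample-the-means structure described above can be sketched as follows. The rules of the game are not restated in this handout, so the sketch assumes the other envelope holds either double or half the amount found, each with probability 1/2. The names toy_round and toy_ntimes are hypothetical, and the interactive (verbose) prompt is left out.

```r
# Assumption (not stated in this handout): the other envelope holds either
# double or half the amount found, each with probability 1/2.
toy_round <- function(amount_found = 1.0, sors = "switch") {
  if (sors == "stay") return(amount_found)
  amount_found * sample(c(0.5, 2), size = 1)
}

# Run the round n times, plot the individual results, return their mean.
toy_ntimes <- function(n = 100) {
  results <- replicate(n, toy_round())
  hist(results, main = "Individual rounds")
  mean(results)
}

# Repeat the whole experiment and look at the distribution of the means.
set.seed(380)
means <- replicate(100, toy_ntimes(100))
hist(means, main = "Means of 100 rounds each")
mean(means)
```

    Under that assumption a switched round is worth 0.5 * 0.5 + 0.5 * 2 = 1.25 on average, so the histogram of means should cluster near 1.25 even though each individual round only ever yields 0.5 or 2.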

    As with the Monty Hall problem, you'll notice that switching in this case is good! However, if switching once is good, then -- since the game is symmetric -- switching twice should be even better! In a comment of a couple of sentences (either using # signs or simply a quoted string), describe why switching twice (back to the original envelope) does not yield an even better expected value than switching once.

  3. For this problem, you'll hand in both your console interactions (in a file named pr3.txt) and your functions in an R script named pr3.R. In that script, you should create one or more R functions - of your own design - to analyze the temperature data in this file of global temperature deviations. Note that the units here are 0.01 degrees Celsius and that the data are deviations from the average taken from 1950-1980, which was 14 degrees C (or about 57.2F). There are monthly deviations, as well as seasonal ones. The yearly deviations appear in the column J-D, which will be named $J.D when you bring it into R. If you would like to model absolute temperatures, rather than deviations, divide that $J.D column by 100 (to convert it to degrees C) and then add 14. However, modeling the deviations is equivalent and just as informative.
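
    Since the units are 0.01 degrees Celsius, turning a deviation into an absolute temperature takes a division by 100 before the 14-degree baseline is added. A minimal sketch, using a toy data frame whose values are made-up placeholders rather than real measurements:

```r
# R's default name clean-up is why the column "J-D" shows up as "J.D".
make.names("J-D")

# Made-up placeholder values, NOT real measurements; J.D is in 0.01 degrees C.
toy <- data.frame(Year = c(2001, 2002, 2003), J.D = c(-20, 0, 35))

# Convert hundredths of a degree to degrees, then shift by the 14 C baseline.
toy$dev_C <- toy$J.D / 100
toy$abs_C <- 14 + toy$dev_C

# The same absolute temperature in Fahrenheit.
toy$abs_F <- toy$abs_C * 9 / 5 + 32
print(toy)
```

    A deviation of 0 maps to 14 degrees C, which is 57.2 degrees F, matching the baseline quoted above.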

    Your goal is to use R's linear regression capabilities, particularly lm, to build a model of average global temperatures. From that model, you should then predict the average global temperature for 2012 and 2013, along with a 95% confidence interval for each prediction. The nice thing is that we know the average global temperature for 2012 (not yet, however, for 2013). As a result, we'll be able to check at least the first of your two predictions (next week).
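
    One way to get a fitted line and an interval around a prediction is lm followed by predict. The sketch below runs on synthetic data generated from a known trend plus noise, not on the real temperature record; the names toy, dev, and new_years are hypothetical.

```r
# Synthetic data (NOT the real temperature record): a known trend plus noise.
set.seed(380)
toy <- data.frame(Year = 1881:2011)
toy$dev <- -30 + 0.6 * (toy$Year - 1881) + rnorm(nrow(toy), sd = 10)

# Fit deviation as a linear function of year.
fit <- lm(dev ~ Year, data = toy)
coef(fit)

# Predict for new years. interval = "prediction" covers a single new
# observation; interval = "confidence" covers the mean of the fitted line.
new_years <- data.frame(Year = c(2012, 2013))
pred <- predict(fit, newdata = new_years, interval = "prediction", level = 0.95)
pred
```

    The result has one row per new year and the columns fit, lwr, and upr. Which of the two interval types is the right one to report is a judgment call worth explaining in your written summary.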

    In your folder, please include
    • A plot of the annual temperatures (or deviations) from 1881 to 2011, along with a linear fit using lm
    • A plot of the residuals from that linear fit.
    • Your model's prediction of the global average temperature in 2012.
    • The 95% confidence interval for that prediction.
    • Finally, find the slopes of the linear models for each of the 12 months of the year of deviations (the first 12 columns of the data after the years themselves). Which month is warming the most slowly? Which month is warming the fastest? You should build at least one R function in your pr3.R file that helps you with this task of comparing the months. You should include your answers (which month is warming most quickly/least quickly) in a comment in that file.
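
    The plots and the month-by-month comparison can be sketched on synthetic data with known slopes. Everything here is an assumption for illustration: the helper column_slopes is hypothetical, the column names Jan, Feb, and Mar may not match the names in the real file, and the numbers are generated, not measured.

```r
# Hypothetical helper: slope of a linear fit of each named column against Year.
column_slopes <- function(df, cols, year_col = "Year") {
  sapply(cols, function(col) {
    fit <- lm(df[[col]] ~ df[[year_col]])
    unname(coef(fit)[2])
  })
}

# Synthetic data with known slopes (NOT real measurements).
set.seed(380)
years <- 1881:2011
toy <- data.frame(Year = years,
                  Jan = 0.9 * (years - 1881) + rnorm(length(years), sd = 5),
                  Feb = 0.3 * (years - 1881) + rnorm(length(years), sd = 5),
                  Mar = 0.6 * (years - 1881) + rnorm(length(years), sd = 5))

slopes <- column_slopes(toy, c("Jan", "Feb", "Mar"))
slopes
names(which.max(slopes))   # column with the steepest fitted slope
names(which.min(slopes))   # column with the shallowest fitted slope

# Scatter plot with the fitted line, then the residuals from that fit.
fit <- lm(Jan ~ Year, data = toy)
plot(toy$Year, toy$Jan, xlab = "Year", ylab = "Jan (synthetic)")
abline(fit)
plot(toy$Year, resid(fit), xlab = "Year", ylab = "residual")
abline(h = 0, lty = 2)
```

    Returning a named vector of slopes makes the comparison a one-liner with which.max and which.min, and the same helper would work unchanged on all 12 month columns.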