IST 380 homework page

Claremont Graduate University
Data Science Programming (IST 380)
Spring 2013

Hw #3: Linear regression


HW 3 ~ due Tuesday, Feb. 19, 2013
(Now linked!) Here are the regression examples from the slides and from class. In particular, after each comment is a block of examples that you can copy and paste into R's console to run. (You'll need to have the `UsingR` package installed.)

As usual, for this assignment, please submit a zipped folder named `hw3.zip` that contains a few files:

- `pr1.txt` should include your interactive session working through chapter 10 of the Data Science book. (Twitter!)
- `pr1.R` should be your R script that contains the functions from chapter 10, plus the chapter-challenge changes to `TweetFrame`.
- `pr2.R` should be your R script that contains the envelope-challenge questions described below (with a comment about what happens when you run it many times...).
- `pr3.R` should be your R script containing any functions you write for the temperature-regression problem. A comment in that file should also include a short, written summary of what you find for that problem.

*30 pts.* For problem 1, create the file `pr1.txt` with the history from your interactive session(s) working through Chapter 10 of the Data Science book. (Copying and pasting seems to be the only thing that works for me, so feel free to do that.) In addition, create the file `pr1.R` that includes the functions you write as you work through that chapter: the text will present a number of them. Finally, complete the chapter challenge, which asks you to improve the chapter's `TweetFrame` function so that it returns a data frame with its rows of data in order of tweet-creation time. In addition, make sure that as your `TweetFrame` function runs, it creates a histogram of the time-differences, as shown in the chapter.

Optional: For up to +5 points of extra credit, enable your `TweetFrame` function to create both a histogram of time-differences and a plot of the percentage of the time-differences that are less than a particular time, side by side.
The percentage plot is shown on page 87 (and the function to create the data appears at the end of the chapter).

*30 pts.* For problem 2, create the file `pr2.R`, an R script that will contain three functions.

First, write the function `ME_once`, which plays a single "round" of the mystery-envelope challenge from class. Here is the first line, showing the inputs to `ME_once`: `ME_once <- function( amount_found=1.0, sors="switch", verbose=TRUE)`. Here, `amount_found` is the amount found when opening the first envelope; `sors` is either `"stay"` (with the original envelope) or `"switch"` (to the other one); and `verbose` indicates whether any printing and dialog occurs. Notice that if `verbose` is `FALSE`, the function must use the provided value of `sors`; when `verbose` is `TRUE`, it should instead ask for one or the other value.

The second function should be `ME_ntimes`, with signature `ME_ntimes <- function( n=100 )`. It should run the `ME_once` function `n` times and return the mean of all of the amounts won. For each run of `ME_once`, use the default value of `amount_found` (`1.0`). In addition, have `ME_ntimes` create a histogram of all of the results of the `n` runs of `ME_once` before returning the mean of those results. In this case, you should see that the histogram is far from a bell curve (about as far as possible)!

The final function should be `sample_ME`, with signature `sample_ME <- function( run_me=100 )`. Here, `sample_ME` should run `ME_ntimes` itself a total of `run_me` times! Again, it should plot a histogram of all of those results from `ME_ntimes` before returning the mean of those results. In this case, you should see the histogram approximating a normal ("bell") curve around the expected value.

As with the Monty Hall problem, you'll notice that switching in this case is good! However, if switching once is good, then -- since the game is symmetric -- switching twice should be even better!
In a comment of a couple of sentences (either using `#` signs or simply a quoted string), describe why switching twice (back to the original envelope) does not yield an even better expected value than switching once.

For problem 3, you'll hand in both your console interactions (in a file named `pr3.txt`) and your functions in an R script named `pr3.R`. In that script, you should create one or more R functions, of your own design, to analyze the temperature data in this file of global temperature deviations.

Note that the units here are 0.01 degrees Celsius and that the data are deviations from the average taken from 1950-1980, which was 14 degrees C (or about 57.2F). There are monthly deviations, as well as seasonal ones. The yearly deviations appear in the column J-D, which will be named `$J.D` when you bring it into R. If you would like to model absolute temperatures rather than deviations, divide that `$J.D` column by 100 (to convert from units of 0.01 degrees to degrees Celsius) and add 14. However, modeling the deviations is equivalent and just as informative.

Your goal is to use R's linear regression capabilities, particularly `lm`, to build a model of average global temperatures. From that model, you should then predict the average global temperature for 2012 and 2013, along with a 95% confidence interval for your prediction. The nice thing is that we know the average global temperature for 2012 (not yet, however, for 2013). As a result, we'll be able to check at least the first of your two predictions (next week).

In your folder, please include:

- A plot of the annual temperatures (or deviations) from 1881 to 2011, along with a linear fit using `lm`.
- A plot of the residuals from that linear fit.
- Your model's prediction of the global average temperature in 2012.
- The 95% confidence interval for that prediction.

Finally, find the slopes of the linear models for each of the 12 months of the year of deviations (the first 12 columns of the data after the years themselves). Which month is warming the most slowly?
Which month is warming the fastest? You should build at least one R function in your `pr3.R` file that helps you compare the months, and you should include your answers (which month is warming most quickly and which least quickly) in a comment in that file.
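
The regression steps for problem 3 can be sketched on made-up data. What follows is only a minimal illustration, not a solution: the data frame `fake`, its columns, and the helper `column_slopes` are hypothetical names, and the numbers are simulated rather than read from the temperature file. It shows `lm` fitting a line, `predict` returning a 95% interval for new years, and `sapply` collecting one slope per column.

```r
# Simulated data only: made-up yearly series with a linear trend plus noise.
set.seed(380)
fake <- data.frame(Year = 1881:2011)
fake$Jan <- 0.5 * (fake$Year - 1881) + rnorm(nrow(fake), sd = 10)
fake$Feb <- 0.8 * (fake$Year - 1881) + rnorm(nrow(fake), sd = 10)

# Fit a straight line: response ~ predictor.
fit <- lm(Jan ~ Year, data = fake)

# Plot the data with the fitted line, then the residuals against year.
plot(fake$Year, fake$Jan, xlab = "Year", ylab = "Simulated value")
abline(fit)
plot(fake$Year, residuals(fit), xlab = "Year", ylab = "Residual")

# Predict for years outside the data, with a 95% interval.
# interval = "prediction" covers a single new observation;
# interval = "confidence" would instead cover the mean of the fitted line.
new_years <- data.frame(Year = c(2012, 2013))
pred <- predict(fit, newdata = new_years, interval = "prediction", level = 0.95)
print(pred)   # columns: fit, lwr, upr

# Hypothetical helper: slope of each named column regressed on Year.
column_slopes <- function(df, cols) {
  sapply(cols, function(col) {
    coef(lm(df[[col]] ~ df$Year))[[2]]
  })
}
slopes <- column_slopes(fake, c("Jan", "Feb"))
print(slopes)
```

With the real data, the same pattern would apply after reading the file into a data frame; which of the two `interval` options is wanted depends on whether the interval should describe a single year's value or the fitted trend.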