Here are the regression examples
from the slides and from class.
In particular, after each comment is a block of examples -- you'll be able
to copy-and-paste those into R's console in order to run them. (You'll need
to have the UsingR package installed.)
As usual, for this assignment, please submit a zipped folder named hw3.zip
that contains a few files: one named pr1.txt that includes
your interactive session working through chapter 10 of the Data Science book.
The other files:
- pr1.R should be your R script that contains the functions from chapter 10, plus
the chapter-challenge changes to TweetFrame.
- pr2.R should be your R script that contains the envelope-challenge questions
described below (with a comment about what happens when you run it many times...)
- pr3.R should be your R script containing any functions you write for the
temperature-regression problem; a comment in that file should also include
a short, written summary of what you find for that problem.
For problem 1, create the file pr1.txt with the history from your interactive
session(s) working through Chapter 10 of the Data Science book. (Copying-and-pasting
seems to be the only thing that works for me, so feel free to do that.)
In addition, create the file pr1.R that includes the functions you write
as you work through that chapter: the text will present a number of them.
Finally, complete the chapter challenge, which asks you to improve the
chapter's TwitterFrame function so that it returns a data frame
with its rows of data in order of tweet-creation time. In addition, make sure
that as your TwitterFrame function runs, it
creates a histogram of the time-differences, as shown in the chapter.
For up to +5 points of extra credit, enable your TwitterFrame function
to create both a histogram of time-differences and a plot of the
percentage of the time-differences that are less than a particular time, side-by-side.
The percentage plot is shown on page 87 (and the function to create the data
appears at the end of the chapter).
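As a rough illustration of the mechanics involved (ordering rows by time, taking time differences, and placing two plots side by side), here is a minimal sketch. It does not call Twitter at all: the `tweets` data frame is a made-up stand-in, and the column name `created` is an assumption about what your own data frame contains.

```r
# Hypothetical stand-in for a data frame of tweets. Only a `created`
# column of POSIXct timestamps is assumed; the values are random.
set.seed(1)
tweets <- data.frame(
  text    = paste("tweet", 1:50),
  created = as.POSIXct("2013-01-01 00:00:00", tz = "UTC") + sample(0:3600, 50)
)

# Reorder the rows by creation time (earliest first).
sorted <- tweets[order(tweets$created), ]

# Time differences, in seconds, between consecutive tweets.
gaps <- as.numeric(diff(sorted$created), units = "secs")

# Two plots side by side: histogram on the left, and on the right the
# percentage of gaps that fall below each threshold.
old_par <- par(mfrow = c(1, 2))
hist(gaps, main = "Time between tweets", xlab = "Seconds")
thresholds <- seq(0, max(gaps), length.out = 50)
pct_below  <- sapply(thresholds, function(t) mean(gaps < t) * 100)
plot(thresholds, pct_below, type = "l",
     xlab = "Seconds", ylab = "% of gaps below threshold")
par(old_par)  # restore the previous plotting layout
```

The `par(mfrow = c(1, 2))` call is what arranges the two plots in one row; restoring the old settings afterwards keeps later plots from being squeezed into the same layout.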
For problem 2, create the file pr2.R, an R script that
will contain three functions:
- First, write the function
ME_once, which plays a single "round" of the
mystery-envelope challenge from class. Here is
the first line showing the inputs to ME_once:
    ME_once <- function( amount_found=1.0, sors="switch", verbose=TRUE )
Here, the amount_found should be the amount
found when opening the first envelope; the sors should be
"stay" (with the original envelope) or "switch" to the
other one; verbose indicates whether any printing and dialog
occurs. Notice that if verbose is FALSE, then the
function will have to use the provided value of sors; otherwise,
when verbose is TRUE, it should ask for one or the other value.
- The second function should be ME_ntimes, with signature
    ME_ntimes <- function( n=100 )
where ME_ntimes should run the ME_once function n
times and return the mean of all of the amounts won. For each run of
ME_once, use the default value of amount_found (1.0).
In addition, have the function ME_ntimes create a histogram of all of
the results of the n runs of ME_once before returning the mean of
all those results. In this case, you should see that the histogram is far
from a bell curve (about as far as possible)!
- The final function should be sample_ME, with signature
    sample_ME <- function( run_me=100 )
Here, sample_ME should run ME_ntimes itself for a total of
run_me times! Again, it should plot a histogram of all of those
results from ME_ntimes before returning the mean of those results.
In this case, you should see the histogram approximating a normal ("bell")
curve around the expected value.
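The repeat-plot-average pattern that the three functions share can be sketched as follows. This is only an illustration with hypothetical names (`toy_round`, `toy_ntimes`, `toy_sample`), and the rule inside `toy_round` is an assumption: it supposes that the other envelope holds half or double the amount found, each with probability 0.5. Use the rules given in class for your own ME_once.

```r
# Hypothetical stand-in for a single round (no prompting, no printing).
# ASSUMPTION: the other envelope holds half or double the amount found,
# each with probability 0.5. The actual rules are those given in class.
toy_round <- function(amount_found = 1.0, sors = "switch") {
  if (sors == "stay") return(amount_found)
  amount_found * sample(c(0.5, 2), size = 1)
}

# Run the round n times, plot the raw results, and return their mean.
toy_ntimes <- function(n = 100) {
  results <- replicate(n, toy_round())
  hist(results, main = "Results of individual rounds")
  mean(results)
}

# Run toy_ntimes run_me times, plot the means, and return their mean.
# Averages of many rounds tend toward a bell shape, even though the
# individual results take only two distinct values.
toy_sample <- function(run_me = 100) {
  means <- replicate(run_me, toy_ntimes())
  hist(means, main = "Means of repeated runs")
  mean(means)
}

set.seed(42)
overall <- toy_sample(200)
```

Here `replicate` simply evaluates its second argument the requested number of times and collects the results in a vector, which is then handed to `hist` and `mean`.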
As with the Monty Hall problem, you'll notice that switching in this case is good!
However, if switching once is good, then -- since the game is symmetric --
switching twice should be even better!
In a comment of a couple of sentences
(either using # signs or simply a quoted string), describe
why switching twice (back to the original envelope) does not
yield an even better expected value than switching once.
For problem 3, you'll hand in both your console
interactions (in a file named pr3.txt) and
your functions in an R script named pr3.R.
In that script, you should create one or more R functions - of your
own design - to analyze the
temperature data in this file of global
temperature deviations. Note that the units here are 0.01 degrees
Celsius and that the data are deviations from the average
taken from 1950-1980, which was 14 degrees C (or about 57.2F).
There are monthly deviations, as well as seasonal ones. The yearly
deviations appear in the column J-D, which will be named $J.D
when you bring it into R.
If you would like to model absolute temperatures,
rather than deviations, divide that $J.D column by 100 (to convert it to
degrees Celsius) and then add 14.
However, modeling the deviations is equivalent and just as informative.
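To see how the J-D header turns into $J.D, and how the unit conversion works, here is a small sketch. The inline table is a made-up placeholder in a similar layout, not the real data file, and the column name `abs_C` is a hypothetical choice. Because the deviations are stated to be in units of 0.01 degrees Celsius, the sketch divides by 100 before adding the 14-degree baseline.

```r
# Made-up placeholder values in a similar layout; NOT real measurements.
csv_text <- "Year,Jan,Feb,J-D
1881,-10,-14,-20
1882,9,8,-26"

# read.csv makes column names syntactically valid by default,
# so the header "J-D" is read in as "J.D".
temps <- read.csv(text = csv_text)
names(temps)

# Convert deviations (0.01 degrees C) to absolute temperatures (degrees C).
temps$abs_C <- temps$J.D / 100 + 14
temps
```

With a real file you would pass its path to `read.csv` instead of the `text` argument.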
Your goal is to use R's linear regression capabilities, particularly lm,
in order to build a model of average global temperatures. From that model, you should
then predict the average global temperature for 2012 and 2013, along with
a 95% confidence interval for your prediction. The nice thing
is that we know the average global temperature for 2012 (not yet, however,
for 2013). As a result, we'll be able to check at least that first of your
two predictions (next week).
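The fit-then-predict workflow described above can be sketched as follows. The data here are simulated (a straight line plus noise), not the real temperature record, so the numbers mean nothing; only the calls matter. The sketch asks `predict` for a prediction interval, which covers a single new observation; passing `interval = "confidence"` instead would give the narrower interval for the fitted mean, and which of the two you report is your call.

```r
# Simulated stand-in data (NOT the real record): linear trend plus noise.
set.seed(7)
years <- 1881:2011
dev   <- 0.6 * (years - 1881) - 40 + rnorm(length(years), sd = 10)
temps <- data.frame(Year = years, J.D = dev)

# Fit a straight line: deviation as a function of year.
fit <- lm(J.D ~ Year, data = temps)

# Scatter plot with the fitted line, then the residuals against year.
plot(temps$Year, temps$J.D, xlab = "Year", ylab = "Deviation (0.01 C)")
abline(fit)
plot(temps$Year, resid(fit), xlab = "Year", ylab = "Residual")
abline(h = 0, lty = 2)

# Predict 2012 and 2013 with a 95% interval.
new_years <- data.frame(Year = c(2012, 2013))
pred <- predict(fit, newdata = new_years,
                interval = "prediction", level = 0.95)
pred  # matrix with columns fit, lwr, upr
```

Note that the column name in `newdata` must match the predictor name used in the model formula (`Year` here), or `predict` will not pick up the new values.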
In your folder, please include:
- A plot of the annual temperatures (or deviations) from 1881 to 2011,
along with a linear fit using lm.
- A plot of the residuals from that linear fit.
- Your model's prediction of the global average temperature in 2012.
- The 95% confidence interval for that prediction.
- Finally, find the slopes of the linear models for each of the
12 months of the year of deviations (the first 12 columns of the data
after the years themselves). Which month is warming the most slowly?
Which month is warming the fastest? You should build at least one
R function in your pr3.R file that helps you with
this task of comparing the months. You should include your answers (which
month is warming most quickly/least quickly) in a comment in that file.
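One possible shape for a month-comparison helper is sketched below. The function name `month_slopes` is hypothetical, the data are simulated with known slopes so the result can be checked, and the sketch assumes the monthly columns are named Jan through Dec (R's built-in `month.abb`) with the years in a column named Year. Adjust the names to match what you see after reading the file.

```r
# Fit one linear model per month column and return the slope of each.
# ASSUMPTION: month columns are named as in month.abb ("Jan" ... "Dec").
month_slopes <- function(df, months = month.abb, year_col = "Year") {
  sapply(months, function(m) {
    fit <- lm(df[[m]] ~ df[[year_col]])
    unname(coef(fit)[2])  # second coefficient is the slope
  })
}

# Simulated data (NOT the real record) with a different known slope
# for each month, increasing from Jan to Dec.
set.seed(3)
years <- 1881:2011
toy <- data.frame(Year = years)
true_slopes <- seq(0.2, 1.3, length.out = 12)
for (i in 1:12) {
  toy[[month.abb[i]]] <- true_slopes[i] * (years - 1881) +
    rnorm(length(years), sd = 5)
}

slopes <- month_slopes(toy)
fastest <- names(which.max(slopes))  # month with the largest slope
slowest <- names(which.min(slopes))  # month with the smallest slope
```

Returning a named vector makes it easy to sort or plot the slopes, for example with `sort(slopes)` or `barplot(slopes)`.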