Here is a link to the example
functions from class. In particular, there is a Monty Hall example called
guesser_and_monty_hall.R, as well
as a starter file for the Titanic problem named pr3.R.
(The other files are some of the console interactions in class.)
For this assignment, submit a zipped folder named hw2.zip
that contains a few files: pr1.txt, with your interactive session
working through Chapters 6-9 of the Data Science book;
pr2.R, with the functions from problem 2 (below);
and pr3.R, with the functions from problem 3 (below).
For problem 1, create the file pr1.txt with the history from your interactive
session(s) working through Chapters 6-9 of the Data Science book.
If it's more convenient, or if you'd like to split the history into individual
files for each chapter, e.g., pr1_6.txt and so on, feel free to do that.
Here is the csv file with the state population data.
Suggestion: I installed RStudio first (Chapter 9) and then
worked through all of the examples - it's certainly not required, but it's helpful
because RStudio makes it very easy to save your command history: there is a button
in the top-right history pane that simply does this! Saving the history (instead
of the full give-and-take of the console) is completely OK, since RStudio makes
it easy to re-run history commands! You won't need to save the MyMode
function definition -- that will be submitted in pr2.R, next.
For problem 2, create the file pr2.R, which should include
at least the R functions listed below. You're welcome to include additional
"helper" functions, if you'd like.
- MyMode, the function carefully covered in Chapter 9
- MHall, a function that plays one round of the three-curtain "Monty Hall"
game from Let's Make a Deal, as we demoed in class. The signature of the function
should be MHall(chosen_curtain=1, sors="switch", verbose=TRUE), so that the default behavior
is to choose curtain #1 (from among 1, 2, and 3), to "switch" when given the
opportunity (the other option is "stay"), and to include
a bit of (verbose) explanatory chatter through the game. The MHall function
should return TRUE if the user wins the grand prize and FALSE otherwise.
- MHall_N, a function that plays N rounds of the MHall
game, always with verbose set to FALSE. The signature of the function
should thus be MHall_N(chosen_curtain=1, sors="switch", N=300) and the return
value should be the vector of TRUE and FALSE outputs from MHall
for all of the N games played.
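To make the two signatures concrete, here is one possible sketch of MHall and MHall_N. It is not the version demoed in class: the internal names (prize, openable, opened, final) are illustrative, and it assumes that the prize is placed uniformly at random and that Monty picks at random when he has two curtains he could open.

```r
# One round of the three-curtain game. Returns TRUE on a win, FALSE otherwise.
# ASSUMPTION: the prize location is uniform over curtains 1, 2, and 3.
MHall <- function(chosen_curtain = 1, sors = "switch", verbose = TRUE) {
  stopifnot(chosen_curtain %in% 1:3, sors %in% c("switch", "stay"))
  curtains <- 1:3
  prize <- sample(curtains, 1)

  # Monty opens a curtain that is neither the player's choice nor the prize.
  openable <- setdiff(curtains, c(chosen_curtain, prize))
  # Index explicitly: sample(x, 1) on a single number x would draw from 1:x.
  opened <- openable[sample.int(length(openable), 1)]

  # Switching means taking the one curtain that is neither chosen nor opened.
  if (sors == "switch") {
    final <- setdiff(curtains, c(chosen_curtain, opened))
  } else {
    final <- chosen_curtain
  }
  won <- final == prize

  if (verbose) {
    cat("You chose curtain", chosen_curtain, "\n")
    cat("Monty opens curtain", opened, "and offers you the chance to switch.\n")
    cat("You", sors, "and end up with curtain", final, "\n")
    cat(if (won) "You win the grand prize!\n" else "Sorry, no prize this time.\n")
  }
  won
}

# N rounds with verbose = FALSE; returns a logical vector of length N.
MHall_N <- function(chosen_curtain = 1, sors = "switch", N = 300) {
  vapply(seq_len(N),
         function(i) MHall(chosen_curtain, sors, verbose = FALSE),
         logical(1))
}
```

With a sketch like this, mean(MHall_N(sors = "switch")) estimates the fraction of games won. In theory, switching wins 2/3 of the time and staying wins 1/3, so the two estimates should land near those values.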
Problem 3 uses some of the Titanic passenger dataset,
which is part of Kaggle's introductory challenges. Kaggle
is a website that "makes Data Science a sport," by offering challenges in predictive
statistics of varying difficulty - some of financial interest to
different companies, with rewards attached. (You don't need to join Kaggle to
complete this assignment.)
Start by reading a bit of the background on the Titanic here. Then, download
this assignment's dataset of 742 labeled observations (passengers).
(The file does have headers.) You will want to read about what each
column in the dataset means.
Machine learning and predictive modeling are, at their essence, methods for writing
a single function. That function uses training data in order to successfully predict
behavior on new observations
(test data). This problem's goal is to write just such a function predict(obs) that
takes in a single row of Titanic-passenger data and outputs a 1 if your model
predicts that that passenger would have survived and 0 if your model
predicts that that passenger would have perished. Note that the obs (observation row)
will not provide the true answer, so some predictions will be correct and some may not.
So, for problem 3, create the file pr3.R, which should include
your hand-built predict function. You may use any mechanisms - such as if/else
conditionals, comparisons, loops, or other checks - in order to create your predict function.
We will try your function on a set of unseen test data to see how well it does. As you develop your
function, you can use the entire training data set to see how well it's doing.
(10 points will be based on your function's predictive power on the unseen
data - a little bit of motivation, but with no need to stress about it!)
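As an illustration of the shape such a function might take, here is a minimal sketch. The column names sex and age, the age threshold, and the rule itself are assumptions made up for this example, not a recommended model; check the column descriptions for the real names. Note also that defining your own predict masks R's built-in stats::predict for the rest of the session.

```r
# A hand-built predictor: takes one row, returns 1 (survived) or 0 (perished).
# ASSUMPTION: the data has columns named `sex` and `age`; the real file may differ.
predict <- function(obs) {
  if (is.na(obs$age)) {
    return(0)                   # missing age: fall back to a default guess
  }
  if (obs$sex == "female" && obs$age < 50) {
    return(1)                   # arbitrary illustrative rule
  }
  0
}

# Made-up rows (not real passengers), just to exercise the function.
toy <- data.frame(sex = c("female", "male", "female"),
                  age = c(30, 40, NA),
                  stringsAsFactors = FALSE)
predict(toy[1, ])  # 1
predict(toy[2, ])  # 0
predict(toy[3, ])  # 0
```

Handling missing values explicitly, as in the first check, keeps the function from stopping with an error when a comparison evaluates to NA.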
In class, we wrote a few warm-up and helper functions:
- pr0(obs) was the prediction function that always predicted that the passenger
- pr1(obs) was the prediction function that predicted that women survived and
- pr_df(pr,df) took as input the name of a
prediction function in the argument pr and a data frame in the argument df.
It then output a vector of predictions for each row (each observation) from that data frame df.
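A helper like pr_df can be sketched as follows. This is a reconstruction under stated assumptions, not the code from class: it uses match.fun so that pr may be either a function or its name as a string, and the predictor pr_example, the toy data frame, and the column names sex and survived are all hypothetical.

```r
# Apply a prediction function to every row of a data frame.
# Returns a numeric vector with one prediction per row.
pr_df <- function(pr, df) {
  pr <- match.fun(pr)  # accepts a function or the name of one
  vapply(seq_len(nrow(df)),
         function(i) pr(df[i, , drop = FALSE]),
         numeric(1))
}

# Hypothetical predictor and made-up data, for illustration only.
pr_example <- function(obs) if (obs$sex == "female") 1 else 0
toy <- data.frame(sex = c("female", "male", "male", "female"),
                  survived = c(1, 0, 1, 1),
                  stringsAsFactors = FALSE)

preds <- pr_df(pr_example, toy)          # 1 0 0 1
accuracy <- mean(preds == toy$survived)  # 0.75
```

Comparing the prediction vector to the labeled column, as in the last line, gives the fraction of training rows a predictor gets right, which is a quick way to compare candidate predict functions.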