Claremont Graduate University
Data Science Programming (IST 380)
Fall 2013

Hw #2: Predictive statistics


HW 2 ~ due Tuesday, Feb. 12, 2013


New! Here is a link to the example functions from class. In particular, there is a Monty Hall example called guesser_and_monty_hall.R, as well as a starter file for the Titanic problem named pr3.R. (The other files are some of the console interactions in class.)

For this assignment, submit a zipped folder that contains a few files: pr1.txt, which holds your interactive session working through Chapters 6-9 of the Data Science book; pr2.R, which contains the functions from problem 2 (below); and pr3.R, which contains the functions from problem 3 (below).

  1. 30 pts.
    For problem 1, create the file pr1.txt with the history from your interactive session(s) working through Chapters 6-9 of the Data Science book.

    Here is the csv file with the state population data.

    If it's more convenient, feel free to split the history into individual files for each chapter, e.g., pr1_6.txt and so on. Suggestion: I installed RStudio first (Chapter 9) and then worked through all of the examples. It's certainly not required, but it's helpful because RStudio makes it very easy to save your command history: there is a button in the top-right history pane that does exactly this! Saving the history (instead of the full give-and-take of the console) is completely OK, since RStudio makes it easy to re-run history commands. You won't need to save the MyMode function definition here; that will be submitted in pr2.R, next.

  2. 30 pts.
    For problem 2, create the file pr2.R, which should include at least these R functions. You're welcome to include additional, "helper" functions, if you'd like:
    • MyMode, the function carefully covered in Chapter 9

    • MHall, a function that plays one round of the three-curtain "Monty Hall" game from Let's Make a Deal, as we demoed in class. The signature of the function should be MHall(chosen_curtain=1, sors="switch", verbose=TRUE), so that the default behavior is to choose curtain #1 (from among 1, 2, and 3), to "switch" when given the opportunity (the other option is "stay"), and to include a bit of (verbose) explanatory chatter through the game. The MHall function should return TRUE if the user wins the grand prize and FALSE otherwise.

    • MHall_N, a function that plays N rounds of the MHall game, always with verbose set to FALSE. The signature of the function should thus be MHall_N(chosen_curtain=1, sors="switch", N=300) and the return value should be the vector of TRUE and FALSE outputs from MHall for all of the N games played.
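    The two Monty Hall functions described above might be sketched as follows. This is only one possible approach under stated assumptions, not the version demoed in class: the prize is assumed to be placed uniformly at random, Monty is assumed to open a non-prize curtain other than the player's pick, and the wording of the verbose messages is made up for illustration.

```r
# Play one round of the three-curtain game; returns TRUE on a win, FALSE otherwise.
MHall <- function(chosen_curtain = 1, sors = "switch", verbose = TRUE) {
  sors <- match.arg(sors, c("switch", "stay"))  # reject anything else
  curtains <- 1:3
  prize <- sample(curtains, 1)  # assumption: prize placed uniformly at random

  # Monty opens a curtain that is neither the player's pick nor the prize.
  openable <- setdiff(curtains, c(chosen_curtain, prize))
  # Index explicitly: sample(x, 1) on a single number x would sample from 1:x.
  opened <- openable[sample(length(openable), 1)]

  if (sors == "switch") {
    final <- setdiff(curtains, c(chosen_curtain, opened))  # the one curtain left
  } else {
    final <- chosen_curtain
  }
  won <- (final == prize)

  if (verbose) {
    cat("You chose curtain", chosen_curtain, "\n")
    cat("Monty opens curtain", opened, "\n")
    cat("You", sors, "and end with curtain", final, "\n")
    cat(if (won) "You win the grand prize!\n" else "Sorry, no prize this time.\n")
  }
  won
}

# Play N quiet rounds and collect the TRUE/FALSE results in a vector.
MHall_N <- function(chosen_curtain = 1, sors = "switch", N = 300) {
  replicate(N, MHall(chosen_curtain, sors, verbose = FALSE))
}
```

    Taking mean(MHall_N(sors = "switch")) gives the fraction of wins; over many rounds it should land near 2/3 for "switch" and near 1/3 for "stay".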

  3. 40 pts.
    This problem uses some of the Titanic passenger dataset, which is part of Kaggle's introductory challenges. Kaggle is a website that "makes Data Science a sport," by offering challenges in predictive statistics of varying difficulty - some of financial interest to different companies, with rewards attached. (You don't need to join Kaggle to complete this assignment.)

    Start by reading a bit of the background on the Titanic here. Then, download this assignment's dataset of 742 labeled observations (passengers). (The file does have headers.) You will want to read about what each column in the dataset means.

    Machine learning and predictive modeling are, at their essence, methods for writing a single function. That function uses training data in order to successfully predict behavior on new observations (test data). This problem's goal is to write just such a function predict(obs) that takes in a single row of Titanic-passenger data and outputs a 1 if your model predicts that that passenger would have survived and 0 if your model predicts that that passenger would have perished. Note that the obs (observation row) will not provide the true answer, so some predictions will be correct and some may be incorrect.

    So, for problem 3, create the file pr3.R, which should include your hand-built predict function. You may use any mechanisms - such as if/else conditionals, comparisons, loops, or other checks - in order to create your predict function. We will try your function to see how well it does on a set of unseen test data. As you develop your function, you can use the entire training data set to see how well it's doing. (10 points will be based on your function's predictive power on the unseen data - a little bit of motivation, but with no need to stress about it!)
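    As a rough illustration of checking a hand-built predict function against labeled data, here is a minimal sketch. Everything specific in it is an assumption: the column names (survived, pclass, sex, age) are guesses to be checked against the real file's headers, the six rows are made-up toy data rather than real passengers, the rule itself is arbitrary, and accuracy is a hypothetical helper name.

```r
# Made-up toy rows standing in for the real training file. The column names
# are assumptions; check them against the headers of the actual dataset.
toy <- data.frame(
  survived = c(1, 0, 1, 0, 1, 1),
  pclass   = c(1, 3, 2, 3, 1, 3),
  sex      = c("female", "male", "female", "male", "male", "female"),
  age      = c(29, 40, 8, 22, 54, NA),
  stringsAsFactors = FALSE
)

# An arbitrary illustrative rule: predict survival for women, and for young
# children in first or second class. Missing ages are skipped, not guessed.
# (Defining predict here masks R's built-in generic of the same name.)
predict <- function(obs) {
  if (obs$sex == "female") return(1)
  if (!is.na(obs$age) && obs$age < 10 && obs$pclass < 3) return(1)
  0
}

# Hypothetical helper: fraction of rows whose prediction matches the label.
accuracy <- function(pr, df) {
  preds <- sapply(seq_len(nrow(df)), function(i) pr(df[i, , drop = FALSE]))
  mean(preds == df$survived)
}

accuracy(predict, toy)
```

    Running the same accuracy check on the full training data after each change to the rule shows whether the change helped, though a high training score does not guarantee the same score on the unseen test data.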

    In class, we wrote a few warm-up and helper functions:
    • pr0(obs) was the prediction function that always predicted that the passenger perished.
    • pr1(obs) was the prediction function that predicted that women survived and men perished.
    • pr_df(pr,df) took as input the name of a prediction function in the argument pr and a data frame in the argument df. It then output a vector of predictions for each row (each observation) from that data frame df.
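    For reference, the three helpers listed above could look roughly like this. It is a reconstruction from the descriptions only, not the code written in class; the column name sex and its "female"/"male" values are assumptions about the dataset, and demo is a made-up data frame used only to exercise the functions.

```r
# Always predict that the passenger perished.
pr0 <- function(obs) {
  0
}

# Predict that women survived and men perished.
# Assumes a column named `sex` holding "female" or "male".
pr1 <- function(obs) {
  if (obs$sex == "female") 1 else 0
}

# Apply the prediction function `pr` to every row of `df` and return the
# vector of predictions. match.fun lets `pr` be passed either as a function
# or as its name in quotes.
pr_df <- function(pr, df) {
  pr <- match.fun(pr)
  # drop = FALSE keeps each row a one-row data frame, so obs$sex still works
  sapply(seq_len(nrow(df)), function(i) pr(df[i, , drop = FALSE]))
}

# Made-up rows, only to show the calls.
demo <- data.frame(sex = c("female", "male", "female"), stringsAsFactors = FALSE)
pr_df(pr0, demo)
pr_df(pr1, demo)
```

    Because pr_df accepts any function of a single row, a hand-built predict function can be passed to it in exactly the same way.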