# CS 147 Homework Assignment 2

This homework assignment is due at 12:00 AM (midnight) on Thursday, February 16, 2012, i.e., at the boundary between Wednesday and Thursday. Please hand your solutions to me, slide them under my door, or e-mail them to me.

I expect that it will take you about 3 hours to complete the assignment.

If you use a Microsoft product to do your graphing, be sure to turn off the stupid gray background, and ensure that color isn't essential for interpreting the graphs (since I might decide to print things on a B&W printer).

You are encouraged either to use a standard software tool or to write code to help solve these problems, so that you will have techniques you can use again in the future.
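
As one illustration of the kind of reusable code meant here, the following is a minimal Python sketch that uses only the standard library. The helper names (`read_values`, `summarize`, `mean_confidence_interval`) are hypothetical, and the one-number-per-line file layout is an assumption about the data files. The quartiles use the `inclusive` interpolation rule, which may differ slightly from other definitions, and the interval uses a normal quantile, which is a reasonable approximation only for large samples (small samples need a Student's t quantile instead).

```python
import math
import statistics
from statistics import NormalDist


def read_values(path):
    """Read one number per line, skipping blank lines (assumed file layout)."""
    with open(path) as f:
        return [float(line) for line in f if line.strip()]


def summarize(values):
    """Return the quartiles, mean, and sample standard deviation."""
    # method="inclusive" interpolates at position (n - 1) * p + 1; other
    # quartile definitions exist and can give slightly different answers.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # sample stdev, divides by n - 1
    }


def mean_confidence_interval(values, confidence=0.90):
    """Two-sided CI for the mean using a normal quantile (large-n assumption)."""
    n = len(values)
    mean = statistics.mean(values)
    std_err = statistics.stdev(values) / math.sqrt(n)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return mean - z * std_err, mean + z * std_err


if __name__ == "__main__":
    # Made-up numbers purely to exercise the helpers; not from any data file.
    sample = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
    print(summarize(sample))
    print(mean_confidence_interval(sample, 0.90))
```

Whether the resulting interval is actually valid for a given data set still depends on the sample size and the shape of the distribution, which the code does not check.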

1. The sizes (in bytes) of a set of HTML files are given in prob1.txt.
   1. What are the 1st and 3rd quartiles for this data?
   2. Are quartiles a good choice to describe dispersion in this case? If not, what would you use instead?
   3. What are the mean file size and the standard deviation?
   4. What are the 90% and 95% confidence intervals for the mean?
   5. Is your calculation valid? Why or why not?
   6. What is the 90% confidence interval for the proportion of files that are less than 20,000 bytes in size? Use the formula given in Jain.
   7. Is the confidence interval for the proportion valid?
   8. At 90% confidence, is the mean file size greater than 16K (16384) bytes?
2. The raw midterm scores for two sections of a class are given in prob2-1.txt and prob2-2.txt.
   1. Is either section better than the other at 90% confidence? Which?
   2. Is either section better at 80% confidence?
   3. Based on the data from the combined sections, how many students would have to take the midterm if we wanted the mean score to have a 99% confidence interval that was ±5% of the mean?
3. Correct timekeeping is very important in navigation. Traditionally, seafarers try to never reset their clocks; instead they calculate a daily drift rate and apply a correction factor. The file prob3.txt contains a series of observations of the number of days since a particular wristwatch was first started (first column) and the number of seconds of error exhibited by the watch relative to an atomic clock (second column). The columns are tab-separated, so they should be easy to import into a spreadsheet.
   1. Fit a linear regression to this data.
   2. Which of the fitted parameters are significant at 95% confidence?
   3. How much of the variation is explained by the regression?
   4. Based on the R-squared value, is the regression valid? (Show your calculations.)
   5. Using visual tests, verify or refute the validity of your regression model according to the four criteria listed on pages 235-237 of the textbook.
   6. What would the clock error be at day 730.000?
   7. At the equator, a 4-second clock error will produce a navigation error of exactly one nautical mile. On day 730, a navigator crossing the equator uses your regression model to correct the clock reading, then calculates her position. At 90% confidence, what is the plus/minus error introduced by the watch after the correction has been made, expressed in nautical miles?
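
For anyone who would rather write code than use a spreadsheet for the tab-separated data in problem 3, the following is a minimal Python sketch of an ordinary least-squares line fit. The function names (`read_two_columns`, `fit_line`) are hypothetical, the parser assumes the file has no header row, and the demo numbers are made up for illustration rather than taken from prob3.txt.

```python
import csv
import io


def read_two_columns(stream):
    """Parse tab-separated (x, y) rows; assumes there is no header row."""
    xs, ys = [], []
    for row in csv.reader(stream, delimiter="\t"):
        if len(row) >= 2:  # skip blank or malformed lines
            xs.append(float(row[0]))
            ys.append(float(row[1]))
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1, r_squared)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Total variation in y, and the part left unexplained by the line.
    sst = sum((y - y_bar) ** 2 for y in ys)
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1 - sse / sst
    return b0, b1, r_squared


if __name__ == "__main__":
    # Made-up observations lying near y = 2 + 0.5x, for illustration only.
    demo = io.StringIO("0\t2.1\n10\t6.9\n20\t12.0\n30\t17.1\n40\t21.9\n")
    xs, ys = read_two_columns(demo)
    b0, b1, r2 = fit_line(xs, ys)
    print(f"intercept={b0:.3f} slope={b1:.3f} R^2={r2:.4f}")
    print("predicted y at x=50:", b0 + b1 * 50)
```

With a real file, `read_two_columns(open("prob3.txt"))` would take the place of the `io.StringIO` demo. The sketch stops at the fitted parameters and R-squared; testing parameter significance or putting a confidence interval on a prediction additionally requires the standard deviation of the residuals and a t quantile, which it does not compute.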