Fall 2017
Question 1
The system.time() function measures the time it takes your computer to evaluate expressions. The input
can be any R command. To input multiple commands, enclose the commands in curly braces {}. Similar to
the behavior of loops, the input commands will be executed but not printed unless called within print().
For example, consider the following commands:
# How much time does it take to make a sequence of 10 million entries?
system.time(x <- 1:10000000)
## user system elapsed
## 0.046 0.017 0.082
# How much time does it take to make two sequences of 10 million entries?
system.time({
  x <- 1:10000000
  y <- 10000000:1
})
## user system elapsed
## 0.093 0.031 0.144
The user time is the CPU time spent executing the R commands themselves, the system time is the CPU time
the operating system spent on behalf of those commands (e.g., allocating memory or reading files), and the
elapsed time is the actual wall-clock time (e.g., if we were timing with a clock). The times are shown in seconds.
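The value returned by system.time() can also be stored and indexed by name, which is convenient when only the elapsed time is needed. A minimal sketch (the variable name timing is an illustrative choice, not part of the assignment):

```r
# Store the timing result instead of only printing it.
timing <- system.time({
  x <- 1:10000000
  y <- 10000000:1
})

# The result is a named numeric vector of class "proc_time".
class(timing)

# Extract a single component by name; the value is in seconds.
timing["elapsed"]
```

The exact numbers depend on the machine and on what else it is doing, so repeated runs will not agree to the last digit.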
For the components of this question, execute the following commands:
X <- rnorm(10000)
Y <- rnorm(10000)
(a) Repeated Vector Allocation: Create a storage vector Z of length 0. Write a for() loop such that the ith
iteration of the loop executes the following steps:
(1) Compute the sum of the ith entry of X with the ith entry of Y.
(2) Append the sum from (1) to the end of the current vector Z and save the result as Z.
Use system.time() to measure how long the for() loop takes to execute.
(b) Repeated Vector Assignment: Create a storage vector Z of length 10000. Write a for() loop such that
the ith iteration of the loop executes the following steps:
(1) Compute the sum of the ith entry of X with the ith entry of Y.
(2) Assign the sum from (1) to the ith entry of Z.
Use system.time() to measure how long the for() loop takes to execute.
(c) Vectorization: Use vectorization (not a loop) to compute the sums of the corresponding entries of X and
Y and save the sums to a vector Z. Use system.time() to measure how long the sums take to execute.
Compare the computation times (elapsed) between the three approaches.
Note: This question highlights why vectorized operations are preferred over for() loops whenever possible.
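As a rough illustration of the same three patterns on a different operation (squaring the entries of one vector rather than summing X and Y), the sketch below times each approach and confirms that they produce identical values. The names v, grow, fill, and vec are hypothetical, and the timings will vary from machine to machine.

```r
v <- rnorm(10000)

# Pattern 1: grow the result one element at a time.
grow <- numeric(0)
t.grow <- system.time(for (i in seq_along(v)) grow <- c(grow, v[i]^2))

# Pattern 2: fill a vector that was allocated at full length up front.
fill <- numeric(length(v))
t.fill <- system.time(for (i in seq_along(v)) fill[i] <- v[i]^2)

# Pattern 3: a single vectorized call, no loop.
t.vec <- system.time(vec <- v^2)

# All three approaches give the same values.
all.equal(grow, fill)
all.equal(fill, vec)

# Elapsed times side by side, in seconds.
c(grow = t.grow[["elapsed"]], fill = t.fill[["elapsed"]], vec = t.vec[["elapsed"]])
```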
The following information is used in Questions 2, 3, and 4.
Consider the dataset found at: http://www.math.hope.edu/isi/data/chap3/CollegeMidwest.txt
The data contains two variables gathered from the registrar at a small midwestern college on all students at
the college in spring 2011.
The variables are:
• OnCampus: Whether or not a student lives on campus (Y or N)
• CumGpa: The student’s cumulative GPA.
Since this is data on all students at the college, we will treat the students observed in this data to be the
population.
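A minimal sketch of reading data laid out this way, assuming the file is whitespace- or tab-delimited text with a header row holding the two variable names (the layout is an assumption; inspect the file if read.table() complains). The three rows in toy.text are made up to show the expected shape and are not taken from the real file; the names toy.text, toy, and college are hypothetical.

```r
# Made-up rows (NOT from the real file) in the assumed layout.
toy.text <- "OnCampus\tCumGpa\nY\t3.2\nN\t3.6\nY\t2.9"
toy <- read.table(text = toy.text, header = TRUE)

str(toy)             # check column names and types
table(toy$OnCampus)  # counts of on-campus (Y) and off-campus (N) rows

# With network access, the same call pattern should apply to the real file:
# college <- read.table("http://www.math.hope.edu/isi/data/chap3/CollegeMidwest.txt",
#                       header = TRUE)
```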
Question 2
(a) Set the seed to 24601 and simulate the sampling distribution of the difference in mean cumulative GPA
between the students who live off campus and the students who live on campus. Simulate the difference
in sample means from 1000 random samples of size 30.
(b) Plot a histogram of the sampling distribution of differences in sample means from part (a). Add vertical
lines that show the differences in sample means that are 2 standard errors away from the mean.
(c) Compute the mean and standard deviation of the simulated distribution of differences in sample means.
Use these values to superimpose a normal curve over the histogram.
(d) Suppose we observe a random sample of size 30 in which the difference in mean cumulative
GPA between off campus and on campus students is 0.48. Based on your approximate sampling
distribution in part (a), what is the approximate probability of observing a difference in sample means
greater than 0.48?
Question 3
Suppose we are interested in using a random sample of 30 students to decide if the mean cumulative GPA of
the population of students at the College of the Midwest is different from 3.5 or not. The null and alternative
hypotheses are given by
H0 : µ = 3.5
Ha : µ ≠ 3.5
The t.test() function performs one and two sample t-tests on vectors of numeric data. The basic syntax for
t.test() is t.test(x,y,alternative,mu,conf.level).
• The t.test() function inputs a vector x of values from your sample and conducts a one-sample t-test
for the mean. If a second vector in the argument y is included, t.test() will conduct a two-sample
t-test for a difference in means.
• The alternative argument inputs a character value of "two.sided", "greater", or "less", depending
on the alternative hypothesis we are considering. By default, t.test() will conduct a two-sided
hypothesis test (i.e., alternative="two.sided").
• The mu argument inputs a numeric value that specifies the value of the mean parameter µ under the
null hypothesis. The default value is mu = 0, i.e., the default null hypothesis is µ = 0.
• In addition to a hypothesis test, the t.test() function also outputs a confidence interval, with confidence
level set by the conf.level argument. By default, the confidence level is set to conf.level=0.95.
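To make the argument list concrete, here is a small sketch on simulated data. The vectors a and b and their means and standard deviations are made up for illustration and have nothing to do with the GPA data.

```r
set.seed(1)
a <- rnorm(30, mean = 3.3, sd = 0.4)
b <- rnorm(30, mean = 3.1, sd = 0.4)

# One-sample, two-sided test of H0: mu = 3.5, with a 90% confidence interval.
one <- t.test(a, mu = 3.5, alternative = "two.sided", conf.level = 0.90)

# Two-sample test of H0: mu_a - mu_b = 0 against Ha: mu_a - mu_b > 0.
two <- t.test(a, b, alternative = "greater")

one$p.value
two$p.value
```

Note that when two vectors are supplied, t.test() runs Welch's unequal-variance test unless var.equal = TRUE is set.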
(a) Set the seed to 20 and draw a random sample of size 30 from the CollegeMidwest.txt data.
(b) For the random sample in (a), compute the observed t-statistic t = (x̄ − µ)/(s/√n), where x̄ is the sample mean,
s is the sample standard deviation, and n is the sample size. How would you interpret this value?
(c) Use the t.test() function to conduct a one-sample t-test to decide if the true mean cumulative GPA
of all the students at the College of the Midwest is different from 3.5. Use a significance level of α = 0.05.
(d) What are the mode and class of the output of t.test() from (c)? Use this information to extract the
95% confidence interval vector from the t.test() output object. Is 3.5 inside this interval? What does
this say about whether the true mean cumulative GPA is 3.5 or not?
Note: The t-test (and t.test()) relies on the normal approximation to the sampling distribution of the
sample mean. When conducting a t-test, it is assumed that the conditions for the Central Limit Theorem are
satisfied.
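The t-statistic formula from part (b) can be checked against t.test() on any numeric vector. A sketch on simulated data (the vector g and the null value mu0 are hypothetical stand-ins, not the sample the question asks for):

```r
set.seed(2)
g <- rnorm(30, mean = 3.3, sd = 0.4)
mu0 <- 3.5

# t = (xbar - mu0) / (s / sqrt(n))
t.by.hand <- (mean(g) - mu0) / (sd(g) / sqrt(length(g)))

# t.test() reports the same statistic.
fit <- t.test(g, mu = mu0)
all.equal(t.by.hand, unname(fit$statistic))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom.
p.by.hand <- 2 * pt(-abs(t.by.hand), df = length(g) - 1)
all.equal(p.by.hand, fit$p.value)
```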
Question 4
The confidence level refers to the long-run proportion of random samples (of a fixed size) whose confidence
intervals contain the true population parameter. We want to illustrate this by simulation.
(a) Suppose conducting a survey of students from the College of the Midwest consists of the following steps:
(1) Select a random sample of 30 students.
(2) Compute the mean cumulative GPA for the 30 students in the sample.
(3) Construct a 95% confidence interval for the population mean cumulative GPA.
Set the seed to 9999 and repeat steps 1, 2, and 3 a total of 10000 times. For each random sample,
calculate x̄ and construct a 95% confidence interval.
Hint: You can use the t.test() function from the previous question to construct the 95%
confidence interval.
(b) Use the full data to compute the true mean cumulative GPA for the population of all students at the
College of the Midwest. Find the proportion of the 10000 confidence intervals that contain the true
population mean. Is this proportion consistent with what you expected?
(c) Extra Credit: Create a plot of the first 100 confidence intervals. Be sure that the plot satisfies the
following criteria:
• The limits of the axes should be large enough to contain the lengths of all of the confidence
intervals.
• Represent each sample mean by a point and each corresponding confidence interval by a line
segment through the point.
• Color the points and intervals to correspond to whether the interval was successful at
capturing the true population mean. In other words, use one color for the intervals that
contain the true mean, and use a different color for the intervals that do not contain the
true mean.
• Add a straight line that shows the true population mean.
• Add a legend that explains the color coding in the plot.
Hint: Use the segments() function. You do not need a for() loop to create this plot.
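The long-run interpretation of the confidence level stated at the start of this question can be sketched on a synthetic population whose mean is known by construction. Everything below (the normal population, the seed, the number of repetitions, and the names pop, ci, covers, and n.rep) is a made-up illustration rather than the College of the Midwest data.

```r
set.seed(123)
pop <- rnorm(5000, mean = 50, sd = 10)  # synthetic population
true.mean <- mean(pop)
n.rep <- 2000

# Each column holds the lower and upper limits of one 95% interval
# built from a fresh random sample of size 30.
ci <- replicate(n.rep, t.test(sample(pop, 30))$conf.int)

# An interval succeeds if it contains the true population mean.
covers <- ci[1, ] <= true.mean & true.mean <= ci[2, ]

# Proportion of successful intervals; expected to be near 0.95.
mean(covers)
```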
Question 5
Consider the while() loop below that computes all Fibonacci numbers less than 500.
# fib1 and fib2 will represent the two latest terms in the sequence.
fib1 <- 1 # Initialize fib1
fib2 <- 1 # Initialize fib2
# Create the vector to store the output from the while loop.
full.fib <- c(fib1,fib2)
# While the sum of the last two terms is less than 500, execute the following commands.
while(fib1 + fib2 < 500){
  # Save the latest term to old.fib2.
  old.fib2 <- fib2
  # Compute the sum of the latest two terms and assign the sum to be the new latest term.
  fib2 <- fib1 + fib2
  # Append the latest term to the end of the full.fib vector with all previous terms.
  full.fib <- c(full.fib,fib2)
  # Save the previously latest term (now the second to last term) to fib1.
  fib1 <- old.fib2
}
# Print the output from the while loop.
full.fib
## [1] 1 1 2 3 5 8 13 21 34 55 89 144 233 377
(a) The variable old.fib2 is not actually necessary. Rewrite the while() loop with the update of fib1
based on just the current values of fib1 and fib2.
(b) In fact, fib1 and fib2 are not necessary either. Rewrite the while() loop without using any variables
except full.fib.
(c) Determine the number of Fibonacci numbers less than 1000000.