• For Part A and Part B, please install the tidyverse package using install.packages() so that you can
use the dplyr library. Once you have done that, use the read_csv() function, not the read.csv()
function, to read the csv files.
• You will need to write your answers as a markdown document. As a precaution, please also submit
the corresponding html files.
Part A (40 marks)
The file household.csv contains (fictional) data from a survey of 500 randomly selected households.
a. Indicate the type of data (categorical or continuous) for each of the variables included in the
survey.
b. For each of the categorical variables in the survey, indicate whether you believe the variable is
nominal or ordinal.
c. Create a histogram of Debt. What does the histogram tell you about debt?
d. Find the maximum and minimum debt levels for the households in this sample.
e. Report the indebtedness levels at each of the 25th, 50th, and 75th percentiles.
f. Report and interpret the interquartile range for the indebtedness levels of households.
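As a rough illustration of parts d to f, the base R functions range(), quantile(), and IQR() return these summaries for a numeric vector. The debt values below are made up for the sketch and are not taken from household.csv; the vector name debt is also an assumption.

```r
# Hypothetical debt values, for illustration only (not from household.csv).
debt <- c(1200, 3400, 560, 7800, 2500, 4100, 980, 6100)

# Minimum and maximum in one call.
debt_range <- range(debt)

# 25th, 50th, and 75th percentiles (R's default quantile method).
debt_quartiles <- quantile(debt, probs = c(0.25, 0.50, 0.75))

# Interquartile range: 75th percentile minus 25th percentile.
debt_iqr <- IQR(debt)

print(debt_range)
print(debt_quartiles)
print(debt_iqr)
```

The interquartile range describes the spread of the middle 50% of the values, so it is less sensitive to extreme households than the full range.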
Part B (40 marks)
The file SupermarketTransactions.csv contains data on over 14,000 transactions. There are
two numeric variables, Units Sold and Revenue. The first of these is discrete and the second is
continuous. For each of the following, do whatever it takes to create a bar chart of counts for Units Sold
and a histogram of Revenue for each of the subpopulations of purchases given below.
a. All purchases made during January and February of 2008.
b. All purchases made by married female homeowners in the state of California.
Write a summary of fewer than 100 words that describes your analysis.
Use the date conversion facility in R to convert the dates, which are stored as strings, to the Date format by
adapting the following example from this link: dates <- as.Date(strDates, "%m/%d/%Y"). To compare dates,
you can refer to this link.
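A minimal sketch of the conversion and comparison, assuming the dates are stored as month/day/year strings. The sample strings and the name in_window are hypothetical; check the actual format in SupermarketTransactions.csv before reusing the format string.

```r
# Hypothetical date strings in month/day/year form (an assumption about the file).
strDates <- c("01/15/2008", "02/28/2008", "03/01/2008")

# Convert the character strings to Date objects.
dates <- as.Date(strDates, "%m/%d/%Y")

# Date objects can be compared with the usual relational operators.
in_window <- dates >= as.Date("2008-01-01") & dates <= as.Date("2008-02-29")

# Keep only the dates that fall in January or February 2008.
print(dates[in_window])
```

The logical vector in_window can be used in the same way to pick out the matching rows of a data frame.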

Part C
All of you must have heard about the central limit theorem (CLT). If not, have a look at this video.
a. Then run the following R commands. Please spend some time trying to understand the code well.
rnorm2 <- function(n,mean,sd) { mean+sd*scale(rnorm(n)) }
set.seed(1239)
r1 <- rnorm2(100,25,4)
r2 <- rnorm2(50,10,3)
samplingframe <- c(r1,r2)
hist(samplingframe, breaks=20,col = "pink")
Please describe the distribution that you obtain in one or two sentences.
Hint for parts b and c: use the replicate and apply functions
b. Draw 50 samples of size 15 from the sampling frame in part a, and plot the sampling distribution
of means as a histogram.
c. Draw 50 samples of size 45 from the sampling frame in part a, and plot the sampling distribution
of means as a histogram.
d. Please ensure that the distributions in parts b and c are side-by-side on the same plot. Explain the
three histograms in terms of their differences and similarities (in fewer than 25 words).
e. Explain CLT in your own words in one or two sentences.
f. Does this exercise help you understand CLT? If so why? If not, why not? Restrict your response to
one or two sentences.
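The hint about replicate and apply can be illustrated with a short sketch. It uses a hypothetical stand-in population rather than the sampling frame from part a, an arbitrary seed, and sampling without replacement (the default of sample()); all of these are assumptions, and the names population, samples, and sample_means are illustrative only.

```r
set.seed(42)  # arbitrary seed, chosen only to make the sketch reproducible

# Hypothetical stand-in population; substitute the sampling frame from part a.
population <- c(rnorm(100, mean = 25, sd = 4), rnorm(50, mean = 10, sd = 3))

# replicate() returns a 15 x 50 matrix: each column is one sample of size 15,
# drawn without replacement (an assumption; the task does not specify).
samples <- replicate(50, sample(population, size = 15))

# apply() over columns (MARGIN = 2) gives one mean per sample.
sample_means <- apply(samples, 2, mean)

# Two panels side by side on the same plotting device.
old_par <- par(mfrow = c(1, 2))
hist(population, main = "Population")
hist(sample_means, main = "Means of 50 samples (n = 15)")
par(old_par)  # restore the previous plotting layout
```

Changing the size argument and repeating the last steps gives the sampling distribution for a different sample size.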