This file contains instructions for your midterm exam. You must upload an .Rmd file to your
GitHub repo as well as the knitted .html by 11pm on February 28. Late submissions are not
allowed.
Read the instructions VERY carefully. Follow them!
Do not copy my instructions into your paper! Do not copy your writeup from HW 1 (or 2 or 3)
into your paper. This will result in an automatic deduction of 50 points. If you use any outside
references (Wikipedia, etc.) you must give appropriate credit and use your own words. Text
that is summarized from outside references without attribution, or directly copied (even with
attribution), will result in an automatic deduction of 50 points for each instance. I will use
software which checks for code copied between your submissions and checks for plagiarism against
an internet database.
Statement on collaboration
By writing your full name in this section, you verify that you have not discussed this exam with anyone
except for the instructor or the TA. You also verify that you have not copied materials from anywhere and
have not used any material other than the textbook or the materials on this semester’s GitHub repo. Finally,
you affirm that you have adhered to the “Academic Integrity”, “Note selling”, and “Solutions” sections on
the course syllabus.
Writing functions, Linear models, and model selection
1. Write a function which generates data from a linear model. Your function should accept three arguments:
the number of observations, a vector of betas, and the standard deviation of the noise. The standard
deviation should have a default value of 1. Your function should check that the standard deviation
satisfies any necessary constraints. Check that beta has at least one entry. Each entry of the design
matrix should be i.i.d. Uniform between -10 and 10.
Your response vector must be equal to the beta vector times the design matrix plus mean zero, i.i.d.
normal noise with the appropriate standard deviation. Your function must return a data frame. The first
variable should be the response, named y. The remaining columns should be named X1 through Xp where
p=length(betas). You may need to use the paste0 command to create the names for these columns. Try
?paste0 for help. (Hint: Examine what happens when you create the data frame. You may not need to
rename things if you’re careful.)
Be sure to use good coding practices. Either comment lines or use good names. If I can’t understand your
function, you will lose points. If your function does not correctly handle the inputs, you will lose points. This
means you need to check for possible errors and return warnings. Display this code at the top of this section
(echo=TRUE).
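A minimal sketch of such a generator. The function name `simulate_lm`, the error messages, and the placeholder seed below are illustrative choices, not requirements:

```r
# Generate n observations from y = X beta + noise, noise ~ N(0, sigma^2).
simulate_lm <- function(n, beta, sigma = 1) {
  # sigma must be a single positive number; beta must be non-empty
  if (!is.numeric(sigma) || length(sigma) != 1 || sigma <= 0) {
    stop("sigma must be a single positive number")
  }
  if (length(beta) < 1) stop("beta must have at least one entry")
  p <- length(beta)
  # design matrix: entries i.i.d. Uniform(-10, 10)
  X <- matrix(runif(n * p, min = -10, max = 10), nrow = n, ncol = p)
  # response: X %*% beta plus i.i.d. mean-zero normal noise
  y <- drop(X %*% beta) + rnorm(n, mean = 0, sd = sigma)
  dat <- data.frame(y = y, X)                    # columns auto-named X1, ..., Xp
  names(dat) <- c("y", paste0("X", seq_len(p)))  # explicit, per the hint
  dat
}

# Usage for part 2 (replace 1234567890 with your own 10-digit ID):
set.seed(1234567890)
dat <- simulate_lm(n = 400, beta = c(5, 0))
```

Note the hint in action: `data.frame(y = y, X)` already names the matrix columns `X1` through `Xp`, so the explicit `names()` call is belt-and-suspenders.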
2. Before doing the following, set the seed (set.seed()) to your 10-digit university ID number. Call your
function to generate data with n=400 observations and beta=c(5,0).
3. Estimate all possible linear models with two predictors and interaction. Use half your data. That
is, regress y on the intercept only. Then regress y on X1. Then regress y on X2. Then on X1 and X2.
Then on X1*X2, then on X1 and X1*X2, etc. This should result in 8 different models. All models have
an intercept, but the first has only the intercept. Note: if you type formula('y~X1*X2') this will be
expanded to 'y~X1+X2+X1:X2'. You need to use 'y~I(X1*X2)' to avoid this behavior.
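One way to enumerate the eight formulas programmatically (the bit-mask trick is just one option; writing the eight formulas out by hand is equally fine):

```r
# All subsets of {X1, X2, I(X1*X2)}; every model keeps the intercept,
# and the empty subset gives the intercept-only model y ~ 1.
model_terms <- c("X1", "X2", "I(X1 * X2)")
formulas <- lapply(0:7, function(k) {
  keep <- model_terms[bitwAnd(k, c(1L, 2L, 4L)) > 0]  # decode subset k
  rhs  <- if (length(keep) == 0) "1" else paste(keep, collapse = " + ")
  as.formula(paste("y ~", rhs))
})
# fits <- lapply(formulas, lm, data = first_half)  # first_half: your chosen half
```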
4. Use cross validation (any variety) on that same half to choose the best model of the 8.
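A plain K-fold sketch (the fold count and the names `cv_mse` and `first_half` are my own choices; any CV variety satisfies the instruction):

```r
# Mean cross-validated prediction error of one formula on one data set.
cv_mse <- function(form, data, K = 5) {
  fold <- sample(rep(1:K, length.out = nrow(data)))  # random fold labels
  errs <- sapply(1:K, function(k) {
    fit  <- lm(form, data = data[fold != k, ])       # fit on K-1 folds
    pred <- predict(fit, newdata = data[fold == k, ])
    mean((data$y[fold == k] - pred)^2)               # test error on fold k
  })
  mean(errs)
}
# scores <- sapply(formulas, cv_mse, data = first_half)
# best   <- formulas[[which.min(scores)]]
```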
5. For your chosen model use the other half to estimate that linear model. Provide a table which contains
the estimated coefficient(s) and 95% confidence intervals.
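A sketch of the refit-and-tabulate step. The simulated `second_half` below only makes the example self-contained; use your actual held-out half and CV-chosen formula:

```r
set.seed(3)  # stand-in data so the sketch runs on its own
second_half <- data.frame(X1 = runif(200, -10, 10), X2 = runif(200, -10, 10))
second_half$y <- 5 * second_half$X1 + rnorm(200)

final_fit <- lm(y ~ X1, data = second_half)  # substitute your chosen formula
ci_table  <- cbind(Estimate = coef(final_fit), confint(final_fit, level = 0.95))
ci_table  # knitr::kable(ci_table) renders this as a clean table when knitting
```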
6. Do any of your intervals include zero? Is this to be expected?
The bootstrap
1. Load the data set bootdata.Rdata (using load("bootdata.Rdata")).
2. Examine a pairs plot of the continuous predictors against each other. Do you notice anything special?
3. Fit a linear model to the data. Regress y on all three x variables without intercept.
4. Examine the usual regression diagnostics (at least a qq-plot of the residuals and a plot of the residuals
against the fitted values). What do you notice? How do you feel about this model?
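A diagnostics sketch. I am assuming here that the predictors are named X1, X2, X3 (check the actual names after loading bootdata.Rdata); the simulated stand-in exists only so the sketch runs on its own:

```r
set.seed(4)  # stand-in for bootdata; replace with load("bootdata.Rdata")
bootdata <- data.frame(X1 = rnorm(100), X2 = rnorm(100), X3 = rnorm(100))
bootdata$y <- with(bootdata, X1 + 2 * X2 - X3 + rt(100, df = 3))

fit <- lm(y ~ X1 + X2 + X3 - 1, data = bootdata)  # "- 1" removes the intercept
qqnorm(residuals(fit)); qqline(residuals(fit))    # check residual normality
plot(fitted(fit), residuals(fit),                 # look for leftover structure
     xlab = "Fitted values", ylab = "Residuals")
```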
5. Produce a table with 7 columns. The first column should be the estimated coefficients from your model.
Then produce 3 sets of 95% confidence intervals (two columns each). The first set (columns 2 and 3)
should be those from your fitted model. The second set should be produced by the residual bootstrap
(resample the residuals). The third set should be from the nonparametric bootstrap (resample the
rows of the data). If you understand Chapter 6 (and how to use the code from rr7 or ic6) this will be
easy. Comment on the confidence intervals. Which one(s) are most appropriate?
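A sketch of the 7-column table. The number of replicates B, the helper names, and the simulated stand-in are my choices; swap in the loaded bootdata and your no-intercept formula:

```r
set.seed(5)  # stand-in data so the sketch is self-contained
dat <- data.frame(X1 = rnorm(100), X2 = rnorm(100), X3 = rnorm(100))
dat$y <- with(dat, X1 + 2 * X2 - X3 + rnorm(100))
form <- y ~ X1 + X2 + X3 - 1
fit  <- lm(form, data = dat)
B    <- 1000

# residual bootstrap: keep the design fixed, resample the residuals
res_boot <- replicate(B, {
  db   <- dat
  db$y <- fitted(fit) + sample(residuals(fit), replace = TRUE)
  coef(lm(form, data = db))
})
# nonparametric bootstrap: resample whole rows of the data
row_boot <- replicate(B, {
  db <- dat[sample(nrow(dat), replace = TRUE), ]
  coef(lm(form, data = db))
})

# 95% percentile intervals from each set of bootstrap draws
pct <- function(draws) t(apply(draws, 1, quantile, probs = c(0.025, 0.975)))
tab <- cbind(Estimate = coef(fit), confint(fit), pct(res_boot), pct(row_boot))
colnames(tab) <- c("Estimate", "lm lo", "lm hi",
                   "resid lo", "resid hi", "rows lo", "rows hi")
```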
Data analysis
For this exercise, while there are wrong answers, there are many possible right answers. Any data
analysis decisions or conclusions that you make should be justified and explained. Your job is to
correctly analyze the data, not force the analysis to match a preconceived idea. Discuss whether
or not your results match your hypothesis.
Data and research problem
This assignment will look at economic mobility across generations in the contemporary USA. The data come
from a large study, based on tax records, which allowed researchers to link the income of adults to the income
of their parents several decades previously. For privacy reasons, we don’t have that individual-level data,
but we do have aggregate statistics about economic mobility for several hundred communities, containing
most of the American population, and covariate information about those communities. We are interested in
predicting economic mobility from the characteristics of communities.
The data file mobility.Rdata has information on 741 communities (technically, “commuting zones”; these
include cities and their suburbs and exurbs, but also many rural areas with integrated economies). The
variable we want to predict is economic mobility; the rest are predictor variables (covariates).
• Mobility: The probability that a child born in 1980–1982 into the lowest quintile (20%) of household
income will be in the top quintile at age 30. Individuals are assigned to the community they grew up
in, not the one they were in as adults.
• Population in 2000.
• Is the community primarily urban or rural? (yes or no).
• Racial segregation: a measure of residential segregation by race (low, medium, high).
• Mean income: Average income per capita in 2000.
• School expenditures: Average spending per pupil in public schools.
• Student/teacher ratio: Number of students in public schools divided by number of teachers.
• College graduation rate: Residuals from a linear regression of the actual graduation rate on household
income per capita.
• Longitude: Geographic coordinate for the center of the community.
• Latitude: Ditto
• ID: A numerical code, identifying the community.
• Name: the name of principal city or town.
• State: the state of the principal city or town of the community.
Some of these variables are missing for some communities, and this may make a difference for some questions.
Specific Analytical Questions
You are to investigate the following question (briefly):
What is the relationship between measures of educational effectiveness and economic mobility?
Suggested Outline
1. Introduction Write four to five sentences introducing the research problem and describing the specific
research hypothesis. Cite any information sources in parentheses or foot- or endnotes.
2. EDA
• How many observations do you have? What variables are available?
• Make a map of economic mobility over the US. What patterns do you see?
• Make a pairs plot of all continuous predictors.
• Are there any outliers to worry about?
• Describe each of these plots and any patterns you see.
• Which variables seem associated with mobility?
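The map bullet above can be sketched with base graphics (the bin count and palette are my choices; ggplot2 or the maps package work too). The simulated coordinates below are a stand-in so the sketch runs on its own; use the real data after load("mobility.Rdata"):

```r
set.seed(6)  # stand-in for the mobility data frame
mobility <- data.frame(Longitude = runif(200, -125, -67),
                       Latitude  = runif(200, 25, 49),
                       Mobility  = runif(200, 0, 0.3))

bins <- cut(mobility$Mobility, breaks = 5)  # discretize mobility for coloring
plot(mobility$Longitude, mobility$Latitude, col = heat.colors(5)[bins],
     pch = 19, xlab = "Longitude", ylab = "Latitude",
     main = "Economic mobility by community")
```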
3. Initial modeling You will examine two models:
1. Linearly regress Mobility on Income, School_spending, Student_teacher_ratio, Student_teacher_ratio^2,
Graduation, and Urban*Racial_Segregation (as well as their levels).
2. Use npreg to regress Mobility on Income, School_spending, Student_teacher_ratio,
Graduation, Urban, and Racial_Segregation.
Why don’t we include interactions or squared terms in the non-parametric regression? Estimate the models
using all the data (no need to do sample splitting in this exercise). For each model, examine the residuals
against the fitted values and check modelling assumptions with QQ-plots and plots of the predictors against
the fitted values. Describe what you see. DO NOT include any of these figures in the report. That is,
write code to generate them, look at them, discuss them with words, then hide them from the rendered file
(eval=FALSE).
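One reading of the two specifications (I interpret "Urban*Racial_Segregation (as well as their levels)" as the R formula Urban * Racial_Segregation, which includes both factors and their interaction; npreg comes from the np package). The stand-in data frame exists only so the lm sketch runs on its own:

```r
set.seed(7)  # stand-in with the documented variable names; use the real data
mobility <- data.frame(
  Mobility = runif(300, 0, 0.3),
  Income = rnorm(300, 40000, 8000),
  School_spending = rnorm(300, 6, 1),
  Student_teacher_ratio = rnorm(300, 16, 2),
  Graduation = rnorm(300),
  Urban = factor(rep(c("yes", "no"), 150)),
  Racial_Segregation = factor(rep(c("low", "medium", "high"), 100))
)

# Model 1: linear, with a squared term and the factor interaction
f1 <- Mobility ~ Income + School_spending + Student_teacher_ratio +
  I(Student_teacher_ratio^2) + Graduation + Urban * Racial_Segregation
mod1 <- lm(f1, data = mobility)

# Model 2: kernel regression; no hand-built squares or interactions, because
# npreg is flexible enough to pick up curvature and interactions on its own.
# mod2 <- np::npreg(Mobility ~ Income + School_spending + Student_teacher_ratio +
#                     Graduation + Urban + Racial_Segregation, data = mobility)
```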
4. Final model inference/results Use cross validation to choose among the two models. Recall that npreg
produces a CV estimate for you (find where this is, and then you can use it). For your chosen model,
explain how each measure of educational effectiveness relates to economic mobility. What might explain
these patterns? I have included code to generate a figure with Student_teacher_ratio on one axis,
School_spending on the other, and colors to correspond to predicted values. Why is it easier to use
the plot to demonstrate conclusions than the coefficients or npreg summary? Re-make your map from
the EDA section, but plot the residuals for the color. If the model explained the geographic differences
well, what sort of patterns should we (or shouldn't we) see? Are there patterns?
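A sketch of the model-comparison step. For the linear model, a leave-one-out CV score comes cheaply from the hat matrix; for npreg, the cross-validation objective is stored on the fitted bandwidth object (I believe as $bws$fval, but confirm with ?npregbw before relying on it):

```r
# Leave-one-out CV mean squared error of a fitted lm, via PRESS residuals:
# the LOO residual is the ordinary residual divided by (1 - leverage).
loo_mse <- function(fit) mean((residuals(fit) / (1 - hatvalues(fit)))^2)

# cv_lm <- loo_mse(mod1)
# cv_np <- mod2$bws$fval   # assumed location of npreg's CV objective
# choose the model with the smaller CV score
```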
Grading rubric
Words (20 / 20) The text is laid out cleanly, with clear divisions and transitions between sections and
sub-sections. The writing itself is well-organized, free of grammatical and other mechanical errors, divided
into complete sentences logically grouped into paragraphs and sections, and easy to follow from the presumed
level of knowledge.
Numbers (5 / 5) All numerical results or summaries are reported to suitable precision, and with appropriate
measures of uncertainty attached when applicable.
Pictures (15 / 15) Figures and tables are easy to read, with informative captions, axis labels and legends,
and are placed near the relevant pieces of text or referred to with convenient labels.
Code (30 / 30) The document correctly knits to html (20 points). The code is formatted and organized so that
it is easy for others to read and understand. It is indented, commented, and uses meaningful names. It only
includes computations which are actually needed to answer the analytical questions, and avoids redundancy.
Code borrowed from the notes, from books, or from resources found online is explicitly acknowledged and
sourced in the comments. Functions or procedures not directly taken from the notes have accompanying tests
which check whether the code does what it is supposed to. The text of the report is free of intrusive blocks of
code. With regards to R Markdown, all calculations are actually done in the file as it knits, and only relevant
results are shown.
Requirements (40 / 40) All directions have been followed. All required materials are incorporated into the
solution.
Content (20 / 20) Other analyses or exercises are clearly justified with sound reasoning. Statistically bogus
claims are avoided. Directions are followed in all respects. All numeric summaries are performed automatically.
No results are hard-coded into the document.
Analysis (20 / 20) Limitations from un-fixable problems are clearly noted. The substantive questions are
answered as precisely as the data and the model allow. The chain of reasoning from estimation results about
models, or derived quantities, to substantive conclusions is both clear and convincing. Contingent answers (“if
X, then Y, but if Z, then W”) are likewise described as warranted by the model and data. If uncertainties
in the data and model mean the answers to some questions must be imprecise, this too is reflected in the
discussion.
Extra credit (0 / 0) Up to fifteen points may be awarded for reports which are unusually well-written,
where the code is unusually elegant, where the analytical methods are unusually insightful, or where the
analysis goes beyond the required set of analytical questions.
Grade: 150 out of 150 possible