Assignment 2
In 2009, the state of North Carolina released to the public a large data set containing
information on births recorded in this state. This data set has been of interest to medical
researchers who are studying the relation between habits and practices of expectant
mothers and the birth of their children.
In this assignment, we will focus on studying how smoking affects the birthweight of a
newborn infant. Instead of providing the entire data set, we will work with a sample of
1,936 observations. The data set is available on Blackboard (2009Births). The following
variables were recorded:
Bmonth       Birth month
Bday         Birth day of the month
Gender       Gender of baby
Fage         Father’s age (years)
Mage         Mom’s age (years)
Feduc        Father’s education (years)
Meduc        Mother’s education (years)
TotPreg      Total number of pregnancies (number of pregnancies including current)
Visits       Pre-delivery doctor visits
Marital      Marital status (0=married, 1=unmarried)
Hispmom      Hispanic mom
Hispdad      Hispanic dad
Smokes       Mom’s smoking habits (0=nonsmokers, 1=smokers)
BirthWeight  Weight of baby at birth (grams)
a) What is the treatment variable? What is the outcome variable? How many covariates
are involved in the dataset? Is this study a randomized experiment or an
observational study?
b) Let us first visualize univariate balance of the data. Compare the histograms of
the variable "Meduc" in the treated and control groups. You may want to use
par(mfrow=c(2,1)) in R to stack the two histograms to make a clear comparison.
Also, the two histograms should have the same range of "Meduc" values to
be comparable. What can you conclude from the comparison of the two histograms?
Make the same comparison for the variables "Bmonth", "Mage", and "Hispmom" separately.
What can you conclude for those variables?
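As a minimal sketch of the stacked comparison described in (b), the code below draws two histograms with shared bin edges so that both panels cover the same range. The data frame `births` is simulated here only so the example runs on its own; it is an assumption standing in for the 2009Births sample, whose file format is not specified above.

```r
# Simulated stand-in for the 2009Births sample (hypothetical values).
set.seed(1)
n <- 500
births <- data.frame(
  Smokes = rbinom(n, 1, 0.15),
  Meduc  = sample(8:17, n, replace = TRUE)
)

# Shared breaks give both panels the same x-axis range and the same bins.
breaks <- seq(min(births$Meduc) - 0.5, max(births$Meduc) + 0.5, by = 1)

op <- par(mfrow = c(2, 1))  # stack the two panels vertically
h_control <- hist(births$Meduc[births$Smokes == 0], breaks = breaks,
                  main = "Meduc, control (Smokes = 0)", xlab = "Meduc (years)")
h_treated <- hist(births$Meduc[births$Smokes == 1], breaks = breaks,
                  main = "Meduc, treated (Smokes = 1)", xlab = "Meduc (years)")
par(op)  # restore the previous plotting layout
```

Because the group sizes differ, passing freq = FALSE to hist() (densities instead of counts) may make the two shapes easier to compare.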
c) Use a table to show some measures of balance on all covariates (all variables that
are not the treatment assignment or the outcome vectors). The measures should
include: 1. the mean and log standard deviations of each covariate in the treated
and control groups; 2. the normalized difference (∆); 3. the log ratio of standard
deviations (Γ); and 4. $\hat{\pi}_c^{0.05}$ and $\hat{\pi}_t^{0.05}$. These measures are exactly the same as
those in Table 14.4 in the textbook and our lecture notes. What can you conclude
from the table?
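The sketch below computes items 2 to 4 of (c) for one covariate at a time. It assumes the usual definitions: ∆ is the difference in means divided by the square root of the average of the two group variances, Γ is log(s_t) - log(s_c), and each coverage measure is the share of one group lying outside the 0.025 and 0.975 quantiles of the other group. The function name `balance_row`, the simulated data, and the use of strict inequalities at the quantiles are illustrative choices; check them against the lecture notes before relying on them.

```r
# Balance measures for one covariate x given a 0/1 treatment indicator w.
balance_row <- function(x, w, alpha = 0.05) {
  xc <- x[w == 0]
  xt <- x[w == 1]
  sc <- sd(xc)
  st <- sd(xt)
  probs <- c(alpha / 2, 1 - alpha / 2)
  q_t <- quantile(xt, probs, names = FALSE)  # tail cutoffs of the treated group
  q_c <- quantile(xc, probs, names = FALSE)  # tail cutoffs of the control group
  c(delta = (mean(xt) - mean(xc)) / sqrt((sc^2 + st^2) / 2),
    gamma = log(st) - log(sc),
    pi_c  = mean(xc < q_t[1] | xc > q_t[2]),  # controls outside treated range
    pi_t  = mean(xt < q_c[1] | xt > q_c[2]))  # treated outside control range
}

# Simulated stand-in data (hypothetical values, not the real sample).
set.seed(2)
n <- 400
dat <- data.frame(Smokes = rbinom(n, 1, 0.2))
dat$Meduc <- round(rnorm(n, mean = 13 - 1.5 * dat$Smokes, sd = 2))
dat$Mage  <- round(rnorm(n, mean = 27 - 2 * dat$Smokes, sd = 6))

covs <- c("Meduc", "Mage")
balance_table <- t(sapply(dat[covs], balance_row, w = dat$Smokes))
print(round(balance_table, 3))
```

Each row of `balance_table` corresponds to one covariate, so the group means and standard deviations required in item 1 can be added as further columns in the same way.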
d) Now consider the estimation of the propensity score. Start with basic covariates
"Marital" and "Meduc", use logistic regression and likelihood ratio test to select
important linear terms of the covariates for the estimation of the propensity score.
What are your selected covariates?
e) For the covariates you selected in (d), consider their second-order terms (pure
quadratic terms and interactions). Which terms are significant for the estimation of
the propensity score?
f) Now use the significant terms you selected in (d) and (e) to estimate the propensity
score via logistic regression. Provide two histograms to show the distribution of the
linearized propensity scores in the control and treated groups. From the comparison
of the two histograms, do you think the data generally have good balance between
the treated and control groups?
g) What are $e_t = \min_{i: W_i = 1} \hat{e}(X_i)$ and $e_c = \max_{i: W_i = 0} \hat{e}(X_i)$? Trim off the units with
estimated propensity scores less than et or greater than ec. Name the trimmed data
as trimmed.dat in your code. How many control and treated units are included in
the trimmed data? Provide two histograms to show the distribution of the linearized
propensity scores in the control and treated groups in the trimmed data.
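The sketch below illustrates the computations behind (f) and (g) on simulated data: the linearized propensity score is the log odds of the estimated propensity score, and trimming keeps the units whose estimated score lies between the smallest score among treated units and the largest score among control units. The model formula and the data are assumptions made so the example runs; only the name trimmed.dat comes from the question.

```r
# Simulated stand-in data (hypothetical values).
set.seed(4)
n <- 800
dat <- data.frame(
  Marital = rbinom(n, 1, 0.4),
  Meduc   = round(rnorm(n, 13, 2))
)
dat$Smokes <- rbinom(n, 1, plogis(-1.5 + 0.8 * dat$Marital -
                                    0.25 * (dat$Meduc - 13)))

fit <- glm(Smokes ~ Marital + Meduc, family = binomial, data = dat)
dat$e_hat <- unname(fitted(fit))                   # estimated propensity score
dat$lps   <- unname(predict(fit, type = "link"))   # log(e_hat / (1 - e_hat))

# Trimming bounds: minimum over treated units, maximum over control units.
e_t <- min(dat$e_hat[dat$Smokes == 1])
e_c <- max(dat$e_hat[dat$Smokes == 0])

# Drop units with e_hat below e_t or above e_c.
trimmed.dat <- dat[dat$e_hat >= e_t & dat$e_hat <= e_c, ]
print(table(trimmed.dat$Smokes))  # control (0) and treated (1) counts
```

The histograms of `lps` by group can then be drawn with hist() on a common set of breaks, once for `dat` and once for `trimmed.dat`.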
h) Use the iterative blocking method provided in class to block the trimmed data. How
many blocks do you obtain?
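The blocking method provided in class is not reproduced here. As a rough sketch of one common variant, the code below starts from a single block and repeatedly splits a block at the median of the linearized propensity score whenever the two-sample t-statistic for that score exceeds a threshold and both halves remain large enough. The function name `block_by_lps`, the thresholds `t_max`, `n_min_group`, and `n_min_block`, and the simulated inputs are all assumptions; use the values and rules from the lecture notes for the assignment itself.

```r
block_by_lps <- function(lps, w, t_max = 1, n_min_group = 3, n_min_block = 10) {
  block <- rep(1L, length(lps))
  # A candidate block must have enough units overall and in each group.
  big_enough <- function(wb) {
    length(wb) >= n_min_block &&
      sum(wb == 1) >= n_min_group && sum(wb == 0) >= n_min_group
  }
  repeat {
    changed <- FALSE
    for (b in unique(block)) {
      idx <- which(block == b)
      l   <- lps[idx]
      wb  <- w[idx]
      n1  <- sum(wb == 1)
      n0  <- sum(wb == 0)
      if (n1 < 2 || n0 < 2) next
      # Pooled-variance t-statistic for the difference in mean lps.
      s2 <- ((n1 - 1) * var(l[wb == 1]) + (n0 - 1) * var(l[wb == 0])) /
        (n1 + n0 - 2)
      if (s2 == 0) next
      t_stat <- (mean(l[wb == 1]) - mean(l[wb == 0])) /
        sqrt(s2 * (1 / n1 + 1 / n0))
      if (abs(t_stat) <= t_max) next       # block already balanced enough
      upper <- l >= median(l)              # split at the block median
      if (big_enough(wb[!upper]) && big_enough(wb[upper])) {
        block[idx[upper]] <- max(block) + 1L
        changed <- TRUE
      }
    }
    if (!changed) break                    # stop when no block was split
  }
  block
}

# Simulated linearized scores and treatment indicator (hypothetical values).
set.seed(5)
lps   <- rnorm(600)
w     <- rbinom(600, 1, plogis(lps))
block <- block_by_lps(lps, w)
print(table(block, w))
```

The block labels returned are arbitrary integers, so they are not ordered by propensity score; the number of blocks is length(unique(block)).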
i) Use a table to show the comparison of the values of normalized difference (∆) for
the full data, the trimmed data, and the blocked data (similar to Table 17.1 in the
textbook and lecture slides). Does trimming or blocking improve the balance of the
data?
j) For your blocked data obtained in (h), display the z-values
$z_k(j) = [\bar{X}_{t,k}(j) - \bar{X}_{c,k}(j)] / \sqrt{S_k^2(j) [1/N_c(j) + 1/N_t(j)]}$
in a Q-Q plot, where j denotes the jth block and k represents the kth covariate.
k) Provide the minimum, maximum, and standard deviation of the weights in the
Horvitz-Thompson and subclassification estimators for the blocked data. You can
arrange your results in a table like Table 17.8 in the textbook and lecture notes.
l) Provide estimates of the treatment effect using the Horvitz-Thompson and subclassification
estimators. Conduct a hypothesis test based on the subclassification
estimator and conclude whether smoking affects the birthweight of a newborn infant.
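A minimal sketch of the two estimators in (l) follows, on simulated data. The subclassification estimate is taken as the block-size-weighted average of within-block differences in means, with a variance built from the within-block sample variances. The Horvitz-Thompson estimate is written in its normalized form, where the inverse-probability weights are rescaled to sum to one in each group. The outcome model, the use of propensity score quintiles as blocks, and the two-sided normal test are all assumptions for illustration; the lecture notes may define the estimators slightly differently.

```r
# Simulated stand-in data with a known effect of -200 grams (hypothetical).
set.seed(7)
n <- 1000
x <- rnorm(n)
w <- rbinom(n, 1, plogis(-1 + x))
y <- 3400 - 200 * w + 150 * x + rnorm(n, sd = 400)

e_hat <- fitted(glm(w ~ x, family = binomial))

# Quintiles of e_hat stand in for the blocks obtained by iterative blocking.
block <- cut(e_hat, quantile(e_hat, 0:5 / 5), include.lowest = TRUE,
             labels = FALSE)

# Within-block difference in means and its estimated sampling variance.
by_block <- do.call(rbind, lapply(split(data.frame(y, w), block), function(d) {
  yt <- d$y[d$w == 1]
  yc <- d$y[d$w == 0]
  c(n   = nrow(d),
    tau = mean(yt) - mean(yc),
    var = var(yt) / length(yt) + var(yc) / length(yc))
}))

q       <- by_block[, "n"] / n                # block shares N_j / N
tau_sub <- sum(q * by_block[, "tau"])         # subclassification estimate
se_sub  <- sqrt(sum(q^2 * by_block[, "var"]))
z_stat  <- tau_sub / se_sub
p_value <- 2 * pnorm(-abs(z_stat))            # two-sided test of zero effect

# Normalized Horvitz-Thompson estimate.
tau_ht <- sum(w * y / e_hat) / sum(w / e_hat) -
  sum((1 - w) * y / (1 - e_hat)) / sum((1 - w) / (1 - e_hat))

print(c(tau_sub = tau_sub, se_sub = se_sub, p_value = p_value, tau_ht = tau_ht))
```

With the real data, `block` would come from the blocking step in (h) and `e_hat` from the propensity score model selected in (d) to (f).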
m) Provide the estimated bias, sampling variance, and MSE for the Horvitz-Thompson
and subclassification estimators for the blocked data. You can arrange your results
in a table like Table 17.10 in the textbook and lecture notes.