BUEC 333, R Lab 2: Returns to Education and College
Proximity as an IV
Due date November 28 or 29 in your tutorial
In this exercise you will replicate the results in the paper by Card (1995), Using Geographic Variation in
College Proximity to Estimate the Return to Schooling, found at link.
We will use the data Card used to look at the returns to education. This dataset contains information on a
sample of 3010 working men aged between 24 and 34 who were part of the 1976 wave of the US National
Longitudinal Survey of Young Men; it is commonly used to replicate estimates of earnings equations.
It includes the log of hourly wages in 1976 (lwage), years of education (educ),
years of labour market experience (exper), the square of years of labour market experience (expersq), and other
useful variables, which are introduced below.
Using this data, Card (1995) uses a dummy variable that indicates if a man grew up close to a 4-year college
(nearc4) as an instrumental variable for years of schooling. He also includes other covariates like experience,
experience squared, a black indicator, southern and northern indicators, and regional and urban indicators
for 1966. His IV estimate of the returns to education is 13.2%, while the OLS estimate is 7.5%, so the IV
estimate is larger than the OLS one. One interpretation is that
education is measured with error (for example, people misreported their education level), which biases the
OLS estimate toward zero.
Another interpretation is that nearc4 is not a valid instrument because it is not exogenous
in the wage equation: nearc4 could be correlated with the error term in the wage
regression (when regressing wage on education, the error term could include something that is correlated
with nearc4). This seems plausible if the error term contains ability or IQ. We will look at this case.
We will show that the instrumental variable, “nearc4”, is actually correlated with “IQ,” at least for the subset
of men for whom an IQ score is reported. However, the correlation between “nearc4” and “IQ” is close to zero
once other covariates are netted out. By close to zero we mean it is not statistically different from zero. In
other words, “nearc4” fails the exogeneity requirement in a simple regression model but it passes if controls
are added to the wage equation.
1. The data we will use is called “Card.dta”. To get the data, go to R Studio, then go to File –> Import
Dataset –> From Stata..., find the folder where you saved the data, and import it into
R. If that doesn’t work, open a new R script (Ctrl+Shift+N) and type the commands in the R script.
## keep only the observations for which IQ is reported
CardIQ <- Card %>% filter(!is.na(IQ))
## extract specific columns from CardIQ
CardIQ <- CardIQ[,c(4,12:20,22,24,25,29,32,35)]
## create the correlation matrix
corrCard <- cor(CardIQ)
## create correlation plot
corrplot(corrCard, type="upper", col=c("black", "white"), bg="lightblue")
[Figure: correlation plot (upper triangle) of educ, reg661 to reg669, black, south, smsa66, IQ, exper and resids; colour scale from −1 to 1.]
If you look at the column called “resids,” you can see that not all squares are blue (which would mean zero
correlation with the variables on the diagonal). What can you say about omitted variables and omitted
variable bias?
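The idea behind this question can be sketched on simulated data. Everything in the block below is made up for illustration (the data frame sim, the coefficients, the share of missing IQ scores); it is not the Card data and it is not the solution to this step. It only shows, under those assumptions, how residuals from a regression that omits ability stay correlated with an IQ-like variable.

```r
# Simulated stand-in for Card.dta; every number here is invented.
set.seed(333)
n <- 1000
ability <- rnorm(n)                       # unobserved in a real data set
sim <- data.frame(
  educ = 12 + ability + rnorm(n),
  IQ   = 100 + 10 * ability + rnorm(n, sd = 5)
)
sim$lwage <- 5 + 0.08 * sim$educ + 0.2 * ability + rnorm(n, sd = 0.3)
sim$IQ[sample(n, 300)] <- NA              # IQ is missing for some men

# With the real file one would import it first, for example
# Card <- haven::read_dta("Card.dta")     # assumes the haven package is installed

reg0 <- lm(lwage ~ educ, data = sim)      # ability is omitted on purpose
sim$resids <- residuals(reg0)             # lwage and educ have no NAs, so lengths match

simIQ   <- sim[!is.na(sim$IQ), ]          # base-R version of filter(!is.na(IQ))
corrSim <- cor(simIQ[, c("educ", "IQ", "resids")])
print(round(corrSim, 2))
```

In this simulation the residuals are correlated with IQ because ability was left out of the regression, which is the pattern the correlation plot is meant to reveal.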
From now on, I will write W for the vector of covariates: exper, expersq (the square of exper), black, south, reg661,
reg662, reg663, reg664, reg665, reg666, reg667, reg668, smsa66.
4. Regress lwage on educ and W, i.e. reg1 <- lm(lwage ~ educ + W, data=Card) where you replace
the W by the names of the covariates (see above). Then generate a table with the results of the
regression, i.e. summary(reg1). Compare your results with those in Table 1, Column (1) at the end
of this document. Now look at Table 2 in Card (1995). What is the Column corresponding to this
regression?
(i) Save the residuals from this regression, subset your data to include only those observations for which IQ
is observed, save the correlation matrix, and then plot the correlation matrix with these new residuals.
What can you conclude about omitted variables now?
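Typing thirteen covariate names into every formula is error-prone. One optional shortcut, sketched below, is to store the names once and build the formula with reformulate() from base R. The object names W and f1 are illustrative, and the commented-out lm() call assumes that Card has already been imported as in step 1.

```r
# Covariate names as listed in the definition of W above
W <- c("exper", "expersq", "black", "south",
       paste0("reg66", 1:8), "smsa66")

# Build lwage ~ educ + exper + ... + smsa66 without typing it out
f1 <- reformulate(c("educ", W), response = "lwage")
print(f1)

# reg1 <- lm(f1, data = Card)   # assumes Card is loaded
# summary(reg1)
```

Writing the formula out by hand, as the step asks, gives exactly the same regression; the helper only saves typing in later steps.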
5. Obtain the 95% confidence interval for the coefficient on educ. Do this using R, not by hand.
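A minimal sketch of how a confidence interval can be pulled from a fitted lm object with confint(), again on simulated data (the data frame sim and its coefficients are assumptions made for the example). The second half recomputes the same interval from the estimate, the standard error and the t quantile, to show what confint() does.

```r
set.seed(333)
n <- 400
sim <- data.frame(educ = rnorm(n, mean = 13, sd = 2.5))
sim$lwage <- 5 + 0.08 * sim$educ + rnorm(n, sd = 0.4)   # invented coefficients

reg <- lm(lwage ~ educ, data = sim)
ci  <- confint(reg, "educ", level = 0.95)
print(ci)

# Same interval by formula: estimate +/- t quantile * standard error
est    <- coef(summary(reg))["educ", "Estimate"]
se     <- coef(summary(reg))["educ", "Std. Error"]
manual <- est + c(-1, 1) * qt(0.975, df = df.residual(reg)) * se
print(manual)
```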
6. Estimate the regression for educ by regressing educ on W and nearc4, i.e. reg2 <- lm(educ ~ W +
nearc4, data=Card) Do educ and nearc4 have a statistically significant correlation? What can you
say now about the validity of the instrument versus your answer from part 3(v)? Compare your result
to Table 1, Column (2) at the end of this document. Then look at Table 3 in Card (1995). What is the
corresponding Column for this regression?
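To judge whether the instrument is relevant, one can read the row for nearc4 from the first-stage coefficient table. The sketch below does this on simulated data with a single control; the data frame sim and all coefficients are invented. It also shows that, with one instrument, the F statistic for excluding the instrument equals the squared t statistic.

```r
set.seed(333)
n <- 1000
sim <- data.frame(nearc4 = rbinom(n, 1, 0.7),
                  exper  = runif(n, 0, 20))
sim$educ <- 16 - 0.4 * sim$exper + 0.35 * sim$nearc4 + rnorm(n, sd = 2)

fs <- lm(educ ~ exper + nearc4, data = sim)      # first stage
nearc4_row <- coef(summary(fs))["nearc4", ]      # estimate, SE, t value, p-value
print(nearc4_row)

# F statistic for dropping the single instrument = t statistic squared
Fstat <- unname(nearc4_row["t value"]^2)
print(Fstat)
```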
7. Estimate the log(wage) equation by two stage least squares using nearc4 as an instrumental variable for
educ. You should do this in two ways:
(i) First, use part 6 to get the predicted educ, and then regress log(wage) on pred_educ and W. To get
the predicted education you should write:
• pred_educ <- reg2$fitted.values
• reg3 <- lm(lwage ~ pred_educ + W, data = ...)
See Table 1, Column (3) at the end of this document.
(ii) Second, use the ivreg command. In its formula, the regressors go before the | and the exogenous covariates
together with the instrument go after it: reg4 <- ivreg(lwage ~ educ + W | W + Z,
data = ... ), where Z is the instrument. You can then see the results of your regression as a table:
summary(reg4).
See Table 1, Column (4) at the end of this document. Look at Table 3 in Card (1995). What is the Column
corresponding to this regression?
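The two routes in this step give the same coefficient on education but not the same standard errors, which is also visible in Table 1: columns (3) and (4) share their point estimates while some standard errors differ slightly. The sketch below illustrates this on simulated data (sim, ability and all coefficients are invented). It compares the hand-made second stage with the matrix IV formula in base R, and calls ivreg() only if the AER package happens to be installed, which is an assumption about your setup.

```r
set.seed(333)
n <- 2000
ability <- rnorm(n)                                # unobserved confounder
sim <- data.frame(nearc4 = rbinom(n, 1, 0.7),
                  exper  = runif(n, 0, 20))
sim$educ  <- 13 + 0.8 * sim$nearc4 + ability + rnorm(n)
sim$lwage <- 4.5 + 0.10 * sim$educ + 0.03 * sim$exper +
             0.3 * ability + rnorm(n, sd = 0.3)

# (i) two stages by hand
stage1 <- lm(educ ~ exper + nearc4, data = sim)
sim$pred_educ <- fitted(stage1)
stage2 <- lm(lwage ~ pred_educ + exper, data = sim)

# (ii) same point estimates from the just-identified IV formula (Z'X)^{-1} Z'y
X <- cbind(1, sim$educ,   sim$exper)
Z <- cbind(1, sim$nearc4, sim$exper)
b_iv <- solve(t(Z) %*% X, t(Z) %*% sim$lwage)

print(c(by_hand = unname(coef(stage2)["pred_educ"]), iv_formula = b_iv[2]))

# Optional cross-check if AER is available
if (requireNamespace("AER", quietly = TRUE)) {
  print(coef(AER::ivreg(lwage ~ educ + exper | exper + nearc4, data = sim))["educ"])
}
```

The standard errors reported by the hand-made second stage treat pred_educ as ordinary data, so they should not be used for inference; ivreg computes them from the residuals based on the actual educ.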
8. Obtain a 95% confidence interval for the coefficient on pred_educ, and compare it with the 95%
interval from part 5.
9. Now regress educ on W and on both nearc2 and nearc4. Which one of nearc2 and nearc4 is more
strongly correlated with educ? How does the R² of this regression compare to the one from part 6?
Compare your results to Table 1, Column (5) at the end of the document.
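A sketch of how the two first stages can be compared, using simulated data in which nearc2 is deliberately the weaker instrument (the data frame sim and every coefficient are assumptions for illustration only). It extracts R² from both fits, runs an F test for the added instrument, and prints the rows for the two instruments.

```r
set.seed(333)
n <- 1500
sim <- data.frame(nearc2 = rbinom(n, 1, 0.4),
                  nearc4 = rbinom(n, 1, 0.7),
                  exper  = runif(n, 0, 20))
sim$educ <- 16 - 0.4 * sim$exper + 0.10 * sim$nearc2 +
            0.35 * sim$nearc4 + rnorm(n, sd = 2)

fs_one <- lm(educ ~ exper + nearc4, data = sim)
fs_two <- lm(educ ~ exper + nearc2 + nearc4, data = sim)

r2 <- c(one = summary(fs_one)$r.squared,
        two = summary(fs_two)$r.squared)
print(round(r2, 4))

print(anova(fs_one, fs_two))                             # F test for adding nearc2
print(coef(summary(fs_two))[c("nearc2", "nearc4"), ])    # compare the t values
```

R² can never fall when a regressor is added, so the size of the increase and the t statistics are the informative quantities.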
10. Now use both nearc2 and nearc4 as instruments for educ. That is, regress log wage on educ and W, using
nearc2 and nearc4 as instruments for educ. With more than one instrument, list them all after the |: reg5 <-
ivreg(lwage ~ educ + W | W + nearc2 + nearc4, data = ... ).
Compare your results to Table 1, Column (6) at the end of the document.
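With two instruments for one endogenous regressor the model is overidentified, and the simple IV formula no longer applies directly. The standard two stage least squares estimator can be written as below. The notation is an assumption introduced here, not taken from the handout: y is lwage, X collects a constant, educ and W, and Z collects a constant, W, nearc2 and nearc4.

```latex
\hat{\beta}_{2SLS} = \left( X' P_Z X \right)^{-1} X' P_Z \, y,
\qquad
P_Z = Z \left( Z'Z \right)^{-1} Z'
```

Here P_Z X is the matrix of first-stage fitted values, so the formula is the matrix version of regressing lwage on predicted educ and W.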
11. For a subset of the men, we observe an IQ score. Regress IQ on nearc4. Is IQ correlated with nearc4?
12. Now regress IQ on nearc4 and smsa66, reg661, reg662, and reg669. Are IQ and nearc4 partially correlated?
What do you conclude about the importance of controlling for the 1966 location and regional dummies
in the log(wage) equation when using nearc4 as an IV for educ?
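The difference between a simple and a partial correlation can be sketched with simulated data. In the block below, which is entirely invented and not the Card data, IQ depends only on living in a city (smsa66), and colleges are more common in cities. The instrument is then correlated with IQ in a simple regression, but the coefficient shrinks toward zero once smsa66 is controlled for.

```r
set.seed(333)
n <- 5000
smsa66 <- rbinom(n, 1, 0.65)
nearc4 <- rbinom(n, 1, 0.3 + 0.5 * smsa66)       # colleges are more common in cities
IQ     <- 95 + 10 * smsa66 + rnorm(n, sd = 15)   # IQ depends on location only
sim <- data.frame(IQ, nearc4, smsa66)

simple  <- lm(IQ ~ nearc4, data = sim)            # no controls
partial <- lm(IQ ~ nearc4 + smsa66, data = sim)   # location netted out

print(coef(summary(simple))["nearc4", ])
print(coef(summary(partial))["nearc4", ])
```

Whether the real data behave this way is what the last two steps ask you to check.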
Table 1: BUEC333 Fall 2017: Irene’s Replication Results

              Dependent variable:
              lwage      educ       lwage      lwage      educ       lwage
              OLS        OLS        OLS        IV         OLS        IV
              (1)        (2)        (3)        (4)        (5)        (6)
educ          0.076∗∗∗                         0.156∗∗∗              0.174∗∗∗
              (0.004)                          (0.053)               (0.051)
pred_educ                           0.156∗∗∗
                                    (0.053)
exper         0.039∗∗∗   −0.400∗∗∗  0.071∗∗∗   0.071∗∗∗   −0.400∗∗∗  0.078∗∗∗
              (0.002)    (0.009)    (0.021)    (0.021)    (0.009)    (0.021)
black         −0.188∗∗∗  −0.915∗∗∗  −0.117∗∗   −0.117∗∗   −0.924∗∗∗  −0.100∗∗
              (0.018)    (0.094)    (0.051)    (0.052)    (0.094)    (0.050)
south         −0.176∗∗∗  −0.113     −0.166∗∗∗  −0.166∗∗∗  −0.103     −0.164∗∗∗
              (0.026)    (0.135)    (0.029)    (0.029)    (0.135)    (0.030)
reg661        −0.125∗∗∗  −0.229     −0.109∗∗   −0.109∗∗   −0.187     −0.105∗∗
              (0.039)    (0.203)    (0.044)    (0.044)    (0.204)    (0.045)
reg662        −0.020     −0.273∗    −0.0002    −0.0002    −0.253∗    0.004
              (0.029)    (0.148)    (0.033)    (0.034)    (0.148)    (0.035)
reg663        0.025      −0.231     0.045      0.045      −0.182     0.049
              (0.028)    (0.143)    (0.033)    (0.033)    (0.146)    (0.034)
reg664        −0.071∗    −0.114     −0.061     −0.061     −0.058     −0.059
              (0.036)    (0.186)    (0.039)    (0.040)    (0.190)    (0.041)
reg665        0.022      −0.442∗∗   0.059      0.059      −0.396∗∗   0.068
              (0.037)    (0.188)    (0.046)    (0.047)    (0.190)    (0.047)
reg666        0.038      −0.451∗∗   0.079      0.079      −0.440∗∗   0.089∗
              (0.041)    (0.209)    (0.052)    (0.052)    (0.210)    (0.053)
reg667        0.027      −0.359∗    0.060      0.060      −0.309     0.067
              (0.040)    (0.205)    (0.048)    (0.048)    (0.208)    (0.049)
reg668        −0.182∗∗∗  0.290      −0.202∗∗∗  −0.202∗∗∗  0.360      −0.206∗∗∗
              (0.047)    (0.242)    (0.052)    (0.053)    (0.246)    (0.054)
smsa66        0.108∗∗∗   0.255∗∗∗   0.077∗∗∗   0.077∗∗∗   0.229∗∗    0.069∗∗∗
              (0.016)    (0.087)    (0.027)    (0.027)    (0.089)    (0.027)
nearc4                   0.347∗∗∗                         0.347∗∗∗
                         (0.088)                          (0.088)
nearc2                                                    0.125
                                                          (0.078)
Constant      4.953∗∗∗   16.933∗∗∗  3.598∗∗∗   3.598∗∗∗   16.857∗∗∗  3.282∗∗∗
              (0.068)    (0.161)    (0.905)    (0.911)    (0.167)    (0.875)
Observations  3,010      3,010      3,010      3,010      3,010      3,010
R²            0.276      0.475      0.167      0.156      0.475      0.093
Adjusted R²   0.273      0.472      0.163      0.152      0.473      0.089

Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01. Standard errors in parentheses. IV = instrumental variable.