ESB - Graded Exercise 2019
Question 1 (25 points)
Answer the following questions. Be concise and to the point.
(a) Suppose you estimate the gender difference in returns to education using the
following model:
$\log(\mathit{wage}) = (\beta_0 + \delta_0\,\mathit{female}) + (\beta_1 + \delta_1\,\mathit{female})\,\mathit{educ} + u$
where wage is the hourly wage, female is a gender dummy equal to 1 if the
individual is female, and educ is the number of years of education. Interpret the
case where $\delta_0 < 0$ and $\delta_1 < 0$. [5 points]
(b) Someone asserts that expected wages are the same for men and women who have the
same level of education. Referring to the model in part (a), what would be your null
hypothesis to test this claim? How would you test it? [5 points]
(c) Suppose your estimation returns the following values for the model from part
(a): $\hat{\delta}_0 = -0.1$, $\hat{\delta}_1 = -0.01$. Based on this, what is the expected wage differential
between a man and a woman with 10 years of schooling? [5 points]
(d) Suppose you find in addition that $\hat{\beta}_1 = 0.01$. What does this imply about the effect of 5
more years of education on the expected wage of a woman? [5 points]
(e) Suppose we have estimated the following wage equation:
$W = 10 + 10\,\mathit{AGE} - 0.1\,\mathit{AGE}^2 + \epsilon$
Based on this, at what age would we expect the highest wage? [5 points]
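For reference, the model in part (a) corresponds to a regression with a full interaction between female and educ. The following is a minimal sketch in R on simulated data. The sample size, the data frame name sim and all parameter values are made up for illustration and are unrelated to the estimates quoted in parts (c) and (d).

```r
# Simulated data; every number below is an illustrative assumption.
set.seed(1)
n <- 500
sim <- data.frame(
  female = rbinom(n, 1, 0.5),               # gender dummy, 1 = female
  educ   = sample(8:18, n, replace = TRUE)  # years of education
)
sim$lwage <- 1.5 - 0.2 * sim$female + 0.08 * sim$educ -
  0.02 * sim$female * sim$educ + rnorm(n, sd = 0.3)

# female * educ expands to female + educ + female:educ, so the fitted
# coefficients map to beta0 (Intercept), delta0 (female), beta1 (educ)
# and delta1 (female:educ).
fit <- lm(lwage ~ female * educ, data = sim)
print(coef(fit))
```

The formula shorthand keeps the intercept shift and the slope shift for women in a single regression, which is the form the sub-questions above refer to.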
Question 2 (25 points)
Consider the dataset ets_thres_final.dta. It contains emission figures (lnco2 = log of CO2
emissions) for a sample of firms regulated by the European Emissions Trading System
(EU ETS) for the years 2005 to 2017. The firm identifiers have gone missing
from the dataset. Note that an Emissions Trading System requires firms to buy permits for
every unit of CO2 they emit. By restricting the total number of permits issued,
governments can control the total amount of emissions, while free trade in permits
means that permits end up with those businesses that find it hardest to reduce emissions.
In the early days of the EU ETS (which started in 2005) permits were given to firms for free.
This changed from 2013 onwards, when free permits were only given to certain firms and sectors
that were deemed at risk from foreign competition. The variable free indicates those firms in
the dataset. According to economic theory, the method of permit allocation should have no
effect on the eventual emissions of firms (Independence hypothesis): firms that have been
given free permits still have an incentive to reduce emissions, as doing so frees up permits to sell
in the permit market.
(a) Examine this hypothesis by running a regression of lnco2 on the free variable. Report
what you find. [5 points]
(b) Provide an interpretation of the regression coefficient along with a discussion of the
implications of your result. [5 points]
(c) The variable period is a categorical variable equal to 1 for observations from before
2013 and equal to 2 for observations from year 2013 onward. Convert it into a factor
variable and run a regression of lnco2 on period. Provide an interpretation of the
estimated coefficients. [5 points]
(d) Would you say your results in part (a) provide a causal estimate of the effect of free
permits? [5 points]
(e) With the data at hand, can you propose an alternative regression
approach that might address some of the concerns raised in (d)? If yes, implement this
regression and discuss its results. What does the result tell you about the
Independence hypothesis discussed in the introduction? [5 points]
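Part (c) asks for a factor conversion before the regression. A minimal sketch of the mechanics in R, using a made-up data frame toy rather than ets_thres_final.dta (the lnco2 values here are invented):

```r
# Toy data; not taken from ets_thres_final.dta.
toy <- data.frame(
  period = c(1, 1, 1, 2, 2, 2),
  lnco2  = c(10.2, 9.8, 10.0, 9.5, 9.9, 9.4)
)

# factor() turns the numeric codes into categories; lm() then creates a
# dummy for every level except the first (the reference level).
toy$period <- factor(toy$period)
fit <- lm(lnco2 ~ period, data = toy)
print(coef(fit))
```

Without the conversion, lm() would treat period as a continuous number. With only two values the fit is the same, but the factor form generalises to variables with more categories.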
Question 3 (25 points)
For this question use the dataset hals1prep.dta, which contains data from the UK Health and
Lifestyle Survey (1984-85). In this survey, several thousand people in the UK were
asked questions about their health and lifestyle.
(a) The variable bmi records the body mass index (BMI) of the respondents. The BMI
combines weight and height to indicate whether a person's weight is healthy or whether
they are overweight. A value between 18.5 and 24.9 indicates a healthy weight. Based on the
information below, which region of the UK had, on average, the most overweight
population? Run a regression of BMI on regional categories (recorded in the variable
region). Use this to work out which UK regions are on average outside the healthy
BMI range. [5 points]
(b) The variable ownh_num records responses to the question “Would you say that for
someone of your age your own health in general is…”, where respondents had the following
response options:
• Excellent (1)
• Good (2)
• Fair (3)
• Poor (4)
The numbers in brackets indicate how these options were recorded in the ownh_num
variable. Run a regression of ownh_num on bmi and discuss what
you find. Is it in line with your expectations? [5 points]
(c) Can you think of at least two reasons why the estimate in (b) does not provide a correct
representation of the causal relationship between bmi and health? [5 points]
(d) The dataset includes several additional control variables. These include
• incomeB – a categorical variable representing income brackets, where “1”
represents the lowest and “12” the highest income group.
• agyrs – a variable recording the age of the participant
Include these in the regression of reported health from (b). Discuss what the output
suggests about the relationships between health and age, and between health and income. Are
they in line with what you would have expected? In each case, can you provide an
explanation for the kind of relationship found?
Also discuss the usefulness of including the age and income controls for
estimating the causal effect of BMI. For each control, discuss at least one reason for and
one reason against including it. [5 points]
(e) Consider the R output below. It builds a new dataframe as a transformation of the
dataframe halsx, which holds the health survey data. ownh_num is defined as in (b). Can you
provide an interpretation for the coefficients of the linear regression reported at the
end of the R output? Note that the rbind() command combines dataframes vertically. [5
points]
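library(haven)  # provides read_dta() for Stata .dta files

halsx <- read_dta("../data/hals1prep.dta")
labels <- c("excellent", "good", "fair", "poor")
for (i in 1:4) {
  fr <- halsx                          # fresh copy of the full survey data
  fr["dum"] <- fr$ownh_num == i        # TRUE where the recorded answer equals i
  fr["label"] <- labels[i]             # same label on every row of this copy
  if (i == 1) {
    longframe <- fr
  } else {
    longframe <- rbind(longframe, fr)  # append this copy below the earlier ones
  }
  print(nrow(longframe))
}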
## [1] 8971
## [1] 17942
## [1] 26913
## [1] 35884
summary(lm(dum~label,longframe))
##
## Call:
## lm(formula = dum ~ label, data = longframe)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.50864 -0.23141 -0.20622  0.08254  0.94627
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.206220   0.004231   48.74  < 2e-16 ***
## labelfair    0.025192   0.005984    4.21 2.56e-05 ***
## labelgood    0.302419   0.005984   50.54  < 2e-16 ***
## labelpoor   -0.152491   0.005984  -25.48  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4007 on 35880 degrees of freedom
## Multiple R-squared: 0.1436, Adjusted R-squared: 0.1435
## F-statistic: 2005 on 3 and 35880 DF, p-value: < 2.2e-16
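The note on rbind() in part (e) of Question 3 can be illustrated with two small made-up data frames (a and b are hypothetical names, not objects from the exam code):

```r
# Two toy data frames with identical columns.
a <- data.frame(id = 1:2, label = "excellent")
b <- data.frame(id = 1:2, label = "good")

# rbind() appends the rows of b below the rows of a; columns are matched
# by name, so the result has nrow(a) + nrow(b) rows.
stacked <- rbind(a, b)
print(stacked)
print(nrow(stacked))
```

This is why the row count printed inside the loop grows by the same amount on every iteration.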
Question 4 (25 points)
Air pollution has been shown to have a variety of adverse health effects. Recently,
researchers have also started to investigate other negative effects. Below we report regression
tables from a study that investigates a link between air pollution and car accidents.
(a) Can you suggest a causal mechanism that might explain why air pollution could have
an effect on car accidents? [5 points]
(b) Table 3 below, extracted from an academic paper, reports various regressions of the
log number of accidents per day across geographic grid cells for the UK over the period
from 2009 to 2014. Column 6 reports a simple OLS regression of accidents on
pollution concentration (measured in micrograms of PM per cubic meter). Can you
think of reasons why this might not be a valid estimate of the causal impact? [5
points]
(c) Column 7 of Table 3 in sub-question (b) repeats the same regression, adding various
variables measuring weather conditions as well as region dummies interacted with year, month
and day-of-the-week fixed effects. Would you say this provides a better
estimate of the causal effect of pollution? Could it also lead to a worse estimate? [5
points]
(d) The study proposes an instrument for pollution derived from a weather phenomenon
known as temperature inversion. Temperature inversion occurs from time to time
when a layer of warmer air sits on top of colder air nearer to the ground. As a
consequence, pollution is trapped near the ground and cannot easily escape. Thus, all
else equal, pollution will be more severe near the ground when this happens.
Meteorological studies suggest that the phenomenon is driven by wider movements in
the atmosphere and crucially is not itself driven by local pollution. Table 2 reports
regressions of the pollution variable from Table 3 on a binary variable that is equal to
1 if a temperature inversion is occurring in a particular area at a particular time.
Discuss what this table is telling us. [5 points]
(e) Columns 1 to 3 of Table 3 in sub-question (b) report two-stage least squares regressions
using temperature inversion as an instrument. Discuss whether this provides a better
estimate of the causal effect of pollution on accidents. Can you comment on the
relative size of the coefficients in columns 1 and 6? Are they in line with
what you would expect? [5 points]
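The two-stage least squares logic referred to in parts (d) and (e) can be sketched by hand in R on simulated data. Everything here is an assumption made for illustration: the variable names (inversion, pollution, accidents, confounder), the data-generating process and the coefficient values are invented and are not taken from the study or its tables. Running the two stages manually reproduces the 2SLS point estimate, but the standard errors reported by the second lm() are not valid; a dedicated IV routine would be needed for inference.

```r
# Simulated data; the design and all numbers are illustrative assumptions.
set.seed(42)
n <- 5000
confounder <- rnorm(n)                 # unobserved driver of both variables
inversion  <- rbinom(n, 1, 0.3)        # instrument: 1 if an inversion occurs
pollution  <- 1 + 2 * inversion + confounder + rnorm(n)
accidents  <- 0.5 * pollution - confounder + rnorm(n)  # simulated effect: 0.5

# Naive OLS is biased in this simulation because confounder is omitted.
ols <- lm(accidents ~ pollution)

# Stage 1: regress the endogenous variable on the instrument.
stage1 <- lm(pollution ~ inversion)
pollution_hat <- fitted(stage1)

# Stage 2: regress the outcome on the stage-1 fitted values.
stage2 <- lm(accidents ~ pollution_hat)

print(coef(ols)["pollution"])
print(coef(stage2)["pollution_hat"])
```

The direction and size of the OLS bias in this sketch follow from the invented data-generating process and carry no information about the tables discussed in this question.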