MGTF-415 Homework 1: Real Estate
This assignment builds on Lectures #2 and #3. Students should upload their assignments to Canvas by 11:59pm on October 22. Save your assignment as a .pdf file, and include your code in the same file as your homework assignment. If you are using Python with a Jupyter notebook, recall that you can install new modules by inserting the following commands at the beginning of your notebook.
import sys
!{sys.executable} -m pip install your_module_name
1 House Price Dynamics
In this exercise, you will study how the riskiness of returns on residential real estate varies with distance to the city center. As in lecture, you will perform this analysis using the FHFA All-Transactions Price Index, which covers single-family homes. If you are using Python, you will likely need to import the module pandas.
(a) Write a command that will download and read a .csv file of price returns by census tract (i.e. submarket) and year from the FHFA’s website. Use the same link as in lecture.
(b) Filter your dataframe such that the first year is 1990. Then, convert the data type of the variable annual_change to a float.
(c) Read the crosswalk file tract-metro-crosswalk.csv and create a new dataframe consisting of a list of census tracts in San Diego. Note that San Diego’s metro code is 41740.
(d) Read the auxiliary dataset tract-centrality.csv and drop observations with missing data on distance to the city center.
(e) Merge the dataframe on centrality with the dataframe containing the list of San Diego census tracts, using an inner join. Should we use an inner join if we wanted to include non-San Diego census tracts in our analysis?
(f) After performing the merge, sort census tracts into 5 quintiles based on their distance to San Diego’s city center. In lecture, we made an analogy to the stock market by calling census tracts “stocks”: if there is substantial house price variation within census tracts, is this analogy still appropriate? What would be the more appropriate analogue to a “stock” if there is substantial house price variation within census tracts?
(g) As in lecture, calculate the average annual house price return for each quintile. Then, calculate the standard deviation of this return for each quintile. If you are using Python, this can be accomplished through a command of the form:
dfSD.groupby('quintile')['annual_change'].std()
(h) What is the relationship between average return and distance to the city center? What is the relationship between standard deviation of return and distance to the city center?
(i) Investors typically are risk-averse. Given risk aversion, why does the pattern you observed in (h) make sense?
(j) A commonly-used measure of risk-adjusted return is the ratio of average return to standard deviation (i.e. volatility) of return. This ratio is sometimes called the Sharpe ratio. Calculate the Sharpe ratio for each of the five quintiles. Why does this ratio capture “risk-adjusted” return?
(k) What is the relationship between the Sharpe ratio and distance to the city center?
(l) One interpretation of the result in (k) is that it is more profitable to invest in the downtown relative to the suburbs on a risk-adjusted basis. However, in lecture we mentioned that price growth is one component of total return. What is the other component? Suppose we had data on this other component. How could we use that information to assess whether it is indeed more profitable to invest in the downtown?
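The mechanics of parts (b) through (j) can be sketched on synthetic data. The block below is only an illustration, not the lecture's code: the column names tract, cbsa, distance, and year are assumptions (only annual_change, quintile, dfSD, and the metro code 41740 come from the assignment), the numbers are randomly generated, and the ratio follows the definition in part (j), i.e. average return divided by standard deviation of return.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins for the three input files; column names are assumptions.
n_tracts = 50
tracts = np.arange(1, n_tracts + 1)
crosswalk = pd.DataFrame({
    "tract": tracts,
    "cbsa": np.where(tracts <= 40, 41740, 31080),  # 41740 = San Diego
})
centrality = pd.DataFrame({
    "tract": tracts,
    "distance": rng.uniform(0.0, 30.0, size=n_tracts),
})
centrality.loc[[3, 17], "distance"] = np.nan  # two tracts with missing distance

years = np.arange(1985, 2020)
returns = pd.DataFrame({
    "tract": np.repeat(tracts, len(years)),
    "year": np.tile(years, n_tracts),
    # Stored as strings to mimic a column that must be converted to float
    "annual_change": rng.normal(4.0, 8.0, size=n_tracts * len(years)).round(2).astype(str),
})

# (b) Keep 1990 onward and convert returns to float
returns = returns[returns["year"] >= 1990].copy()
returns["annual_change"] = pd.to_numeric(returns["annual_change"], errors="coerce")

# (c)-(e) San Diego tracts with non-missing distance, combined by an inner join
sd_tracts = crosswalk.loc[crosswalk["cbsa"] == 41740, ["tract"]]
sd_central = centrality.dropna(subset=["distance"]).merge(sd_tracts, on="tract", how="inner")

# (f) Sort tracts into distance quintiles (1 = closest to the city center)
sd_central["quintile"] = pd.qcut(sd_central["distance"], q=5, labels=[1, 2, 3, 4, 5])

# (g), (j) Mean, standard deviation, and their ratio by quintile
dfSD = returns.merge(sd_central[["tract", "quintile"]], on="tract", how="inner")
grouped = dfSD.groupby("quintile", observed=True)["annual_change"]
summary = pd.DataFrame({"mean": grouped.mean(), "std": grouped.std()})
summary["sharpe"] = summary["mean"] / summary["std"]
print(summary)
```

Because the returns here are random draws that do not depend on distance, the printed table says nothing about the patterns asked about in parts (h) and (k); it only shows the shape of the calculation.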
2 Mortgage Underwriting
In this exercise, you will study the probability a mortgage loan application is denied based on (a) the applicant’s loan-to-income ratio (LTI); and (b) the applicant’s race. As in lecture, you will perform this analysis using data from the Home Mortgage Disclosure Act (HMDA). If you are using Python, you will likely need to import the following modules: requests, zipfile, io, pandas, numpy, statsmodels.api, and matplotlib.pyplot.
(a) Read the file hmda-2016.csv as a dataframe. This file corresponds to a cleaned version of the HMDA dataset discussed in lecture. If you are using Python, this can be done through the command:
import pandas as pd
df = pd.read_csv('hmda-2016.csv')
(b) Create a new dataframe consisting of a random 80% subsample of the full data. Be sure to sample without replacement. Why is it important to sample without replacement?
(c) Filter the dataset defined in (b) such that: the edit status is empty; the loan is for the purchase of a home; the property is single family (i.e. one-to-four family); the property is owner-occupied; and the borrower’s income is non-empty.
(d) Create the following new variables: an indicator for whether the loan application was denied; the borrower’s loan-to-income ratio (LTI); and an indicator for whether the borrower is African-American or Hispanic. To construct the last variable, review the HMDA code sheet here, and note the value for Race which codes whether the borrower is African-American and the value for Ethnicity which codes whether the borrower is Hispanic.
(e) Use a logistic regression to estimate the probability that a borrower is denied as a function of her loan-to-income ratio (LTI) and a constant. Note that this regression is similar to that estimated in lecture, except that the outcome variable differs (i.e. loan denial vs. loan sale). What are the estimated coefficients on the variables in your regression (i.e. β0, β1)? Using these estimated coefficients, calculate each loan application’s probability of being denied.
(f) What is the sign of the coefficient on the borrower’s loan-to-income ratio (LTI)? Provide an economic reason for why you obtain this result. Relative to using a logistic regression, what would be one advantage to using an OLS regression to estimate the probability of loan denial?
(g) Using a similar methodology as in part (e), use a logistic regression to estimate the probability that a borrower is denied as a function of her loan-to-income ratio (LTI), a constant, and an indicator for whether the borrower is African-American or Hispanic. What are the estimated coefficients on the variables in your regression? Using these estimated coefficients, calculate each loan application’s probability of being denied.
(h) What is the sign of the coefficient on the borrower’s status as African-American or Hispanic? Does this result provide evidence that lenders practice racial discrimination? What additional variables would you need to test whether loan officers indeed discriminate against African-American or Hispanic borrowers?
(i) Plot a histogram of the probability of loan denial using the values calculated in part (g). Do the same using the values calculated in part (e). Why do the two distributions look different? Which model provides a more accurate prediction?
(j) In part (c), we restricted our analysis to owner-occupied properties. Why might the loan-to-income ratio be less informative for non-owner-occupied properties?