
 Programming in R - Week 3 Assignment

IPAL - The University of Chicago
Due: Sunday, July 19th at 11:59pm on Canvas
Structure
This assignment will focus on using real-world data to create a presentable, reproducible plot and regression
table that can be updated easily. As we covered in class, R Markdown is ideal for programmatically generating
reports that rely on constantly changing data. We will use this to create a report on the geographic distribution
of crime in Chicago, focusing on thefts. The end goal is to create a map, a regression table, and some short
writing discussing the implications of both.
Like before, this problem set is fairly open-ended. It will be broken into three sections: Section 1 is worth 20
points, and Sections 2 and 3 are worth 14 points each. Start by creating a new project/folder for this assignment. The
output for this assignment should NOT be a .R file. Instead, it should be two separate files: a .Rmd file with
your raw code, and a finalized PDF or HTML document. The latter should include your code (in chunks,
with comments) as well as your plot, table, and analysis, e.g.:
x <- c(1:10,20:30) #Here's a random vector
y <- 8 * x #Multiplying every element of the vector by 8
mean(y) #Finding the mean value of all values in the vector 'y'
## [1] 125.7143
A part of the grade for this assignment will be based on the tidiness and quality of your output file.
Section 1: Data Loading and Prep
This problem set will focus on mapping and analyzing thefts in Chicago. We can gather data about thefts
using the same Chicago Data Portal API that we used for Problem Set 2. Start by altering your API URL to
pull thefts instead of homicides. Be sure to increase the limit returned by the API, as there are many more thefts
than homicides. Next, complete the following tasks:
1.1. Write a function named download_thefts that uses the read_json() function from jsonlite to
download any given year of crime data. The function should have a single input: year, and should output a
dataframe of thefts for that year.
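For instance, a minimal sketch of such a function might look like the following, assuming the Socrata endpoint for the Chicago crimes dataset (ijzp-q8t2) used in Problem Set 2; adjust the URL, filter, and limit to match your own API call.
library(jsonlite)

download_thefts <- function(year) {
  url <- paste0(
    "https://data.cityofchicago.org/resource/ijzp-q8t2.json",
    "?primary_type=THEFT&year=", year, "&$limit=300000"
  )
  read_json(url, simplifyVector = TRUE) #simplifyVector = TRUE returns a dataframe
}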
1.2. Create a vector of years starting in 2016 and ending in 2019. Use your function download_thefts and
a for loop to iterate through each year in the vector and download the data relevant to that year. Use
bind_rows() to combine the data for each year into a final dataframe called thefts. You can also use the
map family of functions to complete this task. NOTE: You may need to drop column 22 from each year’s data
to avoid errors.
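One possible version of the loop, assuming download_thefts() works as sketched in 1.1:
library(dplyr)

years <- 2016:2019
theft_list <- list()
for (yr in years) {
  yearly <- download_thefts(yr)
  theft_list[[as.character(yr)]] <- yearly[, -22] #drop column 22 to avoid bind errors
}
thefts <- bind_rows(theft_list)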
1.3. Using the thefts dataframe, use the same lubridate functions you used in Problem Set 2 to extract
the year, month, day, week, and hour columns. Additionally, drop any rows that have an NA value for the
latitude or longitude columns.
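A sketch of this cleaning step, assuming the raw timestamp column is named date (as in the Data Portal's schema):
library(lubridate)

thefts <- thefts %>%
  mutate(
    date = ymd_hms(date), #parse the raw timestamp string
    year = year(date),
    month = month(date),
    day = day(date),
    week = week(date),
    hour = hour(date)
  ) %>%
  filter(!is.na(latitude), !is.na(longitude)) #drop rows missing coordinates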
1.4. Create a new column called classification in the thefts dataset. Using ifelse() or case_when(),
set classification equal to “petty” when the description column indicates a theft of less than $500, pocket-picking,
or purse-snatching. Set classification equal to “grand” for all other values of description.
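One way to build the column, assuming description values along the lines of "$500 AND UNDER", "POCKET-PICKING", and "PURSE-SNATCHING" (check unique(thefts$description) for the exact strings in your pull):
thefts <- thefts %>%
  mutate(classification = ifelse(
    description %in% c("$500 AND UNDER", "POCKET-PICKING", "PURSE-SNATCHING"),
    "petty",
    "grand"
  ))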
Section 2: Mapping
Question 2.1
Now that we’ve loaded and cleaned our dataset, let’s take a look at the geographic distribution of thefts. We
need to convert the latitude and longitude columns in the data into spatial geometries (points) before
plotting. Use the st_as_sf() function to convert the respective columns into an sf geometry column. Specify
the CRS as 4326 when converting, and use the remove = FALSE argument to keep the original latitude and
longitude columns in the data. You will have a new column in your dataset named geometry if successful.
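A sketch of the conversion; coordinates downloaded from the JSON API typically arrive as character, so they are converted to numeric first:
library(sf)

thefts <- thefts %>%
  mutate(latitude = as.numeric(latitude), longitude = as.numeric(longitude)) %>%
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326, remove = FALSE)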
Question 2.2
Next, create a filtered version of the thefts dataset that contains only data from the first two months of
2019. Call this dataset thefts_fil.
Using thefts_fil, replicate the plot below using ggplot() and geom_sf().
[Figure: point map titled "Thefts in Chicago (Jan. & Feb. 2020)", points colored by Theft Category (Grand, Petty). Source: City of Chicago Data Portal]
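A rough sketch of the filter and plot; the exact colors, point size, and theme of the original figure are guesses:
library(ggplot2)

thefts_fil <- thefts %>%
  filter(year == 2019, month %in% c(1, 2))

ggplot(thefts_fil) +
  geom_sf(aes(color = classification), size = 0.5, alpha = 0.5) +
  labs(title = "Thefts in Chicago (Jan. & Feb. 2020)",
       color = "Theft\nCategory",
       caption = "Source: City of Chicago Data Portal") +
  theme_void()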
This plot is still pretty hard to read, and it’s difficult to discern what conclusions we should draw from it. To
make the map clearer, we can aggregate the individual-level data to the Census tract level and examine data
from a longer time period.
Question 2.3
Start by downloading Census tracts for Cook County, IL using tidycensus. We want to get the geometries
for each tract, so be sure to set geometry = TRUE when using get_acs(). We also want to retrieve the total
population (variable code: “B01001_001”) for each tract. NOTE: You should use the 2016 5-year ACS for all
your boundaries and variables in this problem set. Save your Census tract data to a dataframe named cook.
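A sketch of the download, assuming a Census API key has already been set with census_api_key():
library(tidycensus)

cook <- get_acs(
  geography = "tract",
  variables = c(total_pop = "B01001_001"),
  state = "IL",
  county = "Cook",
  year = 2016,
  survey = "acs5",
  geometry = TRUE
)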
Our goal is to determine which Census tract each theft occurred in. To do so, we need to perform a point-in-polygon merge. Use st_join() to perform a point-in-polygon merge of the full thefts dataset and cook.
You may have to change the CRS of cook to 4326 before performing the join. You can do this with the
st_transform() function. The lecture script contains a relevant example of the format for a point-in-polygon
merge. Save the result of your point-in-polygon merge to a dataframe named thefts_merged.
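The merge itself can be as short as two lines:
cook <- st_transform(cook, 4326) #match the CRS of the thefts points
thefts_merged <- st_join(thefts, cook) #each theft picks up its tract's attributes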
Question 2.4
The thefts_merged dataframe should now be a combination of the original thefts data and data from each
theft’s respective Census tract, including the tract’s GEOID. We can aggregate by this GEOID column to
get various summary statistics for each tract. Before aggregating, however, get rid of the geometry column.
The geometry column contains point geometries which are no longer needed after merging, and getting rid of
it will speed up future operations. Set the geometry column to NULL using the standard assignment operator
(<-) or st_set_geometry(). Next, use group_by() and summarize() to get the average number of thefts
per year for each Census tract. Assign the result to a new dataframe called thefts_agg.
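One way to do the aggregation, dividing each tract's theft count by the four years of data (2016-2019) to get a per-year average:
thefts_merged <- st_set_geometry(thefts_merged, NULL) #drop the point geometries

thefts_agg <- thefts_merged %>%
  group_by(GEOID) %>%
  summarize(avg_thefts = n() / 4) #average thefts per year over 2016-2019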
Finally, join thefts_agg back to the cook dataframe using a simple left_join(). Drop any rows with NA
values from your resulting dataset. In the joined data, create a new variable called thefts_pc equal to the
number of thefts per capita for each tract. Finally, replicate the following map to the best of your ability.
[Figure: choropleth map titled "Thefts in Chicago (2016 − 2020)", tracts shaded by Avg. Thefts Per Capita Per Year (legend scale roughly 0.1 to 0.2). Source: City of Chicago Data Portal]
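A sketch of the join and choropleth; estimate is the population column returned by tidycensus, and drop_na() comes from tidyr:
library(tidyr)

cook_joined <- cook %>%
  left_join(thefts_agg, by = "GEOID") %>%
  drop_na() %>%
  mutate(thefts_pc = avg_thefts / estimate) #avg. thefts per capita per year

ggplot(cook_joined) +
  geom_sf(aes(fill = thefts_pc), color = NA) +
  scale_fill_gradient(low = "white", high = "red") +
  labs(fill = "Avg. Thefts\nPer Capita\nPer Year",
       title = "Thefts in Chicago (2016 - 2020)",
       caption = "Source: City of Chicago Data Portal") +
  theme_void()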
Briefly answer the following questions in the text of your R Markdown document.
• Why do you think thefts per capita is higher in the Loop and northwest side (the second- and first-most
red areas respectively)?
• What changes could we make to the map to further clarify the spatial distribution of thefts?
Section 3: Regression Analysis
Here, let’s try to formalize/test some of your answers to the questions above by running a regression.
Use the tidycensus package to retrieve median household income (“B19013_001”), percent white
(“B02001_002”), percent below the poverty line (“B17007_002”), and percent with a bachelor’s degree
(“B23006_023”) for each Census tract in Cook County. You’ll need to calculate the percentage values by
dividing the number of people for each value by the total tract population (for example, 50 white people /
200 total people × 100% = 25% white). Also, notice that the dependent variable is now “Average Thefts per
1000 people per Year” in order to make the regression tables more concise.
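A sketch of the covariate download; output = "wide" spreads the estimates into columns suffixed with E, which makes the percentage calculations straightforward:
acs_vars <- get_acs(
  geography = "tract",
  variables = c(med_income = "B19013_001",
                white = "B02001_002",
                poverty = "B17007_002",
                bachelors = "B23006_023",
                total_pop = "B01001_001"),
  state = "IL", county = "Cook",
  year = 2016, survey = "acs5",
  output = "wide"
) %>%
  mutate(pct_white = whiteE / total_popE * 100,
         pct_poverty = povertyE / total_popE * 100,
         pct_bachelors = bachelorsE / total_popE * 100,
         med_income_k = med_incomeE / 1000) #income in $1000s, per Table 1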
Question 3.1
Do your best to reproduce the regression results table below (Table 1) using the stargazer package/function.
Provide a brief interpretation of the results and comment on whether the coefficients seem plausible.
Table 1: Regression Results

                                      Average Thefts per 1000 per Year
Population                                        0.001∗∗
                                                  (0.000)
Median Household Income (1000s)                   0.141∗∗∗
                                                  (0.043)
Pct White                                         0.276∗∗∗
                                                  (0.031)
Pct Poverty                                       0.513∗∗∗
                                                  (0.177)
Pct Bachelor’s                                    0.314∗∗∗
                                                  (0.072)
Constant                                          17.534∗∗∗
                                                  (2.778)
Observations                                      849
R2                                                0.167
Notes: ∗∗∗Significant at the 1 percent level.
       ∗∗Significant at the 5 percent level.
       ∗Significant at the 10 percent level.
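A minimal stargazer sketch; reg_data, thefts_per_1000, and the covariate names below are hypothetical placeholders for whatever your merged dataframe actually contains:
library(stargazer)

model1 <- lm(thefts_per_1000 ~ total_popE + med_income_k + pct_white +
               pct_poverty + pct_bachelors, data = reg_data)

stargazer(model1,
          type = "text", #use type = "latex" when knitting to PDF
          title = "Regression Results",
          dep.var.labels = "Average Thefts per 1000 per Year",
          covariate.labels = c("Population", "Median Household Income (1000s)",
                               "Pct White", "Pct Poverty", "Pct Bachelor's"))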
Question 3.2
Choose an additional variable from the Census that you think is relevant to include in your regression. You
can download a list of available variables and their associated codes by using the following command and
using RStudio’s “filter” function to search for variables.
v18 <- load_variables(2018, "acs5", cache = TRUE)
view(v18)
Describe your chosen variable and your hypothesis for why it matters. Add the variable into your regression
and present the results as a new column in the stargazer table from 3.1. How did the results change, if at
all? Does the difference (or lack thereof) between your new and old regressions imply anything about our
previous results?
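Adding the new model as a second column is a small change; new_var below is a hypothetical placeholder for your chosen variable:
model2 <- update(model1, . ~ . + new_var) #refit with the extra regressor
stargazer(model1, model2, type = "text",
          dep.var.labels = "Average Thefts per 1000 per Year")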
Question 3.3
Name and briefly describe an additional change to the model that could help us get a better idea of the true
average effect of the independent variables on average thefts per 1000 per year.
 