To get full credit for a response, your answer must be correct, and your code must follow the style. guide
covered in lecture note 3.
The problem set is worth 10 points total. Name your submission files ps3.rmd and ps3.html (1 point). Knit
and push to github early. Zero credit will be given for late problem sets.
In several places, the problem set asks you to explain something “in your own words”. In this situation,
you should not copy-paste a definition you find online. Instead, do enough reading online until you feel like
you understand the concept. Then, call your mother (or your roommate or your friend who is not a Harris
student) and explain each concept to her. Finally, write down the concept, as you explained it to your mother.
1 Public Sector Application: Home Health Care Data
The Department of Health and Human Services Office of the Inspector General is interested in identifying
home health agency that are potentially over-billing Medicare. Here is the report we will learn more about on
Wednesday. However, nursing homes could also be billing a lot because they have unusually sick patients. We
wouldn’t want to punish nursing homes that are taking sicker patients (just like in Jeff Liebman’s statelandia
example).
Medicare has developed a set of Home Health Resource Group (HHRG) codes to address this issue. For each
group, they have a Hierarchical Condition Category (HCC).
1.1 Misc
1. Who did you work with?
1.2 Data Ingestion (1 point)
1. Find and download Medicare’s home health public use files (PUF) by provider for 2014. Open the
file in Excel. What does it mean when there is a “*" listed in this file? Read the file into R. You will
get some warnings when you read it in. What are the warnings? Name the data frame. prov. Use
test_that to check that you have 10882 rows.
1. Find and download Medicare’s home health PUF by provider and HHRG for 2014. Read it into R and
name it prov_hhrg. Use test_that to check that you have 105400 rows.
2. Download Medicare’s 2014 case-mix weights using the file here. Name the data frame. cm_wt. Name
the variable for 2014 weights hcc. Drop the column named “2013 HH PPS Case-Mix Weights”. Use
test_that to check that you have 153 rows.
1.3 Institutional Research (1 point)
All administrative datasets contain acronyms and reflect concepts that will take some time to understand.
As a data analyst, you will always know less about the institutions than the people who run and staff them.
However, program staff have limited time to talk to you and so you need to figure out as much as you can
from Google before you meet with them.
1. What are five examples of services that home health agencies bill Medicare for?
2. In your own words, explain what is an HHRG code.
3. In your own words, explain what is an HCC score. (For the purposes of this problem set, we will use
the terms “case mix weight” and “HCC score” interchangeably.) What does it mean for a patient to
have a score of 0.75? of 1.5?
4. In your own words, explain what is a LUPA is.
1.4 Data Validation (1 point)
1. Using public documents, calculate how many people received home health care benefits from Medicare
in calendar year 2014. Compare this to the total number of beneficiaries in prov and in prov_hhrg.
Do the numbers from these sources align? If not, why do you think they might not align?
2. Compare the total number of episodes in prov and in prov_hhrg. Do the numbers from these sources
align? If not, why do you think they might not align?
3. What two columns uniquely identify all the rows in prov_hhrg? (Hint: use n_distinct to document
your answer.)
1.5 Merge Provider Costs with Provider Case-Mix (3 points)
To assess whether a provider is overbilling, we need merge the HCC weights in cm_wt on to prov_hhrg. This
requires some syntax that we haven’t covered yet, so we will walk through it step-by-step.
1. Google to find the “tidy data” vignette and read it (or read R4DS chapter 12). What does the separate
command do? What are the required arguments? What does the sep option do?
2. Google to find the dplyr two-table verbes vignette and read it (or read R4DS chapter 13). What does
the inner_join command do? What are the required arguments? Find and link to a webpage which
compares the merge commands in stata to the join commands in dplyr. Do dplyr joins generate a
variable similar to _merge in stata?
3. Google to find the stringr vignette and read it (or read R4DS chapter 14). What does the str_sub
command do? What are the required arguments? What does the str_trim command do?
4. There is no common identifer between the two datasets. However, there is common information that
can be used to link the two datasets. Review both datasets using the View statement. What five types
of information are available in both datasets? In your own words, explain what of these five types of
information are.
5. Which column(s) do you plan to merge on from prov_hhrg? How many distinct HHRG groups are
there using this (these) column(s)?
6. Take the column(s) you chose from prov_hhrg and use separate and/or str_sub to add five new
columns – one for each of the information types you listed in #1 above. Be sure to apply str_trim to
any columns which contain text.
1. R will likely throw a warning “Too many values...” followed by a series of row numbers. List
three of the row numbers returned in the warning message. Use filter(row_number() == xx)
to check these rows by hand. Why did separate or str_sub throw a warning for these rows? Do
you think it makes sense to drop these rows? Why or why not?
7. Which column(s) do you plan to merge on from cm_wt? How many distinct HHRG groups are there
using this (these) column(s)?
8. Take the column(s) you chose from cm_wt and use separate and/or str_sub to add five new columns –
one for each of the information types you listed in #1 above. Be sure to apply str_trim to any columns
which contain text. Be sure to use the same five column names as you did in the previous question.
9. A successful merge requires both datasets to have the same values in addition to the same column
names. For each of the five new columns, run count in both datasets. Which of the column(s) have the
same values in both datasets? Which of the column(s) have similar values, but require further cleanup?
Read about the fct_recode and fct_collapse commands in section 15.5. Use these two commands
to fix the columns in cm_wt to ensure that the five columns have identical values to prov_hhrg.
10. Create a new df called prov_hhrg_wt by inner joining cm_wt to prov_hhrg. Here are two tests to check
that your merge worked: (a) use test_that to check that prov_hhrg_wt has 105400 rows, (b) use the
count(is.na(hcc)) command to show that hcc is non-missing for all the rows.
1.6 Billing Outlier Analysis (3 points)
Construct a dataset prov_sum with one row per Provider ID and the following columns: Provider ID, agency
name, state, average HHA medicare payment amount (weighted by total episodes), average HCC score
(weighted by total episodes) and the number of total episodes.
For each question below, please follow the four-part approach laid out in lecture. I have given you the question
(step 1). You should write out your query (step 2), show the plot or table that resutls from this query (step 3),
and write out the answer to the question in a sentence (step 4).
1. Question: How much variation is there in average cost per episode by home-health agency?
2. Question: How much variation is there in average cost after accounting for case-mix weight? Show
three different ways to depict the covariation of these two variables. Then explain which plot you prefer
to answer the question and why.
3. For each HHA, construct a new “normalized cost” variable which is the ratio of average cost to the
average case-mix weight. Question: How much of the variation in average cost per episode across
HHAs is accounted for by differences in case-mix? (Hint: Find and link to a stack overflow thread on
overlaying histograms with ggplot2.)
4. Question: What are the top 5 HHAs with the highest billing per episode in Illinois? What are the top
5 HHAs with the highest billing per episode after normalizing for case mix in Illinois? Is there any
overlap between these two lists? What might happen if OIG decided to try to push down costs at the 5
HHAs with the highest billing per episode in Illinois?
5. For each HHA, construct a new variable “possible overbilling per episode” which is average amount
paid by Medicare to the HHA minus the normalized cost variable you built in 3. Question: Summing
over all episodes, which 5 HHAs in Illinois have the most possible over-billing? When would this list be
useful and when would the list in your response to question (4) be useful?