The problem set is worth 10 points total. Name your submission files ps2.rmd and ps2.html (1 point).
Be sure to test your ability to knit and your ability to submit on GitHub early! Zero credit will be given for
problem sets which are submitted late or not submitted in R Markdown + HTML format.
1 R for Data Science Exercises
1.1 Misc
1. Who did you work with?
1.2 5.6.7 (1 point)
1. Calculate the number of flights by each carrier. Report the results in reverse alphabetical order.
2. Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights.
Write the dplyr code which calculates these 5 delay measures separately by airline. It might be helpful
to consider the following scenarios:
• A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
• A flight is always 20 minutes late.
• A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
• 99% of the time a flight is on time. 1% of the time it’s 2 hours late.
3. Which is more important from a passenger’s perspective: arrival delay or departure delay? Explain
why you think this is a better measure. Regardless of your answer, please use arrival delay for the rest
of the problem set (so that the TAs can check your answers).
4. Come up with another approach that will give you the same output as not_cancelled %>%
count(dest) and not_cancelled %>% count(tailnum, wt = distance) (without using count()).
5. Our definition of cancelled flights (is.na(dep_delay) | is.na(arr_delay)) is slightly suboptimal.
Why? Which is the most important column?
6. Make a histogram with the proportion of flights that are cancelled each day. Is there a pattern? Is the
proportion of cancelled flights related to the average delay?
7. Calculate average delays by carrier. Create a variable which ranks carriers from worst to best, where 1
is the worst rank.
8. Calculate average delays by destination for flights originating in NYC. Create a variable which ranks
destinations from worst to best, where 1 is the worst rank.
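The four scenarios listed under question 2 can be compared on a small scale before touching the real data. The following is only a sketch on made-up toy data: the `scenarios` tibble, its labels, and the choice of 100 flights per scenario are assumptions for illustration, and the five summaries shown are examples rather than the expected answer.

```r
library(dplyr)

# Hypothetical toy data: 100 flights per scenario, delay in minutes
# (negative = early). These are NOT rows from the flights data.
scenarios <- tibble(
  scenario = rep(
    c("A: +/-15", "B: always 20 late", "C: +/-30", "D: rare 2h delay"),
    each = 100
  ),
  delay = c(
    rep(c(-15, 15), each = 50),  # A: 15 min early or 15 min late, 50/50
    rep(20, 100),                # B: always 20 min late
    rep(c(-30, 30), each = 50),  # C: 30 min early or 30 min late, 50/50
    c(rep(0, 99), 120)           # D: on time 99% of the time, 2 h late 1%
  )
)

# Several candidate summaries of "typical" delay, computed per scenario.
delay_summary <- scenarios %>%
  group_by(scenario) %>%
  summarise(
    mean_delay   = mean(delay),
    median_delay = median(delay),
    sd_delay     = sd(delay),
    prop_late    = mean(delay > 0),
    max_delay    = max(delay)
  )

delay_summary
```

In this toy example, scenarios A and C have the same mean and median but different standard deviations, which is one reason a single summary statistic may not be enough.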
1.3 5.7.1 (1 point)
1. Which plane (tailnum) has the most minutes of delays total? How many planes are delayed every time
they appear in the dataset?
2. What time of day should you fly if you want to avoid delays as much as possible?
3. For each destination, compute the total minutes of delay. For each flight, compute the proportion of
the total delay for its destination.
4. Delays are typically temporally correlated: even once the problem that caused the initial delay has been
resolved, later flights are delayed to allow earlier flights to leave. Use lag() to explore how the delay of
a flight is related to the delay of the immediately preceding scheduled flight. Make a plot which shows
the relationship between a flight’s delay and the delay of the immediately preceding scheduled flight.
You have a lot of data, so think carefully about how to develop a plot which is not too cluttered.
5. Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent
a potential data entry error). Compute the air time of a flight relative to the shortest flight to that
destination. Which flights were most delayed in the air?
6. Find all destinations that are flown by at least two carriers.
7. For each plane, count the number of flights before the first delay of greater than 1 hour.
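For question 4, it may help to see how lag() behaves inside a grouped data frame. This is a minimal sketch on made-up toy data: the `toy` tibble and its values are hypothetical, and grouping by origin is an assumption about what "immediately preceding" could mean, not the required definition.

```r
library(dplyr)

# Hypothetical toy data: five flights from two origins (values are made up).
toy <- tibble(
  origin         = c("JFK", "JFK", "JFK", "LGA", "LGA"),
  sched_dep_time = c(630, 600, 700, 645, 615),
  arr_delay      = c(40, 5, 25, 10, 0)
)

lagged <- toy %>%
  group_by(origin) %>%
  # Sort within each origin so "previous row" means "previous scheduled flight".
  arrange(sched_dep_time, .by_group = TRUE) %>%
  # lag() shifts values down by one row within each group;
  # the first flight in each group has no predecessor, so it gets NA.
  mutate(prev_delay = lag(arr_delay)) %>%
  ungroup()

lagged
```

Note that the result depends on the row order at the time lag() is called, so the sort has to happen before the mutate() step.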
1.4 7.3 and 7.4 (1 point)
1.4.1 7.3.4
1. Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully
think about the binwidth and make sure you try a wide range of values.)
2. How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the
difference?
3. Compare and contrast coord_cartesian() vs xlim() or ylim() when zooming in on a histogram.
What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar
shows?
1.4.2 7.4.1
1. What happens to missing values in a histogram? What happens to missing values in a bar chart? Why
is there a difference?
2. What does na.rm = TRUE do in mean() and sum()?
1.5 7.5.2.1 (2 points)
Question: How does seasonality in delays vary by place?
1. First pass: make a data frame with average delay by destination and by month of the year. Use
geom_tile() to make a plot of this data frame. What makes the plot difficult to read? (List as many
issues as possible.)
2. Make a new plot which resolves at least one (and ideally all) of the issues that you raised, but still
answers the broad question “How does seasonality in delays vary by place?” One thing you should be
sure to do is develop a strategy for limiting the number of categories on the y-axis to 20.
3. Write out in words the answer to the question. Be sure that these are conclusions that a reader can
draw directly from your second plot rather than things you happened to learn along the way.
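As a reminder of the mechanics only, here is a sketch of the summarise-then-geom_tile() pattern on made-up toy data. The `toy_delays` tibble, the destination codes "AAA", "BBB", "CCC", and all delay values are hypothetical. The sketch does not attempt the readability fixes or the 20-category limit asked for above.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: three destinations, four months, made-up delays.
toy_delays <- tibble(
  dest      = rep(c("AAA", "BBB", "CCC"), each = 4),
  month     = rep(1:4, times = 3),
  arr_delay = c(10, 12, 8, 3, -2, 0, 5, 7, 20, 18, 25, 15)
)

# One row per destination-month pair, holding the mean arrival delay.
avg_delay <- toy_delays %>%
  group_by(dest, month) %>%
  summarise(avg_arr_delay = mean(arr_delay, na.rm = TRUE), .groups = "drop")

# One tile per destination-month pair, colored by mean arrival delay.
tile_plot <- ggplot(
  avg_delay,
  aes(x = factor(month), y = dest, fill = avg_arr_delay)
) +
  geom_tile() +
  labs(x = "Month", y = "Destination", fill = "Mean arrival delay (min)")

tile_plot
```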
2 Public Sector Application: Flight Data (4 points)
An international trade organization is hosting a two-day convention in Chicago in 2019. The mayor’s tourism
office has asked for some planning help based on historical data from 2016. Use the same data which you
analyzed for PS1, limiting the sample to flights to or from Midway and O’Hare.
For each question, please follow the four-part approach laid out in lecture. I have given you the question (step
1). You should write out your query (step 2), show the plot from this query (step 3), and write out the answer
to the question in a sentence (step 4).
1. When are average arrival delays (measured using the arrival delay variable) the lowest? When are at
least 80% of flights on-time? Make a single plot that answers both questions and write a sentence (or
two) that answers these questions.
2. When are flights to Chicago most common? Make a plot to support your answer and write a sentence
to answer the question.
3. What date do you recommend they have the convention? Take into account both the number of flights
to Chicago and that people would like to arrive in Chicago on-time and not get delayed on the way
in (don’t worry about delays on the way home). Why did you recommend this date? Write a few
sentences.
   1. In lecture, we covered the idea of “basic” plots and “sophisticated” plots. Make a “basic” plot which
   provides just the minimal amount of information needed to support your written recommendation.
   2. Make a “sophisticated” plot as well that contains more information about flight delays. What are
   the sub-messages in the “sophisticated” plot that are not in the “basic” plot? If you could submit
   only one of the two plots to the mayor’s office, which would you submit and why?
   3. You have (hopefully) reached the frontier of what you can do to answer this question with the
   data that you have. If you wanted to push the frontier further in figuring out when the convention
   should be held, what are two other public datasets that would be useful in making a decision? Include
   links to the datasets and the names of the variables you would analyze. We do not expect you to
   actually analyze these other datasets.
4. Now that you’ve decided when it will happen, please give the attendees a recommendation of which
airline to take in order to arrive on time. The attendees are not price-sensitive, so you don’t need to
worry about cost. Make a “basic” plot and a “sophisticated” plot to support your recommendation.
Which plot do you prefer and why?
5. The trade organization sends an update. Some of its most important members are in Savannah, which
is an airport with a ton of delayed flights to Chicago. Does that change your recommendation of when
to host the convention? Make a plot that supports your new recommendation and shows why it is
superior to your old recommendation.