ETX2250/ETF5922 Data Visualisation and Analytics
Assignment 2: Analytics – Flights
Submission instructions
This assignment comprises 15% of the assessment in ETX2250 and
ETF5922.
Your assignment submission will consist of a pdf document. The
document must include both your code and graphical and other output
as well as paragraph answers that provide description and discussion
of the output. The presentation of graphs is important. In
particular, headings and labels should be provided when necessary.
In your completed assignment, graphs should be easy to read, with
appropriate use of headings, colour, formatting and text.
The pdf document must be submitted as a hard copy in class, and also
must be uploaded to Moodle.
The pdf document should be titled _A2.pdf
(Note: No spaces, No Names, No other characters, No extra
characters)
The assignment is due before class on Monday 4th February 2019 (9.00
am), and a hard copy of the pdf document should be submitted in
class, securely stapled in the top left corner and with a signed
COVER SHEET on top.
If you are unable to submit the assignment in person, place it in
the assignment box on level 5 of Building H before the due time.
The cover sheet is available at:
https://www.monash.edu/__data/assets/word_doc/0004/903379/assignment-cover-sheet-fbe-1.doc
You must supply the unit code and further details.
Upload your assignment to Moodle as follows:
• Go to the Assignments section
• Click on Submission of Assignment 2: Analytics
• Click on Add submission
• Drag and drop the file to submit it
• Save changes
To confirm that your upload was successful, go to the Assignments
section, and click on the Assignment 2: Analytics link. The uploaded
filename will be shown.
Retain your marked assignment until after the publication of the
final results for this unit.
Assignment 2: Introduction
This assignment relates to the Analytics component of the unit.
The data-set is a subset of all flights arriving at and departing
from the two major airports in Chicago, USA in the year 2008. The
task is to investigate why some flights are delayed and to build a
predictive model of which flights will be delayed. This assignment
will run you through the first steps of any Business Analytics
project you plan on undertaking.
Assignment 2 Part A: Explore the data (4 Points)
The data is stored in a database (as it would be in any company you
work for).
The fields are:
1. flightstatus: 0 – normal flight, 1 – delayed
2. carrier: unique carrier code
3. dayofweek: 1 (Monday) - 7 (Sunday)
4. month: 1-12
5. departuretime: actual departure time (local, hhmm)
6. destination: destination IATA airport code
7. origin: origin IATA airport code
8. weatherdelay: weather delay in minutes. A weather delay is caused
by extreme or hazardous weather conditions that are forecast or
that occur at the point of departure, en route, or at the point
of arrival.
a) Connect to the data (1 Point)
The data is stored in the database summer2019 on the server with the
IP address 118.138.234.161. You can access this PostgreSQL database
with the username student and the password 4XcxqUo6AHPn. First,
create a connection to the database, reading the username and
password from a csv file in which you have stored them.
(Hint: Remember Workshop02)
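The credential-reading step can be sketched as follows. This is a minimal illustration in Python of the idea only (the assignment itself is completed in R, where the same approach applies). The header row `username,password`, the file name credentials.csv and the placeholder values are all assumptions, and no database connection is opened here.

```python
import csv
import io

# Placeholder contents standing in for a file such as credentials.csv
# (file name, column names and values are assumptions for illustration).
CREDENTIALS_CSV = "username,password\nexample_user,example_password\n"


def read_credentials(handle):
    """Return (username, password) from the first data row of a CSV file."""
    row = next(csv.DictReader(handle))
    return row["username"], row["password"]


# With a real file this would be: with open("credentials.csv") as handle: ...
username, password = read_credentials(io.StringIO(CREDENTIALS_CSV))
print(username)
```

Keeping the credentials in a separate file means they never have to appear in the code that is printed in the submitted document.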
b) Loading data (0.5 Points)
Load the data of the table flight_delay_chicago in the schema public
into R. Print out the first few rows.
c) Summary statistics (1 Point)
Using SQL, calculate the mean of every numeric column, grouped by
the variable of interest flightstatus. What do you notice?
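A grouped mean of this kind can be sketched with a single GROUP BY query. The example below runs the SQL against an in-memory SQLite table from Python so that it is self-contained. The rows are made up, the column types are assumptions, and the real table lives in PostgreSQL and would be queried from R.

```python
import sqlite3

# In-memory stand-in for the course database; all rows are made up.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE flight_delay_chicago ("
    "flightstatus INTEGER, dayofweek INTEGER, month INTEGER, "
    "departuretime INTEGER, weatherdelay REAL)"
)
con.executemany(
    "INSERT INTO flight_delay_chicago VALUES (?, ?, ?, ?, ?)",
    [
        (0, 1, 1, 900, 0),
        (0, 3, 2, 1100, 0),
        (1, 5, 1, 1800, 30),
        (1, 7, 12, 2200, None),  # missing weatherdelay
    ],
)

# One AVG per numeric column, one output row per value of flightstatus.
query = """
    SELECT flightstatus,
           AVG(dayofweek)     AS mean_dayofweek,
           AVG(month)         AS mean_month,
           AVG(departuretime) AS mean_departuretime,
           AVG(weatherdelay)  AS mean_weatherdelay
    FROM flight_delay_chicago
    GROUP BY flightstatus
    ORDER BY flightstatus
"""
group_means = con.execute(query).fetchall()
for row in group_means:
    print(row)
```

AVG skips NULL values, so the mean of weatherdelay for a group is taken over the rows where it is present.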
d) Correlations (1 Point)
Create a correlation plot between all numeric variables after
dropping all rows with missing values. For this exercise, set all
values in the variable weatherdelay to 1 if the delay is greater
than 0. Split your analysis by origin airport. Display the
correlation coefficients on the plots.
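The data preparation behind these plots can be sketched as below, in Python for illustration (the assignment is completed in R). It drops incomplete rows, flags weather delays greater than 0 and computes one Pearson coefficient per origin airport. The rows and the airport codes are made up, only two of the variables are correlated, and the plotting itself is left out.

```python
from math import sqrt

# Made-up rows: (origin, flightstatus, departuretime, weatherdelay)
rows = [
    ("ORD", 0, 900, 0),
    ("ORD", 1, 1800, 45),
    ("ORD", 1, 2100, 10),
    ("ORD", 0, 700, None),  # dropped below: missing value
    ("MDW", 0, 800, 0),
    ("MDW", 1, 1900, 0),
    ("MDW", 1, 2000, 20),
]


def pearson(xs, ys):
    """Pearson correlation coefficient, or None if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return None
    return cov / (spread_x * spread_y)


# Drop rows with any missing value, then flag weather delays greater than 0.
complete = [r for r in rows if None not in r]
flagged = [(o, status, dep, 1 if delay > 0 else 0) for o, status, dep, delay in complete]

# One coefficient (flightstatus vs. weather flag) per origin airport.
correlations = {}
for origin in sorted({r[0] for r in flagged}):
    subset = [r for r in flagged if r[0] == origin]
    correlations[origin] = pearson([r[1] for r in subset], [r[3] for r in subset])
print(correlations)
```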
e) Interpretation (0.5 Points)
Analyse the first results from the correlation plots in d) in
relation to the variable flightstatus.
Assignment 2 Part B: Cluster the data (4 Points)
For this exercise, the goal is to cluster the data with kmeans based
on dayofweek, month and departuretime and analyse the results.
a) Normalize the departuretime by reducing its precision to
the hour, e.g. 2215 becomes 22 (0.5 Points)
b) Write the code to iterate over different numbers of
clusters. Graph the appropriate metric for each number of
clusters and justify your choice of k. (1 Point)
c) Use the cluster assignment from your chosen k as the
dependent variable and build a decision tree classifier to
explain it. Print out the tree as well as the decision
rules. (1.5 Points)
d) Compare the kmeans clusters by averaging each variable
within each cluster. Which variables are important
according to the decision tree? (1 Point)
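Steps a) and b) can be sketched as follows, in Python for illustration (the assignment is completed in R). The hour is obtained by integer division by 100. The total within-cluster sum of squares is then computed for several values of k, which is one common metric to graph. The k-means routine is a simplified Lloyd's algorithm with a deterministic start, used here as a stand-in for a library implementation. The flights are made up and the features are left unscaled.

```python
def to_hour(hhmm):
    """Truncate an hhmm time such as 2215 to its hour, 22."""
    return hhmm // 100


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_wss(points, k, iterations=20):
    """Run a basic Lloyd's k-means; return the total within-cluster sum of squares."""
    # Deterministic start: k points spread evenly through the input order.
    centers = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centers[j]))
            clusters[nearest].append(p)
        # New center = mean of the cluster; an empty cluster keeps its old center.
        centers = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centers[j]
            for j, c in enumerate(clusters)
        ]
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


# Made-up flights: (dayofweek, month, departuretime as hhmm)
flights = [(1, 1, 615), (2, 1, 640), (1, 2, 705),
           (6, 7, 2215), (7, 8, 2150), (6, 8, 2305)]
points = [(d, m, to_hour(t)) for d, m, t in flights]

# One value of the metric per candidate k; these are the values to graph.
wss = {k: kmeans_wss(points, k) for k in range(1, 4)}
for k, value in wss.items():
    print(k, round(value, 3))
```

A sharp drop followed by a flattening of the curve is the usual visual cue for choosing k from such a graph.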
Assignment 2 Part C: Build a predictive model (4 Points)
The goal of this exercise is to build a classifier that can tell you
whether a flight is likely to be delayed. For this, we split the
data into a training set with all the data up to and including
September, and use October, November and December as our test set.
As we cannot know the weather delay before it happens, recode it as
1 if the delay is greater than 0 and as 0 otherwise. In reality, we
would look at the weather forecast and assign a 1 if the weather
conditions are worrisome.
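The recoding and the month-based split described above can be sketched as below, in Python for illustration (the assignment is completed in R). The rows are made up, and "up to September" is read as including September.

```python
# Made-up flights with only the columns needed for this illustration.
flights = [
    {"month": 3, "weatherdelay": 0, "flightstatus": 0},
    {"month": 9, "weatherdelay": 25, "flightstatus": 1},
    {"month": 10, "weatherdelay": 0, "flightstatus": 1},
    {"month": 12, "weatherdelay": 40, "flightstatus": 1},
]


def recode_weather(delay_minutes):
    """Return 1 if there was any weather delay, otherwise 0."""
    return 1 if delay_minutes > 0 else 0


# Replace the delay in minutes by the 0/1 flag.
prepared = [dict(f, weatherdelay=recode_weather(f["weatherdelay"])) for f in flights]

# Split by calendar month (assumption: September belongs to the training set).
train_set = [f for f in prepared if f["month"] <= 9]
test_set = [f for f in prepared if f["month"] >= 10]
print(len(train_set), len(test_set))
```

Splitting by month rather than at random keeps every test flight later in the year than every training flight.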
a) Cleaning the data and preparation (1 Point):
• Filter out all rows where the destination airport occurs fewer
than 100 times.
• Filter out all rows where the carrier occurs fewer than 100
times.
• Filter out all rows with missing entries.
• Recode the variable weatherdelay as a factor with the levels 0
and 1.
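The filtering steps above can be sketched as follows, in Python for illustration (the assignment is completed in R). The rows are made up, and a threshold of 2 replaces the 100 from the task so that the toy data shows the effect. The filters are applied in the order listed. Recoding to a factor has no direct equivalent here and is left out.

```python
from collections import Counter


def drop_rare(rows, column, min_count):
    """Keep only rows whose value in `column` occurs at least `min_count` times."""
    counts = Counter(r[column] for r in rows)
    return [r for r in rows if counts[r[column]] >= min_count]


def drop_missing(rows):
    """Keep only rows in which no field is missing (None)."""
    return [r for r in rows if all(v is not None for v in r.values())]


# Made-up rows for illustration.
rows = [
    {"destination": "LAX", "carrier": "AA", "weatherdelay": 0},
    {"destination": "LAX", "carrier": "AA", "weatherdelay": 15},
    {"destination": "LAX", "carrier": "AA", "weatherdelay": None},
    {"destination": "LAX", "carrier": "UA", "weatherdelay": 5},
    {"destination": "JFK", "carrier": "AA", "weatherdelay": 0},
]

MIN_COUNT = 2  # the assignment uses 100; 2 suits the toy data
cleaned = drop_rare(rows, "destination", MIN_COUNT)
cleaned = drop_rare(cleaned, "carrier", MIN_COUNT)
cleaned = drop_missing(cleaned)
print(len(cleaned))
```

Because the carrier counts are taken after the destination filter, the order of the steps can change which rows survive.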
b) Building (1 Point)
Split the data into a training set and a test set, keeping only the
variables flightstatus, carrier, dayofweek, departuretime,
destination and weatherdelay. Build a decision tree on the training
set and predict on the test set (remember the months). Print out
the decision tree as well as the rules.
c) Plot the ROC and Precision-Recall curves on the test set.
What do they tell you? (1 Point)
d) Explain the result. What are the important
predictors according to the tree? How does it
compare to the clustering result? Why did we exclude
the month variable? (1 Point)
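The points behind the ROC and Precision-Recall curves in c) can be sketched as below, in Python for illustration (the assignment is completed in R). A threshold is swept over the predicted scores, and the true and false positives are counted at each step. The labels and scores are made up; in the assignment the scores would be the tree's predicted probabilities on the test set.

```python
def roc_and_pr_points(labels, scores):
    """Sweep a threshold over the scores; collect ROC and precision-recall points."""
    positives = sum(labels)
    negatives = len(labels) - positives
    roc, pr = [(0.0, 0.0)], []
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        roc.append((fp / negatives, tp / positives))  # (FPR, TPR)
        pr.append((tp / positives, tp / (tp + fp)))   # (recall, precision)
    return roc, pr


def area_under(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up test-set labels (1 = delayed) and predicted scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]

roc, pr = roc_and_pr_points(labels, scores)
roc_auc = area_under(roc)
print(round(roc_auc, 3))
```

Plotting the `roc` points as FPR against TPR and the `pr` points as recall against precision gives the two curves.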
Overall presentation: 1 point