COMP 2019 Assignment 2 – Machine Learning

Please submit your solution via LEARNONLINE. Submission instructions are given at the end of this assignment.
This assessment is due on Sunday, 10 June 2018, 11:55 PM.
This assessment is worth 20% of the total marks.
This assessment consists of 6 questions.
In this assignment you will aim to predict if it will rain on each day given weather observations from the
preceding day. You will perform a number of machine learning tasks, including training a classifier, assessing
its output, and optimising its performance. You will document your findings in a written report. Write
concise explanations; approximately one paragraph per task will be sufficient.
Download the data file for this assignment from the course website (file weather.zip). The archive contains
the data file in CSV format, and some Python code that you may use to visualise a decision tree model.
Before starting this assignment, ensure that you have a good understanding of the Python programming
language, the Jupyter Python notebook environment, and an overall understanding of machine learning
training and evaluation methods using the scikit-learn Python library (Practical 3). You will need a working
Python 3.x system with the Jupyter Notebook environment and the ‘sklearn’ package installed.
Documentation that you may find useful:
• Python: https://www.python.org/doc/
• Jupyter: https://jupyter-notebook.readthedocs.io/en/stable/
• Scikit-learn: http://scikit-learn.org/stable/
• Numpy: https://docs.scipy.org/doc/
Preparation
Create a Jupyter notebook and load the data. Use
import numpy as np
data = np.loadtxt('weather.csv', skiprows=1, delimiter=',', dtype=int)
to load the data. Type this code into the notebook rather than copying and pasting it from this document,
because pasted text may contain typographic quotes that cause syntax errors. (Students familiar with the
Pandas library may use that to load and explore the data instead.)
Familiarise yourself with the data. There are 44 columns and 2716 rows. All values are binary (0/1) where 0
indicates false and 1 indicates true.
Categorical variables were encoded using “One Hot” coding, where a separate column is used to indicate the
presence or absence of each possible value of the variable. For example, the three binary-valued columns
“MinTemp_Low”, “MinTemp_Moderate”, “MinTemp_High” correspond to the three possible values “Low”,
“Moderate”, and “High” of variable “MinTemp”. A 1 in column “MinTemp_Low” means that the value of
MinTemp was “Low”; the cells for the other two values must be 0 in this case.
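As a rough illustration of the encoding described above, the following sketch checks that exactly one of the three columns of a variable is set in every row. The small array is a made-up stand-in for the three MinTemp columns, not the real data, and the names min_temp, is_one_hot, and category are assumptions introduced for this example.

```python
import numpy as np

# Hypothetical stand-in for the columns MinTemp_Low, MinTemp_Moderate,
# MinTemp_High (illustrative values only, not taken from weather.csv).
min_temp = np.array([
    [1, 0, 0],  # Low
    [0, 1, 0],  # Moderate
    [0, 0, 1],  # High
    [0, 1, 0],  # Moderate
])

# In a valid one-hot encoding, every row contains exactly one 1.
row_sums = min_temp.sum(axis=1)
is_one_hot = bool(np.all(row_sums == 1))
print(is_one_hot)

# Recover the category index per row (0 = Low, 1 = Moderate, 2 = High).
category = min_temp.argmax(axis=1)
print(category)
```

The same check could be applied to each group of three columns in the loaded data, assuming the columns of one variable are adjacent.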
Explore the distribution of data in each column.
The last column contains the prediction target (RainTomorrow).
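One possible way to look at the column distributions and to separate the features from the target is sketched below. It uses a small random array in place of the real data, so the numbers it prints are meaningless; the names X, y, ones_fraction, and class_counts are assumptions made for this example.

```python
import numpy as np

# Synthetic stand-in for the loaded array: binary values, target in the
# last column. Replace with the array loaded from weather.csv.
rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(200, 4))

# Features are all columns except the last; the target is the last column.
X = data[:, :-1]
y = data[:, -1]

# For 0/1 columns, the column mean is the fraction of rows that are 1.
ones_fraction = data.mean(axis=0)
for i, frac in enumerate(ones_fraction):
    print(f"column {i}: {frac:.2f} of rows are 1")

# Class counts of the target show how balanced the two classes are.
class_counts = np.bincount(y, minlength=2)
print(class_counts)
```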
The meaning of the columns is as follows:
• MinTemp_{Low,Moderate,High}: 1 if the minimum temperature on the day was low/moderate/high
• MaxTemp_{Low,Moderate,High}: 1 if the maximum temperature on the day was low/moderate/high
• Evaporation_{Low,Moderate,High}: 1 if the measured evaporation on the day was low/moderate/high
• Sunshine_{Low,Moderate,High}: 1 if the total duration of sunshine on the day was
low/moderate/high
• WindSpeed9am_{Low,Moderate,High}: 1 if the measured wind speed at 9am on the day was
low/moderate/high
• WindSpeed3pm_{Low,Moderate,High}: 1 if the measured wind speed at 3pm on the day was
low/moderate/high
• Humidity9am_{Low,Moderate,High}: 1 if the humidity at 9am on the day was low/moderate/high
• Humidity3pm_{Low,Moderate,High}: 1 if the humidity at 3pm on the day was low/moderate/high
• Pressure9am_{Low,Moderate,High}: 1 if the barometric pressure at 9am on the day was
low/moderate/high
• Pressure3pm_{Low,Moderate,High}: 1 if the barometric pressure at 3pm on the day was
low/moderate/high
• Cloud9am_{Low,Moderate,High}: 1 if the cloud cover at 9am on the day was low/moderate/high
• Cloud3pm_{Low,Moderate,High}: 1 if the cloud cover at 3pm on the day was low/moderate/high
• Temp9am_{Low,Moderate,High}: 1 if the temperature at 9am on the day was low/moderate/high
• Temp3pm_{Low,Moderate,High}: 1 if the temperature at 3pm on the day was low/moderate/high
• RainToday: 1 if it rained on the day
• RainTomorrow: 1 if it rained on the following day. This is the target we wish to predict.
Question 1: Baseline
A simple model for predicting rain tomorrow is to use today’s weather (RainToday) as an indicator of
tomorrow’s weather (RainTomorrow).
What performance can we expect from this simple model?
Choose an appropriate measure to evaluate the classifier.
Select among Accuracy, F1-measure, Precision, and Recall.
Use a confusion matrix and/or classification report to support your analysis.
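The mechanics of evaluating such a baseline can be sketched as follows. The two short arrays are invented for illustration and do not come from the data file; with the real data, the corresponding columns would be used instead. Which measure is appropriate is left to your analysis.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Invented example values (not from weather.csv).
rain_today = np.array([1, 0, 0, 1, 0, 0, 1, 0, 0, 0])
rain_tomorrow = np.array([1, 0, 0, 0, 1, 0, 1, 0, 0, 0])

# The baseline "predicts" tomorrow's rain by copying today's value.
y_pred = rain_today

# Rows are true classes, columns are predicted classes (order: 0, 1).
cm = confusion_matrix(rain_tomorrow, y_pred)
print(cm)

# Precision, recall, and F1 per class, plus overall accuracy.
print(classification_report(rain_tomorrow, y_pred))
```

Because the baseline has no trainable parameters, it can be evaluated on all available rows.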
Question 2: Naïve Bayes
Train a Naïve Bayes classifier to predict RainTomorrow.
As all attributes are binary vectors, use the BernoulliNB classifier provided by scikit-learn.
Ensure that you follow correct training and evaluation procedures.
1. Assess how well the classifier performs on the prediction task.
2. What performance can we expect from the trained model if we used next month’s data as input?
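A minimal sketch of one acceptable procedure is shown below, assuming a stratified hold-out split combined with cross-validation on the training portion. The data are synthetic, and the split size, number of folds, and F1 scoring are assumptions made for the example rather than requirements of the assignment.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import BernoulliNB

# Synthetic binary data: the target follows the first column, with
# roughly 10% of labels flipped. Replace with the real features/target.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 6))
noise = rng.random(500) < 0.1
y = np.where(noise, 1 - X[:, 0], X[:, 0])

# Hold out a test set that is never used for training or model selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = BernoulliNB()

# Cross-validation on the training portion estimates performance
# without touching the held-out test set.
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="f1")
print(cv_scores.mean())

# Final fit on the training set, evaluated once on the test set.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(test_accuracy)
```

The score on the held-out test set serves as the estimate of performance on unseen data, provided that future data resemble the data collected so far.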
Question 3: Decision Tree
Train a DecisionTreeClassifier to predict RainTomorrow. Use the argument class_weight='balanced' when
constructing the classifier, because the two classes of the target variable RainTomorrow are not equally
frequent in the data set.
Ensure that you follow correct training and evaluation procedures.
1. Assess how well the classifier performs on the prediction task.
2. What performance can we expect from the model on new data?
If you wish to visualise the decision tree, you can use the function print_dt from dtutils.py, which is
included in the Assignment 2 zip archive:
import dtutils
dtutils.print_dt(tree, feature_names=flabels)
where tree refers to the trained decision tree model, and flabels is a list of feature names (columns) in the
data.
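The sketch below trains a tree with class_weight='balanced' on synthetic data and prints it as text. Because the internals of dtutils.py are not shown here, the example uses scikit-learn's own export_text function as a stand-in for print_dt; the feature names in flabels and the synthetic target are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic binary data with an imbalanced target: y is 1 only when the
# first two features are both 1 (about a quarter of the rows).
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(400, 4))
y = X[:, 0] & X[:, 1]

# Hypothetical feature names; use the CSV header for the real data.
flabels = ["f0", "f1", "f2", "f3"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

# class_weight='balanced' reweights samples inversely to class frequency.
tree = DecisionTreeClassifier(class_weight="balanced", random_state=1)
tree.fit(X_train, y_train)

# Text rendering of the learned tree (stand-in for dtutils.print_dt).
tree_text = export_text(tree, feature_names=flabels)
print(tree_text)

train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print(train_acc, test_acc)
```

Comparing the training score with the test score is also a first step towards the diagnosis in Question 4.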
Question 4: Diagnosis
Does the Decision Tree model suffer from overfitting or underfitting? Justify why/why not.
If the model exhibits overfitting or underfitting, revise your training procedure to remedy the problem, and
re-evaluate the improved model. The DecisionTreeClassifier has a number of parameters that you can
consider for tuning the model:
• max_depth: maximum depth of the tree
• min_samples_leaf: minimum number of samples in each leaf node
• max_leaf_nodes: maximum number of leaf nodes
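One way to diagnose and tune the tree is sketched below, assuming a grid search with cross-validation over the three parameters listed above. The data are synthetic with deliberately noisy labels, and the candidate values in param_grid and the F1 scoring are arbitrary choices for illustration, not recommended settings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data: the target depends on two features, with about
# 20% of labels flipped so that an unconstrained tree can fit noise.
rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(600, 10))
signal = X[:, 0] | X[:, 1]
flip = rng.random(600) < 0.2
y = np.where(flip, 1 - signal, signal)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=2
)

# Diagnosis: a training score far above the test score suggests overfitting.
full_tree = DecisionTreeClassifier(class_weight="balanced", random_state=2)
full_tree.fit(X_train, y_train)
gap = full_tree.score(X_train, y_train) - full_tree.score(X_test, y_test)
print(f"train/test accuracy gap: {gap:.2f}")

# Remedy: search over tree-size parameters using cross-validation on
# the training set only (candidate values are illustrative).
param_grid = {
    "max_depth": [2, 4, 6, None],
    "min_samples_leaf": [1, 5, 20],
    "max_leaf_nodes": [None, 4, 8, 16],
}
search = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=2),
    param_grid,
    cv=5,
    scoring="f1",
)
search.fit(X_train, y_train)
best_params = search.best_params_
print(best_params)

# Evaluate the refitted best model once on the held-out test set.
tuned_test_f1 = search.score(X_test, y_test)
print(tuned_test_f1)
```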
Question 5: Recommendation
Which of the models you trained should be selected for the prediction task? Assume that all errors made are
equally severe. That is, predicting rain if there is actually no rain is just as bad as predicting no rain if it
actually rains.
Does your answer change if predicting rain for a day without rain is a negligible error? Justify why/why not.
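The effect of different error costs on model selection can be illustrated with a small sketch. The true labels and the predictions of the two hypothetical models below are invented, and the helper weighted_errors is an assumption introduced for this example, not part of scikit-learn.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented labels and predictions of two hypothetical models.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
pred_a = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])  # cautious: misses rain
pred_b = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])  # eager: false alarms


def weighted_errors(y_true, y_pred, fp_cost, fn_cost):
    """Total cost of errors given a cost per false positive/negative."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fp_cost * fp + fn_cost * fn


# All errors equally severe: every mistake costs 1.
equal_a = weighted_errors(y_true, pred_a, fp_cost=1, fn_cost=1)
equal_b = weighted_errors(y_true, pred_b, fp_cost=1, fn_cost=1)
print(equal_a, equal_b)

# Predicting rain on a dry day is negligible: false positives cost 0.
fn_only_a = weighted_errors(y_true, pred_a, fp_cost=0, fn_cost=1)
fn_only_b = weighted_errors(y_true, pred_b, fp_cost=0, fn_cost=1)
print(fn_only_a, fn_only_b)
```

In this invented example, model A has the lower cost when all errors count equally, while model B has the lower cost once false alarms are ignored, so the preferred model can depend on the cost assumption.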
Question 6: Report
Write a concise report showing your analysis for Questions 1-5.
Demonstrate that you have followed appropriate training and evaluation procedures, and justify your
conclusions with relevant evidence from the evaluation output.
Where there are alternatives (e.g. measures, procedures, models, conclusions), demonstrate that you have
considered all relevant alternatives and justify why the selected alternative is appropriate.
Do not include the Python code in your report.
Submission Instructions
Submit a single zip archive containing the following:
• weather.ipynb: the Jupyter Notebook file.
• weather.html: the HTML version of weather.ipynb showing the notebook including all output.
Create this by selecting File>Download as>HTML after having run all cells in the Jupyter notebook.
• report.pdf: the report as specified in Question 6.