2022 COMP309/AIML421 ML Tools and Techniques:
Assignment 1
This assignment has 100 marks (120 marks for AIML 421 students) and is due on 23:59pm,
Monday, 08 August 2022. Please submit your report as a single .pdf file including figures and
tables as required, PLUS your code as a separate file(a jupyter notebook as a .ipynb file or .py
file). Make sure you read the Assessment and Submission sections at the end of this file. This
assignment contributes 20% to your overall course grade.
Objectives
This assignment involves trying out a variety of classification and clustering algorithms. It
requires use of python, numpy, matplotlib, and scikit-learn, and serves as an introduction to all
those tools. You can run python in a jupyter notebook but there are others option to do the
development.
1 Classification [70 marks]
The part of assignment is to explore several classifiers in scikit-learn and investigate the hyperparameter for complexity control for each of these classifiers on three datasets by setting
the hyperparameter to a range of plausible values and seeing how well it does on ”held out”
data. To do this you will need train test split from scikit-learn. To get better estimates, simply repeat 50 times with different random splits (set the seed to get reproducible results). For
simplicity, use a 50:50 train:test split in all cases. For each setting of the hyperparameter, you
then have a distribution over 50 different classification accuracies on the test set. A nice way
to visualise these scores is to produce a box plot where the x-axis gives options for a parameter
of the model, while the y-axis indicates the spread for classification accuracies of the classifier.
Titles and Axis Labels are needed for clarity.
1.1 Machine Learning Models:
You will be trying out the following classifiers in scikit-learn:
1
(a) KNeighborsClassifier (K nearest neighbours)
(b) GaussianNB (the Gaussian form of Naive Bayes)
(c) DecisionTreeClassifier (A decision tree (DT))
(d) LogisticRegression (essentially, a perceptron)
(e) GradientBoostingClassifier (Gradient Boosted DTs)
(f) RandomForestClassifier (Random Forest)
(g) MLPClassifier (Neural Network)
1.2 Datasets:
You will need to test the models on the following three datasets (you can easily download
them from OpenML).
1. steel-plates-fault
2. ionosphere
3. banknotes
1.3 Tasks:
You should:
(i) Build a 7-by-3 (7 classifiers and 3 datasets) table to present the boxplots on the classifier
accuracy versus parameter values ( as the example table below). It probably won’t fit into
one summary figure. You should structure it as separate main plots, one per classifier,
each consisting of several subplots for different datasets.
steel-plates-fault ionosphere banknotes
KNN
GNB ... ... ...
... ... ... ...
You need to consider the parameter and the corresponding values for each classifier as
follows. Other hyperparameters should be left as default.
(ii) Present two summary tables, with rows being classifiers, and columns being datasets.
Table (1) is to contain the best mean value of the test errors.
Table (2) is to contain the best value for the hyperparameter.
(iii) Write a paragraph to compare and analyse the overall results as captured in these two
tables including which model has the best performance and why, how sensitive these
models are to the complexity control hyperparameter.
2
Classifier HyperParameter Values
1 KNeighborsClassifier ”n neighbors” [1, 2, 3, 4, 5]
2 GaussianNB ”var smoothing” [1e − 9, 1e − 5, 1e − 1])
3 LogisticRegression ”C” [0.1, 0.5, 1.0, 2.0, 5.0])
4 DecisionTreeClassifier ”max depth” [1, 3, 5, 8, 10]
5 GradientBoostingClassifier ”max depth” [1, 3, 5, 8, 10]
6 RandomForestClassifier ”max depth” [1, 3, 5, 8, 10]
7 MLPClassifier ”alpha” [1e − 5, 1e − 3, 0.1, 10.0])
2 Clustering [30 marks]
The scikit-learn library provides a suite of different clustering algorithms to choose from, each
of which offers a different approach to discover the natural groups in data.
In this part of assignment, you will explore the characteristics of different clustering algorithms by testing them on three toy datasets. You will use the make blobs() function,
make classification (with ”n clusters per class=1”) and make circles (with ”noise=0.3”) to create three toy clustering datasets. Each of them will have 1, 000 examples/instances, with two
input features. Set the seed in the dataset generators to get reproducible results.
2.1 Machine Learning Models:
You will be trying out the following clustering algorithms in scikit-learn:
(a) K-Means
(b) Affinity Propagation
(c) DBSCAN
(d) Gaussian Mixture Model
(e) BIRCH
(f) Agglomerative Clustering
(g) Mean Shift
2.2 Tasks:
You should:
(i) Fit the clustering models on the datasets and predicts a cluster for each example in the
datasets. Build a 7-by-3 (7 clustering algorithms and 3 datasets) table to present the
scatter plots showing the clusters generated by each algorithm.
(ii) Write a paragraph to compare and analyse the results of the clustering algorithms on the
three datasets, highlight the characteristics of these algorithms.
3
3 Classification and Clustering (for AIML421 Students only)
[20 marks]
The real world often throws up situations in which we have a mountain of unlabelled data but
only relatively few examples with their labels. Ultimately this is because acquiring accurate
labels is expensive, since it often requires human expertise, whereas acquiring unlabelled data
is often automatic and cheap.
This part asks you to investigate whether it is possible to use features derived from unsupervised learning to enhance the predictions made by a supervised learner (in this case, a
Classifier). To do this, act as most of the data is missing its label: you will use the iris data and
split it with 95 percent as a test set, and only 5 percent is available for training. Set the seed for
splitting to get reproducible results. You need to:
• Use k-means with k=3 as the cluster method, run on the whole dataset, and add the
cluster label as a new column in X.
• For a hundred repeats:
– Randomly split the iris dataset (5:95 splitting as above);
– Train the GaussianNB() Classifier on the training data [a] without and [b] with the
new feature, i.e. the new column derived from clustering;
– Obtain the classification accuracy on the test set of [a] and [b].
3.1 Tasks:
You should then,
(i) Provide a scatterplot of the accuracy of [a] on the x-axis against the the accuracy of [b] on
the y-axis. The Titles and Axis labels for the plots are needed for clarity.
(ii) Write a paragraph explaining whether the new feature derived from unsupervised learning can enhance the classification predictions and why.
Assessment
Format: You can use any font to write the report, with a minimum of single spacing and 11
point size (hand writing is not permitted unless with approval from the lecturers). Reports are
expected to be 2-6 pages for COMP309 students and 2-8 pages for AIML421 students. Reports
exceeding the maximum page limit will be penalised.
Communication: A key skill required of a scientist is the ability to communicate effectively.
No matter the scientific merit of a report, if it is illegible, grammatically incorrect, mispunctuated, ambiguous, or contains misspellings, it is less effective and marks will be deducted.
Late Penalties: The assignment must be submitted on time unless you have made a prior
arrangement with the course coordinator or have a valid medical excuse (for minor illnesses it
is sufficient to discuss this with the course co-ordinator). The penalty for assignments that are
handed in late without prior arrangement is one grade reduction per day. Assignments that
are more than one week late will not be marked.
4
Plagiarism: Plagiarism in programming (copying someone else’s code) is just as serious as
written plagiarism, and is treated accordingly. Make sure you explicitly write down where
you got code from (and how much of it) if you use any other resources asides from the course
material. Using excessive amounts of others’ code may result in the loss of marks, but plagiarism could result in zero marks!
Submission
You are required to submit a single .pdf report PLUS the python code file (.ipynb or .py)
through the web submission system from the COMP309/AIML421 course website by the due
time. Provide a README.txt file if you use any non-standard python libraries.