NAME:
SID:
EE6435 Homework 2
February 10, 2020
Homework 2 is due at 11:59 PM on Feb. 20. Please submit your homework
via Canvas. No late work will be graded. You may type your solutions or submit
a scanned version of your handwritten solutions. Some problems do not allow
partial credit. Where partial credit is allowed, a partially correct answer
receives half of the full mark, to keep the grading consistent. For example, if
the answer to Problem 5 is partially correct, the mark is 2.5 pts.
Problem 1 (15 pts)
Obtain one of the data sets available at the UCI Machine Learning Repository and apply
the three visualization techniques from Lecture 2 (histogram, scatter plot, box plot). For the
scatter plot, you may choose any two attributes. You can use Excel, MATLAB, R, or Weka to
produce the required visualization results.
Weka: https://www.cs.waikato.ac.nz/ml/weka/
UCI: https://archive.ics.uci.edu/ml/index.php
I suggest the following two data sets at UCI: Student Academics Performance Data Set and
Bank Marketing Data Set.
In your submission, clearly describe which tool(s) you used to generate the results.
Describe any code/commands you used as well. Then attach the figures.
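If you script the figures rather than build them interactively, the following minimal Python/matplotlib sketch produces all three required plot types. The data and attribute names here are placeholders, not a UCI data set; substitute the attributes you actually chose (and note that Python is not on the list of suggested tools, so check that it is acceptable):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render to a file; no display needed
import matplotlib.pyplot as plt

# Placeholder data standing in for two numeric attributes of a UCI data set.
rng = np.random.default_rng(0)
attr1 = rng.normal(50, 10, 200)
attr2 = attr1 * 0.5 + rng.normal(0, 5, 200)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(attr1, bins=20)            # histogram of one attribute
axes[0].set_title("Histogram of attr1")
axes[1].scatter(attr1, attr2, s=10)     # scatter plot of two attributes
axes[1].set_title("attr1 vs attr2")
axes[2].boxplot([attr1, attr2])         # box plots, one per attribute
axes[2].set_xticklabels(["attr1", "attr2"])
axes[2].set_title("Box plots")
fig.tight_layout()
fig.savefig("problem1_plots.png")
```

Saving to a file (here the made-up name `problem1_plots.png`) gives you a figure you can attach directly to the submission.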
Problem 2 (15 pts, no partial credit)
Consider the training examples shown in Table 1 for a binary classification problem.
(a) What is the entropy of this collection of training examples with respect to the positive
and negative classes?
(b) For a3, which is a continuous attribute, compute the information gain for every possible
binary split.
(c) What is the best split (between a1 and a2) according to the Gini index?
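Parts (a)–(c) are hand calculations, but you can sanity-check your arithmetic with a few lines of Python. The sketch below implements the standard definitions of entropy, Gini index, and impurity reduction for a split; the counts in the example are made up, not taken from Table 1:

```python
import math

def entropy(counts):
    """Entropy of a class distribution given raw class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gini(counts):
    """Gini index of a class distribution given raw class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def split_gain(parent_counts, child_counts_list, impurity=entropy):
    """Impurity reduction of a split: parent impurity minus the
    weighted average impurity of the child nodes."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * impurity(child)
                   for child in child_counts_list)
    return impurity(parent_counts) - weighted

# Illustrative only (NOT Table 1): a parent node with 5 positive and
# 5 negative examples, split into children with (4, 1) and (1, 4).
print(round(entropy([5, 5]), 4))                       # 1.0
print(round(split_gain([5, 5], [[4, 1], [1, 4]]), 4))  # 0.2781
```

Passing `impurity=gini` to `split_gain` gives the corresponding Gini-based comparison for part (c).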
Problem 3 (20 pts, no partial credit)
Consider the training examples in Table 2, where X, Y, and Z are attributes.
(a) Construct a two-level decision tree using the greedy approach described in Lecture 3. Use
entropy as the splitting criterion.
(b) For the induced tree, what are the error rates of all the leaf nodes?
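The greedy approach picks, at each node, the attribute whose split gives the largest information gain, then recurses on the children. If you want to verify an attribute choice, the sketch below computes information gain and selects the best binary attribute; the toy records are illustrative and are NOT the Table 2 data (in the toy data the class is fully determined by X):

```python
import math
from collections import Counter

def entropy_of(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(records, attr):
    """Entropy reduction from splitting `records` on attribute `attr`."""
    labels = [r["class"] for r in records]
    gain = entropy_of(labels)
    for v in set(r[attr] for r in records):
        subset = [r["class"] for r in records if r[attr] == v]
        gain -= len(subset) / len(records) * entropy_of(subset)
    return gain

def best_attribute(records, attrs):
    """Greedy choice: the attribute with the largest information gain."""
    return max(attrs, key=lambda a: info_gain(records, a))

# Illustrative toy data (NOT Table 2): class depends only on X.
toy = [
    {"X": 0, "Y": 0, "class": "C1"},
    {"X": 0, "Y": 1, "class": "C1"},
    {"X": 1, "Y": 0, "class": "C2"},
    {"X": 1, "Y": 1, "class": "C2"},
    {"X": 1, "Y": 0, "class": "C2"},
    {"X": 0, "Y": 1, "class": "C1"},
]
print(best_attribute(toy, ["X", "Y"]))  # X
```

Applying `best_attribute` at the root and again within each branch reproduces the two-level greedy construction; the leaf error rate is then the fraction of examples at the leaf that do not belong to its majority class.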
Table 1: Data set for Problem 2
Instance  a1  a2  a3   Target Class
1         T   T   1.0  +
2         T   T   4.0  +
3         T   F   5.0  −
4         F   F   4.0  +
5         F   T   7.0  −
6         F   T   6.0  −
7         F   F   8.0  −
8         T   F   7.0  +
9         F   T   3.0  −
Table 2: Data set for Problem 3
X  Y  Z  Num. of Class C1 examples  Num. of Class C2 examples
0  0  0  10                         15
0  0  1  0                          10
0  1  0  0                          20
0  1  1  45                         10
1  0  0  8                          42
1  0  1  12                         8
1  1  0  5                          0
1  1  1  5                          10
Problem 4 (15 pts)
Apply a decision tree classifier to the Iris data set, and also to the data set you chose in
Problem 1, using Weka. Submit and explain the decision trees produced by Weka.
You need to install Weka on your computer: https://www.cs.waikato.ac.nz/ml/weka/
In addition, the above website provides YouTube clips about the classification models.
Problem 5 (5 pts)
CityU has a set of rules to decide the academic status (i.e., class) of an undergraduate student
using GPA-related attributes. These rules can be found in the file at Canvas/files/data/.
Instead of applying these rules directly, represent them as a decision tree (no training is needed).