Introduction
Requirement
In this coursework, you will develop a k-Nearest Neighbour (k-NN) algorithm in Java and
integrate it with the WEKA machine learning toolbox. You will also use the desharnais effort
estimation data set to test your implementation.
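To make the idea concrete, the following is a minimal stand-alone sketch of k-NN for a numeric output. It is not the coursework's KnnParent or MyKnn class and does not use WEKA. The class name KnnSketch, the use of Euclidean distance, and the choice to predict the mean output of the k nearest neighbours are all assumptions made for illustration; it also handles numeric inputs only.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical illustration only; not part of the coursework's supporting classes. */
final class KnnSketch {

    /** Euclidean distance between two numeric input vectors of equal length. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Predicts the output for query as the mean output of its k nearest training examples. */
    static double predict(double[][] inputs, double[] outputs, double[] query, int k) {
        if (k < 1 || k > inputs.length) {
            throw new IllegalArgumentException("k must be between 1 and the number of examples");
        }
        // Sort the indices of the training examples by their distance to the query.
        Integer[] order = new Integer[inputs.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble(i -> distance(inputs[i], query)));
        // Average the outputs of the k closest examples (assumed aggregation rule).
        double sum = 0.0;
        for (int i = 0; i < k; i++) {
            sum += outputs[order[i]];
        }
        return sum / k;
    }

    public static void main(String[] args) {
        double[][] inputs = {{0.0, 0.0}, {1.0, 1.0}, {5.0, 5.0}};
        double[] outputs = {10.0, 20.0, 100.0};
        // The two nearest examples to (0.4, 0.4) have outputs 10 and 20.
        System.out.println(predict(inputs, outputs, new double[] {0.4, 0.4}, 2)); // prints 15.0
    }
}
```

Sorting all examples keeps the sketch short; it costs O(n log n) per query, which is acceptable for a small data set.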
(1) Operating system
WEKA is available for Linux, Mac and Windows. For the purpose of this coursework, please
use Linux or Mac.
(2) Downloading WEKA and accessing WEKA’s API
You will need to download WEKA’s developer version 3.9.1, which can be found at:
Unzip the weka-3-9-1.zip file that you have downloaded. Later on, you will need to use the file
weka.jar, which will be inside your weka-3-9-1 unzipped folder.
During your coursework, you may find it useful to check WEKA’s API, which can be found in the
following link:
(3) Downloading supporting class files
To facilitate your implementation, I have implemented an abstract class called KnnParent
containing some of the methods necessary to implement k-NN and to integrate it with WEKA. I
have created a javadoc explaining the methods of this class. Some of these methods are related to
creating the parameter (a.k.a. option) k of the k-NN algorithm, and enabling it to be displayed in
WEKA’s GUI. Other methods implement parts of the k-NN algorithm itself. I recommend that you
read the javadoc descriptions of all methods, paying special attention to the abstract methods,
which you will need to override in your implementation, and to the fields m_k, m_TrainingData, min
and max, which you will need to use in your methods.
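The javadoc is the authority on what the fields min and max actually hold. Purely as an illustration, the sketch below assumes they store the smallest and largest value of each numeric attribute in the training data, so that values can be rescaled to the range [0, 1] before distances are computed. The class MinMaxSketch and its methods are hypothetical names and are not part of KnnParent.

```java
import java.util.Arrays;

/** Hypothetical illustration of min-max rescaling; not KnnParent's actual code. */
final class MinMaxSketch {

    /** Smallest value in each column of data (one row per example). */
    static double[] columnMin(double[][] data) {
        double[] min = new double[data[0].length];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        for (double[] row : data) {
            for (int j = 0; j < row.length; j++) {
                min[j] = Math.min(min[j], row[j]);
            }
        }
        return min;
    }

    /** Largest value in each column of data (one row per example). */
    static double[] columnMax(double[][] data) {
        double[] max = new double[data[0].length];
        Arrays.fill(max, Double.NEGATIVE_INFINITY);
        for (double[] row : data) {
            for (int j = 0; j < row.length; j++) {
                max[j] = Math.max(max[j], row[j]);
            }
        }
        return max;
    }

    /** Rescales value to [0, 1] given the attribute's range; constant attributes map to 0. */
    static double normalise(double value, double min, double max) {
        if (max == min) {
            return 0.0; // avoid division by zero when every example has the same value
        }
        return (value - min) / (max - min);
    }

    public static void main(String[] args) {
        // Two numeric columns (Transactions, Entities) from three rows of the data set.
        double[][] data = {{253, 52}, {197, 124}, {40, 60}};
        double[] min = columnMin(data);
        double[] max = columnMax(data);
        System.out.println(Arrays.toString(min)); // prints [40.0, 52.0]
        System.out.println(Arrays.toString(max)); // prints [253.0, 124.0]
        System.out.println(normalise(253, min[0], max[0])); // prints 1.0
    }
}
```

Rescaling of this kind keeps an attribute with a large range, such as Transactions, from dominating a distance that also includes a small-range attribute such as TeamExp.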
KnnParent can be downloaded from:
KnnParent’s Javadoc can be found here:
You will need to create another class called MyKnn extending KnnParent. I have created a template
for your MyKnn class, containing some methods required for integrating your class with WEKA.
You can download it from:
(4) Downloading data set
In order to help you with testing your algorithm, please download the following data set:
This data set can be used to create models for estimating software effort. It was obtained
from a former version of the SEACRAFT Repository. The original data set contained some
examples with missing input attribute values, which have been removed for the purpose of this
exercise. It also contained more than one dependent variable. I have processed it so that it
contains only the required effort as a dependent variable.
Here are the first lines of this data set:
%% EDITED TO CONTAIN ONLY ONE DEPENDENT VARIABLE.
%% DEPENDENT VARIABLE (EFFORT) WAS MOVED AS THE LAST ATTRIBUTE.
%% Projects with missing attributes eliminated.
@relation 'desharnais.csv-weka.filters.unsupervised.attribute.Remove-R1,5'
@attribute TeamExp numeric
@attribute ManagerExp numeric
@attribute YearEnd numeric
@attribute Transactions numeric
@attribute Entities numeric
@attribute PointsNonAdjust numeric
@attribute Adjustment numeric
@attribute PointsAjust numeric
@attribute Language {lang1,lang2,lang3}
@attribute Effort numeric
@data
1,4,85,253,52,305,34,302,lang1,5152
0,0,86,197,124,321,33,315,lang1,5635
4,4,85,40,60,100,18,83,lang1,805
0,0,86,200,119,319,30,303,lang1,3829
0,0,86,140,94,234,24,208,lang1,2149
0,0,86,97,89,186,38,192,lang1,2821
2,1,85,119,42,161,25,145,lang2,2569
1,2,83,186,52,238,25,214,lang1,3913
The lines starting with “%%” are comments.
@relation displays some information about how this data set was created / processed. You do not
need to worry about this information.
Each line containing @attribute represents one input or output attribute. It contains the name and
type of this attribute. For instance, numeric indicates a numerical attribute, whereas
{lang1,lang2,lang3} indicates a categorical attribute whose value can be either “lang1”, “lang2” or
“lang3”.
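WEKA reads these files itself, so you do not need to parse them by hand. Purely to illustrate how the declarations above are structured, here is a minimal sketch that splits an @attribute line into its name and type. The class AttributeLineSketch and its methods are hypothetical, and the sketch assumes attribute names contain no spaces or quotes, as is the case in this data set.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical illustration of the "@attribute name type" layout; not WEKA's parser. */
final class AttributeLineSketch {

    /** Splits a declaration into the keyword, the name and the type. */
    private static String[] split(String line) {
        String[] parts = line.trim().split("\\s+", 3);
        if (parts.length != 3 || !parts[0].equalsIgnoreCase("@attribute")) {
            throw new IllegalArgumentException("Not an attribute declaration: " + line);
        }
        return parts;
    }

    /** Returns the attribute name, e.g. "TeamExp". */
    static String nameOf(String line) {
        return split(line)[1];
    }

    /** Returns true when the declared type is the keyword "numeric". */
    static boolean isNumeric(String line) {
        return split(line)[2].trim().equalsIgnoreCase("numeric");
    }

    /** Returns the allowed values of a categorical attribute, or an empty list otherwise. */
    static List<String> categories(String line) {
        String type = split(line)[2].trim();
        if (!type.startsWith("{") || !type.endsWith("}")) {
            return Collections.emptyList();
        }
        String[] values = type.substring(1, type.length() - 1).split(",");
        for (int i = 0; i < values.length; i++) {
            values[i] = values[i].trim();
        }
        return Arrays.asList(values);
    }

    public static void main(String[] args) {
        String language = "@attribute Language {lang1,lang2,lang3}";
        System.out.println(nameOf(language));     // prints Language
        System.out.println(isNumeric(language));  // prints false
        System.out.println(categories(language)); // prints [lang1, lang2, lang3]
    }
}
```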
Note that even though the output attribute is the last one in the desharnais data set, the output
attribute could have been in any other position. For example, there could be a data set where
the output attribute is listed as the third out of ten attributes. Similarly, the categorical input