http://localhost:8888/notebooks/Project1.ipynb 1/5
Project tasks
Please rename this file so that you know which copy you have been working in. Keep a copy
safe (especially if you are working in the online Jupyter service). You can download a copy by
choosing -File- then -Download as- Notebook from the menu above.
Complete all of the tasks
Make sure to use good code style, and comment your code appropriately.
Task 1 - Code review
This task is to write a code review, not to write python code to solve the problem brief.
A colleague has been asked to write some code to answer the following brief:
Brief:
The determinant of an matrix can be calculated recursively by "row/column expansion", for
any given column :
where is the element of in the -th row and -th column, and is the matrix
obtained from by deleting the -th row and -th column.
The above formula works for any (you can just use : expansion of the first column).
You should:
Write a function to (recursively) work out the determinant of a two-dimensional Numpy array
without using the inbuilt functions.
Write a further function which determines the volume of the parallelepiped spanned by the
vectors , , , which is given by the absolute value of:
using your determinant function.
Your task:
g4Fg6Dg4F g22g4B
g45g46g55g9g22gAg1E g9g131g12 g45g46g55g9 gAg22
g4Ag1Eg12
g4F gAg9g4AgCg4BgA g42
g4Ag4B g22g67g4Ag4B
g42g4Ag4B g22 g4A g4B g22g67g4Ag4B g9g4Fg131g12gAg6Dg9g4Fg131g12gA
g22 g4A g4B
g4B g4B g1Eg12
g9 gD gD gAg59g12 g5Ag12 g5Bg12 g9 gD gD gAg59g13 g5Ag13 g5Bg13 g9 gD gD gAg59g14 g5Ag14 g5Bg14
g45g46g55
g21
g1F
g20g20g59g12g59g13
g59g14
g5Ag12
g5Ag13
g5Ag14
g5Bg12
g5Bg13
g5Bg14
g24
g22
g23g23
11/1/2018 Project1
http://localhost:8888/notebooks/Project1.ipynb 2/5
You have been asked to write a review of their code. Here is the code they produced:
In [1]:
You should write your review here. Things you could choose to discuss:
Code structure
Code style
Does it answer the brief?
Does it work? If not could it be fixed?
Can you explain what it does?
Keep your answer relatively brief (approx. 500 words).
Type your review in this cell.
Task 2 - K-Nearest-neighbours
Task 2a - First implementation
1.0
1.0
from numpy import *
def my_det(A,n):
# Calculate the determinant of a matrix
if n==1:
return A[0,0] #Return a single value in this case
else:
det = 0.0
for i in range(0,n-1):
# Remove the row and column:
B = delete(A,i,axis=0)
C = delete(B,0,axis=1)
#print(C)
det = det + (-1)**(i+1)*A[i,0]*my_det(C,n-1) #implement the formula
return det
def paraVol(a,b,c):
'''
Get the volume of a parallelepiped spanned by a,b,c.
'''
A = array([[a[0],a[1],a[2]],[b[0],b[1],b[2]],[c[0],c[1],c[2]]])
return my_det(A,3)
# Test on the identity matrix:
A = identity(3)
print(my_det(A,3))
print(linalg.det(A)) #compare with built-in function
11/1/2018 Project1
http://localhost:8888/notebooks/Project1.ipynb 3/5
For this task you will need to implement a K-nearest-neighbours (https://en.wikipedia.org/wiki/K-
nearest_neighbors_algorithm) algorithm for classification from scratch (important - do not use a
version of this algorithm from another module e.g SKLearn - you need to write the functions
as directed yourself). The K-nearest-neighbours algorithm is a classic machine learning algorithm
used for classification problems (https://en.wikipedia.org/wiki/Statistical_classification). Classifying a
data item from a given dataset means deciding which of a number of classes the item belongs to.
This is done using a training set, containing a number of data items with known class.
We will consider first a simple implementation of the 3-nearest-neighbour version of the algorithm.
The algorithm proceeds as follows.
For each item in the dataset, we find the 3 nearest items (and their respective classes) in the
training set. We then attribute a class to our data item by majority voting amongst the classes of the
3 nearest neighbours. In the case where three classes are tied in the vote, we resolve the tie by
choosing the class of whichever neighbour is the nearest in the training set. (In practice this might
not be a very good thing to do - but we are just constructing a simple implementation for now).
Finally, we need to output the class we have decided on for each item in our original dataset.
The data is contained in two files main_data.txt and train_data.txt.
Each item in main_data.txt must be classified into one of 7 classes, using the training set
train_data.txt.
Each line of train_data.txt defines 1 training item, as a vector of floating-point values,
followed by its class label (a single integer in the range 1-7).
Each line of main_data.txt defines 1 data item, as only a vector of floating-point values,
without a class label.
To do this write 3 functions:
data_setup() which reads in both the dataset contained in main_data.txt, and the training
dataset contained in train_data.txt into two lists - returning the two lists (main_data, and
train_data) from the function. The lists containing each item of the dataset should use
appropriate types for each element of the list.
dist_vect(item,train_data) which takes as arguments one element of the list main_data,
and the list train_data, both from the output of the data_setup function. The function should
calculate the Euclidean distance from item in the dataset to each item in the training dataset in
order. The function should return a list of (distance, class) tuples.
decide_class(dv) which takes the output list of tuples of the dist_vect function as an
argument, and uses it to return the class of the item in the dataset.
Lastly use your functions to obtain a list containing the calculated class for each of the element
in the main_data list you have constructed.
In [ ]:
Task 2b - Changing the distance measure
11/1/2018 Project1
http://localhost:8888/notebooks/Project1.ipynb 4/5
Next write two different replacements for the dist_vect(item,train_data) function. The function you
have written above calculates the distance between data items using Euclidean distance as:
For this task write two more versions of the distance function using different distance metrics.
Firstly dist_vectL1(item,train_data) where distance is determined by using:
Next, dist_vectLinf(item,train_data) where distance is determined by using:
These are the and norms respectively.
Lastly use the new functions to obtain a list containing the calculated class for each of the
element in the main_data list you have constructed.
g45g9g59gDg5AgAg1Eg2 g9 g131 g3g22
g4Ag1Eg12
g4F g59
g4A g5Ag4AgAg13
g12g10g13
g45g9g59gDg5AgAg1E g131g22
g4Ag1Eg12
g4F g5D
g5Dg59g4A g5Ag4Ag5Dg5D
g45g9g59gDg5AgAg1E g131g4Eg42g59g4Ag1Eg12gFgFg4Fg5Dg5Dg59g4A g5Ag4Ag5Dg5D
g2Dg12 g2Dg13B
In [ ]:
Task 2c - Changing the decision methodology
The simple majority-voting implemented in your decide_class() function is unweighted --- once
the 3 nearest neighbours are found, they each have 1 vote, regardless of which is closest (unless
there is a tie). An alternative is weighted voting, where each of the nearest neighbours is
attributed a coefficient, such that the closest neighbours carry more weight in the result.
Write a function decide_class_wk(dv,k) which also takes the number of nearest neighbours as
an input argument. Your function should attribute a weight to each of the nearest neighbours,
inversely proportional to their distance from the data item:
The score for each class is determined by the sum of the weights of each item in that class amongst
the nearest neighbours. The class with the highest score is chosen for the data item.
(Note that the simple majority-voting corresponds to setting all the weights to 1.)
Test your function on the dataset, for and .
g4C
g4C
g58g4B g4C
g1E gD g4Bg12Dg5Cg12gDgD6gDg4Cg5Eg58g4B g12g9g59gDg5AgAg45
g4B
g4C
g4Cg1Eg14gDg19gD g13g16
In [ ]: