辅导COMP 250、讲解Java编程设计、辅导KDTree、讲解Java语言讲解留学生Prolog|辅导R语言程序

Assignment 3
COMP 250, Fall 2019
[document last modified: November 4, 2019]
General instructions
• Add code where instructed, namely in the ADD CODE HERE block. You may also add
helper methods and extra fields if you wish.
• All import statements in your KDTree.java file will be removed, except for the ones that
are in the starter code.
• The starter code includes a tester class that you can run to test if your methods are correct.
• Your code will be tested on valid inputs only.
• You may submit multiple times, e.g. if you realize you made an error after you submit. For
any run of the Grader, you will receive the best score on your submissions.
• Late assignments will be accepted up to two days late and will be penalized by 10 points per
day. If you submit one minute late, this is equivalent to submitting 23 hours and 59 minutes
late, etc. So, make sure you are nowhere near that threshold when you submit.
• See the Submission Instructions at the end of this document for more details.
• Search for the keyword updated to find any the places where the PDF has been altered since
the original posting.
1 Introduction
In this assignment, we extend the powerful idea of binary search trees from one dimension to
multiple dimensions. In particular, we will work with points in <
k where k can be any positive
integer. Such points might represent image data, audio data, or text. For example, machine learning
methods often map such common data to feature vectors in a k − D feature space, and then use
these features in various ways.1
In binary search, one is given a key and tries to find that key amongst many that are stored in a
data structure. We will examine a related but different problem for k − D data. Our problem will
be to take a given point, which we will call a query point rather than a “key”, and search for the
nearest point in the data set. In the field of Machine Learning, the nearest point is typically referred
to as the nearest neighbor.
One way to find the nearest point would be to iterate by brute force over all data points, compute
the distance from the query point to each data point, and then return the data point that has
the minimum distance.2 However, this brute force search is inefficient, for the same reason that
linear search through a list is inefficient relative to binary search, since in the worst case it requires
examining each data point. One would instead like to have a “nearest point” method that is must
faster, by restricting the search to a small subset of the data points.
1.1 KD-Trees
One way to restrict the search for a query point is to use a tree data structure called a kd-tree. This
is a binary tree which is similar to a binary search tree, but there are important differences. One
difference between a kd-tree and a binary search tree concerns the data itself. In a binary search
tree, the data points (keys) are ordered as if on a line, or in Java terms they are Comparable. In a
kd-tree, there is no such overall ordering. Instead, we only have an ordering within each of the k
dimensions, and this ordering can differ between dimensions. For example, consider k = 2, and let
points be denoted (x1, x2). We can order two points by their x1 value and we can order two points
by their x2 value, but we don’t want to order (x1, x2) and (x01, x02) in general. For example, in the
case x1 < x01 but x2 > x02, we do not want to impose an ordering of these two points.
Another important difference is how binary search trees and kd-trees are typically used. With
a binary search tree, one typically searches for an exact match for the search key. With a kd-tree,
one typically searches for the nearest data point. We will use the Euclidean distance. Give a query
point xq ∈ Z
k we want to find a data point x[ ] that minimizes the squared distance,
sqrdist(xq[ ], x[ ]) ≡Xi(xq[i] − x[i])2.
We used squared distance rather than distance because taking the square root buys us no extra
information about which point is closest.
Note that there may be multiple data points that are the same (squared) distance from a query
point xq. One could define the problem slightly differently by returning all points in this non-unique
case. Or, one could ask for the nearest n points, and indeed this is common. We will just keep it
simple and ask for a single nearest point. If multiple points lie at that same nearest distance, then
any of these points may be returned.
Kd-trees which are a particular type of binary tree. Each internal node implicitly defines a
k−dimensional region in