
CS5487 Programming Assignment 2
Clustering
Antoni Chan
Department of Computer Science
City University of Hong Kong
In this programming assignment you will implement and test several clustering algorithms on both
synthetic and real data. Given a set of points X = {x_1, ..., x_n}, where x_i ∈ R^d, the goal is to
assign a cluster label to each point, y_i ∈ {1, ..., K}, where K is the number of clusters. In this
programming assignment, you will consider three clustering algorithms:
1. K-means algorithm - The K-means algorithm calculates the cluster centers μ_j from the
current assignment of points to each cluster (K is assumed to be known). In each iteration, K-means
performs the following two steps: each center μ_j is updated to the mean of the points currently
assigned to cluster j, and the cluster label for point x_i is the label of the closest cluster center,
y_i = argmin_j ||x_i - μ_j||².
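As a concrete reference, the two alternating steps can be sketched in Python with NumPy. This is only a minimal illustration (the function name, random initialization, and convergence check are choices made here, not part of the assignment spec); your submission should be your own implementation:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal K-means sketch. X: (d, n) array, columns are points."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    # initialize centers with K randomly chosen data points
    centers = X[:, rng.choice(n, K, replace=False)].astype(float)
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        # assignment step: label of the closest cluster center
        dists = ((X[:, None, :] - centers[:, :, None]) ** 2).sum(axis=0)  # (K, n)
        new_labels = dists.argmin(axis=0)
        if np.array_equal(new_labels, labels):
            break                      # assignments stopped changing
        labels = new_labels
        # update step: each center becomes the mean of its assigned points
        for j in range(K):
            if np.any(labels == j):
                centers[:, j] = X[:, labels == j].mean(axis=1)
    return labels, centers
```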
2. EM algorithm for Gaussian mixture models (EM-GMM) - The EM algorithm cal-
culates a maximum likelihood estimate of a GMM with K components, with parameters
{π_j, μ_j, Σ_j} (the component weights, means, and covariances).
After convergence, the cluster label for x_i is the component with the largest posterior proba-
bility, y_i = argmax_j ẑ_ij, where ẑ_ij = p(z_i = j | x_i) is the responsibility of component j
for x_i. The document "gmm-tips", available on the course website, contains
some useful tips on implementing EM for GMMs.
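A minimal EM sketch in NumPy is shown below, assuming full covariances and a small regularizer added to each Σ_j for numerical stability (the regularizer, initialization, and fixed iteration count are choices made here for illustration). The E-step works in the log domain, as suggested by the gmm-tips document:

```python
import numpy as np

def em_gmm(X, K, n_iter=100, seed=0, reg=1e-6):
    """Minimal EM sketch for a GMM. X: (d, n) array, columns are points."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    pi = np.full(K, 1.0 / K)                       # component weights
    mu = X[:, rng.choice(n, K, replace=False)].astype(float)
    Sigma = np.stack([np.cov(X) + reg * np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: responsibilities z_ij = p(component j | x_i), in log domain
        logp = np.empty((K, n))
        for j in range(K):
            diff = X - mu[:, [j]]
            inv = np.linalg.inv(Sigma[j])
            _, logdet = np.linalg.slogdet(Sigma[j])
            logp[j] = (np.log(pi[j])
                       - 0.5 * (d * np.log(2 * np.pi) + logdet)
                       - 0.5 * (diff * (inv @ diff)).sum(axis=0))
        logp -= logp.max(axis=0)                   # stabilize before exp
        Z = np.exp(logp)
        Z /= Z.sum(axis=0)
        # M-step: re-estimate parameters from the responsibilities
        Nj = Z.sum(axis=1)
        pi = Nj / n
        mu = (X @ Z.T) / Nj
        for j in range(K):
            diff = X - mu[:, [j]]
            Sigma[j] = (Z[j] * diff) @ diff.T / Nj[j] + reg * np.eye(d)
    return Z.argmax(axis=0), pi, mu, Sigma
```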
3. Mean-shift algorithm - The mean-shift algorithm uses gradient ascent with an adaptive
step-size to find local peaks in a kernel density estimate of X. We will use a Gaussian kernel
with bandwidth h. Given an initial point x̂^(0) (where the superscript (t) denotes the iteration
number), the mean-shift algorithm update rule is:

    x̂^(t+1) = [ Σ_{i=1}^n x_i N(x_i | x̂^(t), h²I) ] / [ Σ_{i=1}^n N(x_i | x̂^(t), h²I) ].    (6)
To obtain the clustering, the mean-shift algorithm is run using each data point x_i as an initial
point. After convergence, the points that converged to the same local peak are given the same
cluster label. In this case, the number of clusters K is not fixed (and depends on the selected
bandwidth h).
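The update rule (6) and the peak-merging step can be sketched as follows. Merging peaks that lie within distance h of each other is a heuristic chosen here (the handout does not prescribe a merge radius), and the tolerance and iteration cap are likewise illustrative:

```python
import numpy as np

def mean_shift(X, h, tol=1e-5, max_iter=500):
    """Mean-shift sketch with a Gaussian kernel of bandwidth h.
    X: (d, n) array, columns are points. Returns labels and found modes."""
    d, n = X.shape
    peaks = np.empty((d, n))
    for i in range(n):
        x = X[:, i].astype(float)
        for _ in range(max_iter):
            # Gaussian kernel weights N(x_i | x, h^2 I), up to a constant
            w = np.exp(-((X - x[:, None]) ** 2).sum(axis=0) / (2 * h ** 2))
            x_new = (X * w).sum(axis=1) / w.sum()      # update rule (6)
            if np.linalg.norm(x_new - x) < tol:
                x = x_new
                break
            x = x_new
        peaks[:, i] = x
    # points whose peaks coincide (within h, a heuristic) share a label
    labels = -np.ones(n, dtype=int)
    modes = []
    for i in range(n):
        for k, m in enumerate(modes):
            if np.linalg.norm(peaks[:, i] - m) < h:
                labels[i] = k
                break
        else:
            modes.append(peaks[:, i].copy())
            labels[i] = len(modes) - 1
    return labels, np.array(modes).T
```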
Problem 1 Clustering synthetic data
In the first problem, you will test these clustering methods on synthetic data, and examine how each
method performs on different configurations of data. The zip file PA2-cluster-data.zip contains
the synthetic data files. Inside the MATLAB data file cluster_data.mat (or cluster_data_*.txt
if you are not using MATLAB) are three data sets (points and ground-truth labels). The file
contains the following variables:
dataA_X - dataset A points. Each column is a data point, x_i ∈ R².
dataA_Y - dataset A true labels, {y_i}.
dataB_X - dataset B points.
dataB_Y - dataset B true labels.
dataC_X - dataset C points.
dataC_Y - dataset C true labels.
The three datasets are shown in the following figure, where the color represents the ground-truth
labels.
.........
Next, a clustering algorithm is used to group the feature vectors, and a cluster label is assigned to
each corresponding pixel, forming the segmentation.
The zip file PA2-cluster-images.zip contains the images and MATLAB code for ex-
tracting the features, and creating the segmentation image from the cluster labels. The following
MATLAB functions are provided:
getfeatures.m - MATLAB function for extracting features from an image.
labels2segm.m - MATLAB function for generating a segmentation image from the cluster
labels.

colorsegm.m - MATLAB function to create a nicely colored segmentation image.
As an example, the following code was used to generate the above segmentation images:
% read the image and view it
img = imread('images/12003.jpg');
subplot(1,3,1); imagesc(img); axis image;
% extract features (stepsize = 7)
[X, L] = getfeatures(img, 7);
XX = [X(1:2,:) ; X(3:4,:)/10]; % downscale the coordinate features (see part (b))
% run kmeans -- this is the MATLAB version. You have to write your own!
[Y, C] = kmeans(XX’, 2);
% make a segmentation image from the labels
segm = labels2segm(Y, L);
subplot(1,3,2); imagesc(segm); axis image;
% color the segmentation image
csegm = colorsegm(segm, img);
subplot(1,3,3); imagesc(csegm); axis image
Documentation for each function can be viewed in MATLAB using "help getfeatures", etc.
For Python users, the file PA2-cluster-python.zip contains Python versions of the above
demo code and helper functions for extracting features. This Python code requires the numpy,
scipy, matplotlib, and Image modules. Alternatively, the file PA2-cluster-imfeatures-txt.zip
contains text files of the extracted image features for a step size of 7.
(a) Use the three clustering algorithms to segment a few of the provided images. Qualitatively,
which algorithm gives better results? How do the results change with different K and h?
Which is less sensitive to changes in the parameters? Comment on any interesting properties
or limitations observed about the clustering algorithms.
The feature vector x contains two types of features that are on different scales; the chrominance
values (u, v) range from 0-255, while the pixel locations (cx, cy) range from 0-512. The EM-GMM
clustering algorithm can adapt to the different scales of each feature by changing the covariance
matrix. On the other hand, K-means and mean-shift assume a fixed isotropic covariance matrix.
We can modify these two algorithms to allow different scaling of the features:
For K-means, the distance to a cluster center x' is modified to apply a weighting between the
feature types,

    d(x, x')² = ||x_c - x'_c||² + λ ||x_p - x'_p||²,    (7)

where λ ≥ 0 controls the weighting, and x_c = (u, v) and x_p = (cx, cy) denote the color and
pixel-location features.
For mean-shift, the kernel is modified to use separate bandwidths on each feature type,

    k(x, x̂) = N(x_c | x̂_c, h_c² I) N(x_p | x̂_p, h_p² I),    (8)

where h_c is the bandwidth for the color features, and h_p is the bandwidth for the pixel
locations.
(b) Modify your K-means and mean-shift implementations to allow different feature scaling. Hint:
changing the distance in (7) or kernel in (8) is equivalent to scaling each dimension in the feature
vector x by an appropriate amount. Rerun the segmentation experiment. Do the segmentation
results improve?
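The hint can be made concrete with a small sketch: multiplying the pixel-location rows of the feature matrix by √λ makes the ordinary squared Euclidean distance equal to the weighted distance (7), so the unmodified K-means code can be reused (the same idea, dividing each feature type by its bandwidth, handles (8) for mean-shift, and mirrors the `X(3:4,:)/10` line in the demo code). The function name and the 4-row feature layout from getfeatures are assumptions of this illustration:

```python
import numpy as np

def scale_features(X, lam):
    """Rescale pixel-location features so that plain squared Euclidean
    distance on the result equals the weighted distance (7):
        ||xc - x'c||^2 + lam * ||xp - x'p||^2.
    X: (4, n) with rows 0-1 the chrominance (u, v) and rows 2-3 the
    pixel locations (cx, cy)."""
    XX = X.astype(float).copy()
    XX[2:4] *= np.sqrt(lam)    # sqrt so the *squared* distance is weighted by lam
    return XX
```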
.........
Problem 3 Quantitative Evaluation of Clustering (Optional)
In this optional problem, you will quantitatively evaluate your results from Problems 1 and 2.
Consider a set S = {s_1, ..., s_n} with n elements, and two partitions of S into subsets, Y =
{Y_1, ..., Y_R} and Z = {Z_1, ..., Z_C}, where Y_i are the subsets of Y, and similarly for Z_i and Z.
The Rand index can be used to quantitatively measure the amount of agreement between
the two clusterings of the set S. Intuitively, the Rand index corresponds to the probability of
pair-wise agreement between Y and Z, i.e. the probability that the assignment of any two
items will be correct with respect to each other (in the same cluster, or in different clusters).
Formally, the Rand index is calculated as follows. Define the following counts:
a - the number of pairs in S that are in the same set in Y and in the same set in Z,
b - the number of pairs in S that are in different sets in Y and in different sets in Z,
c - the number of pairs in S that are in the same set in Y and in different sets in Z,
d - the number of pairs in S that are in different sets in Y and in the same set in Z,
then the Rand index is the fraction of pairwise agreements,

    R = (a + b) / (a + b + c + d) = (a + b) / C(n, 2).    (9)
The Rand index can be efficiently calculated from a contingency table:

                     Partition Z
                Z1    Z2   . . .   ZC  | Sums
    Partition Y
            Y1  n11   n12  . . .   n1C | r1
            Y2  n21   n22  . . .   n2C | r2
            ..  ..    ..           ..  | ..
            YR  nR1   nR2  . . .   nRC | rR
    Sums        c1    c2   . . .   cC  | n

Each entry n_ij is the number of items in S that are common to Y_i and Z_j, and c_j and r_i are the
sums over the jth column and ith row, respectively. Using the contingency table, the number of
agreements (the numerator in (9)) can be calculated as

    a + b = C(n, 2) + Σ_ij n_ij² - (1/2) ( Σ_i r_i² + Σ_j c_j² ).
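This contingency-table identity can be checked with a small NumPy sketch that takes two label vectors (the function name is illustrative, and the inputs here are flat label arrays rather than explicit partitions):

```python
import numpy as np

def rand_index(Y, Z):
    """Rand index between two label vectors, via the contingency table:
    (a + b) = C(n,2) + sum_ij n_ij^2 - (1/2)(sum_i r_i^2 + sum_j c_j^2)."""
    Y = np.asarray(Y)
    Z = np.asarray(Z)
    n = len(Y)
    _, Yi = np.unique(Y, return_inverse=True)   # map labels to 0..R-1
    _, Zi = np.unique(Z, return_inverse=True)   # map labels to 0..C-1
    N = np.zeros((Yi.max() + 1, Zi.max() + 1))
    np.add.at(N, (Yi, Zi), 1)                   # contingency counts n_ij
    r = N.sum(axis=1)                           # row sums r_i
    c = N.sum(axis=0)                           # column sums c_j
    total = n * (n - 1) / 2                     # C(n, 2) pairs
    agreements = total + (N ** 2).sum() - 0.5 * ((r ** 2).sum() + (c ** 2).sum())
    return agreements / total
```

Note that the Rand index is invariant to permutations of the label values, so the clustering output need not use the same label numbering as the ground truth.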
(a) Problem 1: Use the Rand index to evaluate your clusterings against the ground-truth clusters.
Which algorithm performs best overall, and for each individual dataset? How do the results
change with the hyperparameter value (K or h)? (e.g., make a plot)
(b) Problem 2: The ground-truth segmentations for the images are available in the "gtruth" direc-
tory of the zip file. The file names are "<imgname>-<userid>.png", where <imgname> is the
image filename and <userid> is the ID of the human who marked the segmentation. In the
segmentation image, each gray value corresponds to one segment. Note that the segment labels
(gray values) are evenly spread out between 0-255 so that the segmentations can be viewed
(you may want to remap them back to 1-K). Use the Rand index to evaluate your segmentations.
Overall, which algorithm performs best at segmentation? How do the results change with the
hyperparameters (K or h)? (e.g., make a plot).
References for the Rand index are available here:
W. M. Rand. "Objective criteria for the evaluation of clustering methods". Journal of the American
Statistical Association, 66 (336): 846-850, 1971.
Lawrence Hubert and Phipps Arabie. "Comparing partitions". Journal of Classification, 2 (1): 193-
218, 1985.
.........
What to hand in - You need to turn in the following things:
1. Answers, plots, analysis, discussion, etc. for the above problems.
2. Source code files.
You must submit your programming assignment using the BlackBoard website. Go to "As-
signments" -> "Assignment submissions" and create a journal entry.
Plagiarism - You must implement each clustering algorithm using your own code. Do
not use someone else's implementation of the clustering algorithms (including MATLAB's
implementation). It is okay to use common library components and functions, e.g. standard
matrix operations (e.g., inv, eig). If you use any special toolboxes or libraries, make sure to
note them in your report.
Grading - The marks for this assignment will be distributed as follows:
- 20% - Implementation of the K-means, EM, and mean-shift algorithms (1a).
- 20% - Results on toy datasets (1b).
- 10% - Sensitivity to parameters (1c).
- 20% - Results on image segmentation and sensitivity to parameters (2a).
- 10% - Results using feature scaling (2b).
- 20% - Quality of the written report. More points for insightful observations and analysis.
Note: if you really cannot implement the algorithms correctly, it is okay to use 3rd party
software. You will not get marks for implementation, but you can still get marks for presenting
the results. If you use a 3rd party implementation, you must acknowledge the source in your
report.
Optional Problems - The optional problem 3 will not be graded.
