
Stat 428 Final Project

In the lecture, we discussed the nearest neighbor test and the energy distance test for the two-sample
testing problem. We consider two more tests: Hotelling's T-square test and the graph-based two-sample
test. Suppose the observed data are $X_1, \dots, X_n$ and $Y_1, \dots, Y_m$, where $X_i, Y_j \in \mathbb{R}^d$ are multivariate
random vectors. Here, $X_1, \dots, X_n$ are drawn from distribution $F$ and $Y_1, \dots, Y_m$ are drawn from distribution
$G$. The hypothesis of interest in the two-sample testing problem is
$$H_0 : F = G \quad \text{vs} \quad H_1 : F \neq G.$$
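• Hotelling's T-square test statistic is defined as
$$T^2 = \frac{nm}{n+m} (\bar{X} - \bar{Y})^T \hat{\Sigma}^{-1} (\bar{X} - \bar{Y}),$$
where
$$\hat{\Sigma} = \frac{1}{n+m-2} \left[ (n-1)\hat{\Sigma}_X + (m-1)\hat{\Sigma}_Y \right].$$
Here, the sample means and sample covariances are defined as
$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i, \qquad \bar{Y} = \frac{1}{m} \sum_{i=1}^{m} Y_i$$
and
$$\hat{\Sigma}_X = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})^T, \qquad \hat{\Sigma}_Y = \frac{1}{m-1} \sum_{i=1}^{m} (Y_i - \bar{Y})(Y_i - \bar{Y})^T.$$
• The graph-based two-sample test is defined in the following way. We pool all data together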
$$\{Z_1, \dots, Z_{n+m}\} = \{X_1, \dots, X_n, Y_1, \dots, Y_m\}.$$
Based on these $n + m$ observations, we construct a graph $G = (V, E)$ such that the set of vertices is
$V = \{1, \dots, n+m\}$ and there is an edge between $i$ and $j$ if $\|Z_i - Z_j\| \le Q$, where $Q$ is a positive
number. Let $E$ be the collection of edges. The graph-based two-sample test statistic is defined as
$$R = \frac{1}{|E|} \sum_{e \in E} I_e,$$
where $|E|$ is the number of edges in the edge set $E$. Here, $I_e = 1$ if the two vertices connected by $e$
have the same label (both come from the X sample or both from the Y sample) and $I_e = 0$ otherwise.
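As a rough illustration of the two definitions above, the following R sketch computes both statistics directly from raw data. The function names `hotelling_t2` and `graph_stat` and their signatures are hypothetical and are not the interfaces required in Question 1. The sketch also assumes Euclidean distance for the graph and returns `NA` when the graph has no edges, a case the definition leaves open.

```r
# Hotelling's T-square from the definition.
# x: n x d matrix, y: m x d matrix (each row is an observation).
hotelling_t2 <- function(x, y) {
  n <- nrow(x)
  m <- nrow(y)
  diff <- colMeans(x) - colMeans(y)
  # cov() uses the 1/(n-1) normalization, matching Sigma_X and Sigma_Y.
  pooled <- ((n - 1) * cov(x) + (m - 1) * cov(y)) / (n + m - 2)
  as.numeric((n * m / (n + m)) * (t(diff) %*% solve(pooled, diff)))
}

# Graph-based statistic: fraction of edges that join same-label vertices.
# z: pooled (n+m) x d matrix, labels: group label per row, Q: threshold.
graph_stat <- function(z, labels, Q) {
  dst <- as.matrix(dist(z))                      # Euclidean distances (assumption)
  edges <- which(dst <= Q & upper.tri(dst), arr.ind = TRUE)
  if (nrow(edges) == 0) return(NA_real_)         # R is undefined without edges
  mean(labels[edges[, 1]] == labels[edges[, 2]])
}

# Toy data: two well-separated groups of three points each.
x <- rbind(c(0, 0), c(1, 0), c(0, 1))
y <- rbind(c(5, 5), c(6, 5), c(5, 6))
z <- rbind(x, y)
labels <- c(1, 1, 1, 2, 2, 2)

r_small <- graph_stat(z, labels, Q = 1.5)   # only within-group edges
r_large <- graph_stat(z, labels, Q = 100)   # complete graph: 6 of 15 edges match
```

In one dimension, $T^2$ reduces to the square of the pooled-variance two-sample t statistic, which gives a convenient sanity check against `t.test(..., var.equal = TRUE)`.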
You need to submit both the Rmd and pdf files for Questions 1-4. Do NOT zip them together, as a
zip file cannot be previewed in Canvas. You may receive a penalty if the wrong format is submitted.
Question 1 Test Implementation (15 points)
In this question, you are required to implement these four two-sample methods from scratch: nearest neighbor
test, energy distance test, Hotelling’s T-square test, and graph-based two-sample test. Specifically, you need
to implement two functions for each method: one calculates the test statistic; the other
makes the decision by a permutation test.
• For the nearest neighbor test, you need to implement NNT(z, ix, sizes, R) and NNT.perm(z, sizes,
R, alpha, B), where:
– z is the data matrix (each row is an observation),
– ix is a permutation of the row indices of z,
– sizes is a vector of sample sizes,
– R is the number of nearest neighbors,
– alpha is the significance level,
– and B is the number of replicates in the permutation test.
NNT(z, ix, sizes, R) returns the value of the test statistic and NNT.perm(z, sizes, R, alpha, B)
returns the decision on whether the null hypothesis is rejected.
• For the energy distance test, you need to implement EBT(dst, ix, sizes) and EBT.perm(dst,
sizes, alpha, B), where:
– dst is the distance matrix of the data set,
– ix is a permutation of the row indices of dst,
– sizes is a vector of sample sizes,
– alpha is the significance level,
– and B is the number of replicates in the permutation test.
EBT(dst, ix, sizes) returns the value of the test statistic and EBT.perm(dst, sizes, alpha, B)
returns the decision on whether the null hypothesis is rejected.
• For Hotelling's T-square test, you need to implement HTT(z, ix, sizes) and HTT.perm(z, sizes,
alpha, B), where:
– z is the data matrix (each row is an observation),
– ix is a permutation of the row indices of z,
– sizes is a vector of sample sizes,
– alpha is the significance level,
– and B is the number of replicates in the permutation test.
HTT(z, ix, sizes) returns the value of the test statistic and HTT.perm(z, sizes, alpha, B) returns
the decision on whether the null hypothesis is rejected.
• For the graph-based two-sample test, you need to implement GST(dst, ix, sizes, Q) and
GST.perm(dst, sizes, Q, alpha, B), where:
– dst is the distance matrix of the data set,
– ix is a permutation of the row indices of dst,
– sizes is a vector of sample sizes,
– Q is the threshold in the graph construction,
– alpha is the significance level,
– and B is the number of replicates in the permutation test.
GST(dst, ix, sizes, Q) returns the value of the test statistic and GST.perm(dst, sizes, Q, alpha, B)
returns the decision on whether the null hypothesis is rejected.
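All four `.perm` functions follow the same pattern: compute the statistic on the observed labelling, recompute it on B random relabellings, and compare. The sketch below shows that pattern with a hypothetical helper `perm_reject` and a deliberately simple placeholder statistic `mean_gap`, which is not one of the four required statistics. It assumes that large values of the statistic are evidence against the null hypothesis and uses the common (1 + count) / (B + 1) p-value. Both choices are assumptions and not part of the assignment text.

```r
# stat_fn(data, ix, sizes, ...) returns the statistic when the first
# sizes[1] entries of ix index the X sample and the rest index the Y sample.
perm_reject <- function(stat_fn, data, sizes, alpha = 0.05, B = 199, ...) {
  N <- sum(sizes)
  observed <- stat_fn(data, seq_len(N), sizes, ...)
  permuted <- vapply(
    seq_len(B),
    function(b) stat_fn(data, sample.int(N), sizes, ...),
    numeric(1)
  )
  # Assumption: large values of the statistic are evidence against H0.
  p_value <- (1 + sum(permuted >= observed)) / (B + 1)
  list(statistic = observed, p.value = p_value, reject = p_value <= alpha)
}

# Placeholder statistic: squared distance between the two group means.
mean_gap <- function(z, ix, sizes) {
  zx <- z[ix[seq_len(sizes[1])], , drop = FALSE]
  zy <- z[ix[-seq_len(sizes[1])], , drop = FALSE]
  sum((colMeans(zx) - colMeans(zy))^2)
}

set.seed(428)
z <- rbind(matrix(rnorm(40), 20, 2), matrix(rnorm(40, mean = 3), 20, 2))
res <- perm_reject(mean_gap, z, sizes = c(20, 20), alpha = 0.05, B = 199)
```

For the distance-based tests, the same helper would be called with `dst` as `data`, and the statistic would index it as `dst[ix, ix]`.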
Question 2 Choice of Tuning Parameter and Distance (10 points)
Several parts in these four tests can be customized. In this question, you need to use simulation experiments
to make recommendations for the choices of tuning parameters and distances. Specifically, we consider the
following tuning parameters and distances:
• In the nearest neighbor test, how should we choose the number of nearest neighbors R?
• In the energy distance test, how should we choose the specific form of distance? For example, we may
use the $L_p$ distance
$$\|X - Y\|_{\ell_p} = \left( \sum_{i=1}^{d} |X_i - Y_i|^p \right)^{1/p}.$$
What p should we use?
• In the graph-based two-sample test, how should we choose the threshold Q and the specific form of
distance?
You need to show some numerical experiments as your evidence.
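One way to vary the distance in such experiments is base R's `dist()`, which supports the $L_p$ (Minkowski) distance through `method = "minkowski"` and the argument `p`. The wrapper name `lp_dist` below is hypothetical, and the two-point data set is only a worked example.

```r
# Pairwise L_p (Minkowski) distance matrix for the rows of z.
lp_dist <- function(z, p = 2) {
  as.matrix(dist(z, method = "minkowski", p = p))
}

z <- rbind(c(0, 0), c(3, 4))
d1 <- lp_dist(z, p = 1)[1, 2]                          # |3| + |4| = 7
d2 <- lp_dist(z, p = 2)[1, 2]                          # sqrt(9 + 16) = 5
dmax <- as.matrix(dist(z, method = "maximum"))[1, 2]   # limit p -> Inf: max(3, 4) = 4
```

The resulting matrix can be passed as `dst` to the distance-based tests, so the same test code can be rerun for several values of p.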
Question 3 Test Comparisons (15 points)
In this question, you are required to use simulation experiments to make recommendations for the choice of
these four two-sample tests. In particular, you need to answer the following questions:
• Which test is more suitable for low dimensional data set (i.e., d is small)? Which is better for high
dimensional data set (i.e. d is large)?
• Which test is more sensitive to different choices of the specific distribution of F or G?
• Are these tests able to control type I error?
• Which test is more powerful?
• Which test is more powerful?
• Which test is more computationally efficient?
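Type I error, power, and run time can all be estimated with the same simulation loop: generate data repeatedly, apply a test, and record how often it rejects. The sketch below is one possible skeleton. The helper `rejection_rate`, the generators, and especially `toy_test` (a Welch t-test on the first coordinate, used only as a stand-in) are hypothetical; in practice `toy_test` would be replaced by each of the four `.perm` functions.

```r
# Fraction of simulated data sets on which test_fn rejects H0.
# test_fn(x, y) must return TRUE (reject) or FALSE.
rejection_rate <- function(test_fn, gen_x, gen_y, nsim = 200) {
  mean(replicate(nsim, test_fn(gen_x(), gen_y())))
}

# Placeholder test (NOT one of the four methods): Welch t-test on coordinate 1.
toy_test <- function(x, y, alpha = 0.05) {
  t.test(x[, 1], y[, 1])$p.value <= alpha
}

d <- 5
gen_null  <- function() matrix(rnorm(30 * d), 30, d)               # draws from F
gen_shift <- function() matrix(rnorm(30 * d, mean = 1.5), 30, d)   # draws from G != F

set.seed(428)
type1 <- rejection_rate(toy_test, gen_null, gen_null)    # F = G: should be near alpha
power <- rejection_rate(toy_test, gen_null, gen_shift)   # F != G: should be near 1
elapsed <- system.time(rejection_rate(toy_test, gen_null, gen_null))["elapsed"]
```

Varying `d`, the sample sizes, and the generators (for example, heavy-tailed or scale-shifted alternatives) in this loop addresses the dimension and sensitivity questions with the same code.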
Question 4 Application to Real Data Set (10 points)
We’re going to look at a data set on 97 men who have prostate cancer (from the book The Elements of
Statistical Learning). There are 10 variables measured on these 97 men:
1. lpsa: log PSA score
2. lcavol: log cancer volume
3. lweight: log prostate cancer weight
4. age: age of patient
5. lbph: log of the amount of benign prostatic hyperplasia
6. svi: seminal vesicle invasion
7. lcp: log of capsular penetration
8. gleason: Gleason score
9. pgg45: percent of Gleason scores 4 or 5
10. train: if belonging to training data set
To load this prostate cancer data set and store it as a data frame pros.data, we can run the following:
pros.data = read.table("https://web.stanford.edu/~hastie/ElemStatLearn/datasets/prostate.data")
Based on this data set, we are interested in whether there is an overall difference between patients older than 65
(age>65) and patients aged 65 or younger (age<=65). The variables of interest are lpsa, lweight, lcp and lbph. We
can split the data set into two parts:
X=pros.data[pros.data$age>65, c('lpsa','lweight','lcp','lbph')]
Y=pros.data[pros.data$age<=65, c('lpsa','lweight','lcp','lbph')]
Then, you can apply these four tests (with the best choice of tuning parameters and distances) to X and Y.
What conclusion can you make?
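The functions in Question 1 take a pooled matrix `z`, a vector `sizes`, and (for two of the tests) a distance matrix `dst`, so X and Y need to be stacked first. The sketch below shows one way to do this. To stay self-contained it uses a small made-up data frame `toy` with the same column names in place of `pros.data`. The values are random and carry no meaning, and the Euclidean distance is an assumption.

```r
# Stand-in data frame with the same column names; replace `toy` by pros.data.
set.seed(428)
toy <- data.frame(
  age = c(60, 70, 66, 58, 72, 64),
  lpsa = rnorm(6), lweight = rnorm(6), lcp = rnorm(6), lbph = rnorm(6)
)
vars <- c("lpsa", "lweight", "lcp", "lbph")
X <- toy[toy$age > 65, vars]
Y <- toy[toy$age <= 65, vars]

z <- as.matrix(rbind(X, Y))      # pooled data, X rows first
sizes <- c(nrow(X), nrow(Y))     # c(n, m)
dst <- as.matrix(dist(z))        # Euclidean distance matrix (assumption)
```

With these objects, the identity permutation `seq_len(sum(sizes))` passed as `ix` corresponds to the observed labelling.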
Question 5 Report (20 points)
A pharmaceutical company would like to test whether the effects of two treatments are similar or not. The
manager wants to choose one two-sample testing method among the four mentioned above, and asks for your
opinion on choosing the best two-sample test. Could you prepare a report to provide some suggestions to this
manager? In this report, you need to summarize all your findings in Questions 1-4. The report should be
limited to two pages. You are encouraged to use figures to deliver your messages. The manager who reads
your report has only a minimal statistical background, so you may want to avoid technical terminology.
Question 6 Presentation and Slides (30 points)
Based on your report, could you prepare a 3-5 minute presentation to summarize your findings and
suggestions? Assume your audience is the manager from this pharmaceutical company, who has only a very
limited statistical background. Try to avoid technical terminology. In this question, you need to submit a
video (I need to see you in this video) and your slides (you need to use R Markdown and submit both the Rmd
and pdf files).
Question 7 R package (Bonus question: extra 10 points for the final project, and the total points of the final project may not exceed 100 points)
Could you prepare an R package to include all your four two-sample testing methods and a manual that
explains how to use these methods? To complete this question, you need to submit a compressed R package.
Submission Check List
• A report for Question 1-4 (Rmd and pdf), which can be long and technical.
• A short report for Question 5 (Rmd and pdf), which is limited to two pages.
• A video presentation (I need to see you in this video)
• Presentation slides for Question 6 (Rmd and pdf)
• A compressed R package for Question 7 (optional)
 