Stat 428 Final Project


In the lecture, we discussed the nearest neighbor test and the energy distance test for the two-sample testing problem. Here we consider two additional tests: Hotelling’s T-square test and the graph-based two-sample test. Suppose the observed data are $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_m$, where $X_i, Y_j \in \mathbb{R}^d$ are multivariate random vectors. Here, $X_1, \ldots, X_n$ are drawn from distribution $F$ and $Y_1, \ldots, Y_m$ are drawn from distribution $G$. The hypothesis of interest in the two-sample testing problem is

$$H_0 : F = G \quad \text{vs.} \quad H_1 : F \neq G.$$

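• Hotelling’s T-square test statistic is defined as

$$T^2 = \frac{nm}{n+m} (\bar{X} - \bar{Y})^T \hat{\Sigma}^{-1} (\bar{X} - \bar{Y}),$$

where

$$\hat{\Sigma} = \frac{1}{n+m-2} \left[ (n-1)\hat{\Sigma}_X + (m-1)\hat{\Sigma}_Y \right].$$

Here, the sample means and sample covariances are defined as

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i, \quad \bar{Y} = \frac{1}{m} \sum_{i=1}^{m} Y_i$$

and

$$\hat{\Sigma}_X = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})^T, \quad \hat{\Sigma}_Y = \frac{1}{m-1} \sum_{i=1}^{m} (Y_i - \bar{Y})(Y_i - \bar{Y})^T.$$

• The graph-based two-sample test is defined in the following way. We pool all data together: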

$$\{Z_1, \ldots, Z_{n+m}\} = \{X_1, \ldots, X_n, Y_1, \ldots, Y_m\}.$$

Based on these $n + m$ observations, we construct a graph $G = (V, E)$ such that the set of vertices is $V = \{1, \ldots, n+m\}$ and there is an edge between $i$ and $j$ if $\|Z_i - Z_j\| \le Q$, where $Q$ is a positive number. Let $E$ be the collection of edges. The graph-based two-sample test statistic is defined as

$$R = \frac{1}{|E|} \sum_{e \in E} I_e,$$

where $|E|$ is the number of edges in the edge set $E$. Here, $I_e = 1$ if the two vertices connected by $e$ have the same label (both come from the X sample or both from the Y sample) and $I_e = 0$ otherwise.
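The two definitions above can be sketched in R as follows. This is only an illustration under stated assumptions: the helper names hotelling_T2 and graph_stat are hypothetical (they are not the function names required later), Euclidean distance is assumed for the graph, and self-loops are excluded because the text does not say whether i = j counts as an edge.

```r
# Hotelling's T-square for two samples stored as matrices (rows = observations).
hotelling_T2 <- function(x, y) {
  x <- as.matrix(x); y <- as.matrix(y)
  n <- nrow(x); m <- nrow(y)
  delta <- colMeans(x) - colMeans(y)
  # Pooled covariance; cov() already uses the 1/(n-1) normalisation.
  S <- ((n - 1) * cov(x) + (m - 1) * cov(y)) / (n + m - 2)
  # solve(S, delta) avoids forming the inverse explicitly; S must be invertible.
  (n * m / (n + m)) * sum(delta * solve(S, delta))
}

# Graph-based statistic: share of edges joining two points with the same label.
graph_stat <- function(dst, labels, Q) {
  dst <- as.matrix(dst)
  # Keep each undirected edge once and skip self-loops (an assumption).
  edges <- which(dst <= Q & upper.tri(dst), arr.ind = TRUE)
  if (nrow(edges) == 0) return(NA_real_)   # statistic is undefined without edges
  mean(labels[edges[, 1]] == labels[edges[, 2]])
}

set.seed(428)
x <- matrix(rnorm(20 * 3), ncol = 3)
y <- matrix(rnorm(25 * 3, mean = 0.5), ncol = 3)
T2 <- hotelling_T2(x, y)

dst <- as.matrix(dist(rbind(x, y)))        # Euclidean distance (an assumption)
labels <- rep(c("X", "Y"), times = c(nrow(x), nrow(y)))
R_stat <- graph_stat(dst, labels, Q = median(dst))
```

In one dimension, hotelling_T2 reduces to the square of the pooled two-sample t statistic, which gives a quick sanity check. The choice Q = median(dst) is arbitrary and only keeps the example graph from being empty.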

You need to submit both the Rmd and pdf files for Questions 1-4. Do NOT zip them together, as a zip file cannot be previewed in Canvas. You may receive a penalty if the wrong format is submitted.

Question 1 Test Implementation (15 points)

In this question, you are required to implement these four two-sample methods from scratch: the nearest neighbor test, the energy distance test, Hotelling’s T-square test, and the graph-based two-sample test. Specifically, you need to implement two functions for each method: one calculates the test statistic; the other makes the decision by a permutation test.
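The split between a statistic function and a permutation-based decision function can be sketched generically. The helper perm_decision below is hypothetical, and it assumes that large values of the statistic count as evidence against the null hypothesis; the toy statistic (absolute difference in means) is a stand-in, not one of the four required tests.

```r
# Hypothetical helper: a generic permutation decision rule.
# stat_fun(ix) must return the statistic for the index ordering ix, where the
# first sizes[1] entries of ix are treated as the first sample.
perm_decision <- function(stat_fun, sizes, alpha = 0.05, B = 199) {
  N <- sum(sizes)
  t0 <- stat_fun(seq_len(N))                       # observed statistic
  tperm <- replicate(B, stat_fun(sample.int(N)))   # statistics after relabeling
  # Assumes large values of the statistic are evidence against H0.
  p_value <- (1 + sum(tperm >= t0)) / (B + 1)
  list(statistic = t0, p.value = p_value, reject = p_value <= alpha)
}

# Toy usage with a univariate stand-in statistic.
set.seed(1)
z <- c(rnorm(20), rnorm(20, mean = 3))
sizes <- c(20, 20)
mean_diff <- function(ix) {
  abs(mean(z[ix[1:sizes[1]]]) - mean(z[ix[(sizes[1] + 1):sum(sizes)]]))
}
res <- perm_decision(mean_diff, sizes, alpha = 0.05, B = 199)
```

Adding 1 to both the numerator and the denominator counts the observed ordering as one of the permutations, so the p-value can never be exactly zero.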

• For the nearest neighbor test, you need to implement NNT(z, ix, sizes, R) and NNT.perm(z, sizes, R, alpha, B), where:

– z is the data matrix (each row is an observation),

– ix is a permutation of the row indices of z,

– sizes is a vector of sample sizes,

– R is the number of nearest neighbors,

– alpha is the significance level,

– and B is the number of replicates in the permutation test.

NNT(z, ix, sizes, R) returns the value of the test statistic and NNT.perm(z, sizes, R, alpha, B) returns the decision on whether the null hypothesis is rejected.

• For the energy distance test, you need to implement EBT(dst, ix, sizes) and EBT.perm(dst, sizes, alpha, B), where:

– dst is the distance matrix of the data set,

– ix is a permutation of the row indices of dst,

– sizes is a vector of sample sizes,

– alpha is the significance level,

– and B is the number of replicates in the permutation test.

EBT(dst, ix, sizes) returns the value of the test statistic and EBT.perm(dst, sizes, alpha, B) returns the decision on whether the null hypothesis is rejected.

• For Hotelling’s T-square test, you need to implement HTT(z, ix, sizes) and HTT.perm(z, sizes, alpha, B), where:

– z is the data matrix (each row is an observation),

– ix is a permutation of the row indices of z,

– sizes is a vector of sample sizes,

– alpha is the significance level,

– and B is the number of replicates in the permutation test.

HTT(z, ix, sizes) returns the value of the test statistic and HTT.perm(z, sizes, alpha, B) returns the decision on whether the null hypothesis is rejected.

• For the graph-based two-sample test, you need to implement GST(dst, ix, sizes, Q) and GST.perm(dst, sizes, Q, alpha, B), where:

– dst is the distance matrix of the data set,

– ix is a permutation of the row indices of dst,

– sizes is a vector of sample sizes,

– Q is the threshold used in the graph construction,

– alpha is the significance level,

– and B is the number of replicates in the permutation test.

GST(dst, ix, sizes, Q) returns the value of the test statistic and GST.perm(dst, sizes, Q, alpha, B) returns the decision on whether the null hypothesis is rejected.
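To illustrate the (z, ix, sizes, R) calling convention used in the list above, here is a minimal sketch of a nearest neighbor statistic. It assumes Euclidean distance, two samples, and the convention that the first sizes[1] entries of ix identify the first sample; the version discussed in lecture may differ in details such as tie handling.

```r
# Sketch of a nearest neighbor statistic: the proportion of (point, neighbor)
# pairs, over the R nearest neighbors of every point, that share a sample.
NNT <- function(z, ix, sizes, R) {
  z <- as.matrix(z)[ix, , drop = FALSE]   # reorder rows by the permutation
  n1 <- sizes[1]
  N <- sum(sizes)
  D <- as.matrix(dist(z))                 # Euclidean distance (an assumption)
  diag(D) <- Inf                          # a point is not its own neighbor
  # Column i of nn holds the indices of the R nearest neighbors of row i.
  nn <- vapply(seq_len(N), function(i) order(D[i, ])[seq_len(R)], integer(R))
  nn <- matrix(nn, nrow = R)              # keeps a matrix shape when R = 1
  # After reordering, rows 1..n1 play the role of the first sample.
  same <- (nn <= n1) == (col(nn) <= n1)
  mean(same)
}

set.seed(428)
sizes <- c(20, 20)
z <- rbind(matrix(rnorm(20 * 2), ncol = 2),
           matrix(rnorm(20 * 2, mean = 10), ncol = 2))
t_obs  <- NNT(z, seq_len(40), sizes, R = 3)   # original labels
t_perm <- NNT(z, sample.int(40), sizes, R = 3) # one random relabeling
```

With two well-separated samples, nearly every neighbor shares its point's label, so t_obs should be close to 1, while a random relabeling typically gives a value near one half.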

Question 2 Choice of Tuning Parameter and Distance (10 points)

Several parts in these four tests can be customized. In this question, you need to use simulation experiments

to make recommendations for the choices of tuning parameters and distances. Specifically, we consider the

following tuning parameters and distances:

• In the nearest neighbor test, how should we choose the number of nearest neighbors R?

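• In the energy distance test, how should we choose the specific form of distance? For example, we may use the $L_p$ distance

$$\|X - Y\|_{\ell_p} = \left( \sum_{i=1}^{d} |X_i - Y_i|^p \right)^{1/p},$$

where $X_i$ and $Y_i$ here denote the $i$-th coordinates of the vectors $X$ and $Y$. What $p$ should we use?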

• In the graph-based two-sample test, how should we choose the threshold Q and the specific form of

distance?

You need to show some numerical experiments as your evidence.
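As a starting point for such experiments, the sketch below builds $L_p$ distance matrices with base R's dist() and evaluates a two-sample energy statistic for several values of p. The helper names lp_matrix and energy_stat are hypothetical, the form of the energy statistic is assumed to match the one from lecture, and the sample sizes, dimension, and mean shift are arbitrary illustrative choices.

```r
# Pairwise L_p (Minkowski) distance matrix; rows of z are observations.
lp_matrix <- function(z, p) as.matrix(dist(z, method = "minkowski", p = p))

# Two-sample energy statistic from a distance matrix, where the first
# sizes[1] entries of ix index the first sample (assumed convention).
energy_stat <- function(dst, ix, sizes) {
  n1 <- sizes[1]; n2 <- sizes[2]
  ii <- ix[seq_len(n1)]
  jj <- ix[n1 + seq_len(n2)]
  m11 <- mean(dst[ii, ii])   # average within-sample distance, sample 1
  m22 <- mean(dst[jj, jj])   # average within-sample distance, sample 2
  m12 <- mean(dst[ii, jj])   # average between-sample distance
  (n1 * n2 / (n1 + n2)) * (2 * m12 - m11 - m22)
}

set.seed(428)
sizes <- c(30, 30)
z <- rbind(matrix(rnorm(30 * 5), ncol = 5),
           matrix(rnorm(30 * 5, mean = 0.5), ncol = 5))
ps <- c(1, 2, 4)
stats <- sapply(ps, function(p) energy_stat(lp_matrix(z, p), seq_len(60), sizes))
names(stats) <- paste0("p=", ps)
```

The raw statistics are not comparable across p because the distances live on different scales; a fair comparison would look at rejection rates from the permutation test for each p.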

Question 3 Test Comparisons (15 points)

In this question, you are required to use simulation experiments to make recommendations for the choice of

these four two-sample tests. In particular, you need to answer the following questions:

• Which test is more suitable for low-dimensional data (i.e., d is small)? Which is better for high-dimensional data (i.e., d is large)?

• Which test is more sensitive to different choices of the specific distribution of F or G?

• Are these tests able to control the type I error?

• Which test is more powerful?

• Which test is more computationally efficient?
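Type I error and power can both be estimated as Monte Carlo rejection rates: generate data under F = G for the former and under F ≠ G for the latter. The sketch below shows that pattern. The helpers rejection_rate and mean_perm_test are hypothetical, and mean_perm_test is only a stand-in decision rule so the example runs on its own; in practice the four .perm functions would take its place. Sample sizes, dimension, and the mean shift are arbitrary.

```r
# Hypothetical helper: Monte Carlo estimate of a test's rejection rate.
# decide(x, y) must return TRUE when H0 is rejected.
rejection_rate <- function(decide, gen_x, gen_y, n_sim = 200) {
  mean(replicate(n_sim, decide(gen_x(), gen_y())))
}

# Stand-in decision rule: a permutation test on the Euclidean distance
# between the two sample mean vectors.
mean_perm_test <- function(x, y, alpha = 0.05, B = 99) {
  z <- rbind(x, y); n <- nrow(x); N <- nrow(z)
  stat <- function(ix) {
    sqrt(sum((colMeans(z[ix[1:n], , drop = FALSE]) -
              colMeans(z[ix[(n + 1):N], , drop = FALSE]))^2))
  }
  t0 <- stat(seq_len(N))
  tperm <- replicate(B, stat(sample.int(N)))
  (1 + sum(tperm >= t0)) / (B + 1) <= alpha
}

set.seed(428)
d <- 2
null_gen <- function() matrix(rnorm(15 * d), ncol = d)              # draws from F
alt_gen  <- function() matrix(rnorm(15 * d, mean = 1.5), ncol = d)  # shifted G

# Both samples from F: the rejection rate estimates the type I error.
type1 <- rejection_rate(mean_perm_test, null_gen, null_gen, n_sim = 100)
# Second sample from the shifted G: the rejection rate estimates the power.
power <- rejection_rate(mean_perm_test, null_gen, alt_gen, n_sim = 100)

# Rough timing of a single decision, useful for comparing computational cost.
elapsed <- system.time(mean_perm_test(null_gen(), alt_gen()))[["elapsed"]]
```

Repeating this over a grid of dimensions d and alternative distributions gives the evidence needed for the questions above; n_sim and B should be increased for stable estimates.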

Question 4 Application to Real Data Set (10 points)

We’re going to look at a data set on 97 men who have prostate cancer (from the book The Elements of

Statistical Learning). There are 10 variables measured on these 97 men:

1. lpsa: log PSA score

2. lcavol: log cancer volume

3. lweight: log prostate cancer weight

4. age: age of patient

5. lbph: log of the amount of benign prostatic hyperplasia

6. svi: seminal vesicle invasion

7. lcp: log of capsular penetration

8. gleason: Gleason score

9. pgg45: percent of Gleason scores 4 or 5

10. train: if belonging to training data set

To load this prostate cancer data set and store it as the data frame pros.data, we can run the following:

pros.data = read.table("https://web.stanford.edu/~hastie/ElemStatLearn/datasets/prostate.data")

Based on this data set, we are interested in whether there is an overall difference between men older than 65 (age>65) and men aged 65 or younger (age<=65). The variables of interest are lpsa, lweight, lcp and lbph. We can split the data set into two parts:

X=pros.data[pros.data$age>65, c('lpsa','lweight','lcp','lbph')]

Y=pros.data[pros.data$age<=65, c('lpsa','lweight','lcp','lbph')]

Then, you can apply these four tests (with the best choice of tuning parameters and distances) to X and Y. What conclusion can you draw?

Question 5 Report (20 points)

A pharmaceutical company would like to test whether the effects of two treatments are similar or not. The manager wants to choose one two-sample testing method among the four mentioned above, and asks for your opinion on choosing the best two-sample test. Could you prepare a report to provide some suggestions to this manager? In this report, you need to summarize all your findings in Questions 1-4. The report should be limited to two pages. You are encouraged to use figures to deliver your messages. The manager who reads your report has only a minimal statistical background, so you may want to avoid technical terminology.

Question 6 Presentation and Slides (30 points)

Based on your report, could you prepare a 3-5 minute presentation to summarize your findings and suggestions? Assume your audience is the manager from this pharmaceutical company, who has only a very limited statistical background. Try to avoid technical terminology. In this question, you need to submit a video (I need to see you in this video) and your slides (you need to use R Markdown and submit both the Rmd and pdf files).

Question 7 R package (Bonus question: extra 10 points for the final project, and

the total points of the final project may not exceed 100 points)

Could you prepare an R package that includes all four of your two-sample testing methods, together with a manual that explains how to use them? To complete this question, you need to submit a compressed R package.

Submission Check List

• A report for Questions 1-4 (Rmd and pdf), which can be long and technical.

• A short report for Question 5 (Rmd and pdf), which is limited to two pages.

• A video presentation (I need to see you in this video).

• Presentation slides for Question 6 (Rmd and pdf).

• A compressed R package for Question 7 (optional).
