
Research School of Finance, Actuarial Studies & Statistics

EXAMINATION

Semester 2 - Final, 2020

STAT3017/STAT7017 Big Data Statistics

Examination Duration: 48 Hours

Reading Time: 0 Minutes

Exam Conditions:

• This is a take home exam.

• All material is allowed.

• You should not communicate with other students about solutions.

Materials Permitted In The Exam Venue:

N/A

Materials to Be Supplied to Students:

• An RMarkdown notebook with some example code.

Instructions To Students:

This exam contains 5 pages in total:

• Cover page (1 page)

• Final Exam (4 pages)

This ‘Final Exam’ has 4 questions for a total of 25 marks for STAT7017 students and 3 questions for a total of 20 marks for STAT3017 students.

Please attempt each question in the RMarkdown notebook and (if possible) return a ‘Knit’ed version of your answers as a PDF uploaded to Wattle.


Question 1 [5 marks]

We start with some short theory questions:

(a) [1 mark] Suppose we had 250-dimensional observations $x_1, x_2, \ldots, x_{1000}$ that have a multivariate normal distribution with mean zero and covariance $2I_p$. From these observations we construct the sample covariance matrix $S$ and plot a histogram of the eigenvalues of $S$. What distribution will approximate the density of the eigenvalues? And with what parameters?
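As a rough numerical companion to part (a), the sketch below evaluates a Marchenko-Pastur density with a scale parameter and checks that it integrates to one. It is written in Python for illustration (the exam itself expects R), and it assumes the usual parameterisation with ratio y = p/n and scale sigma2; the function names `mp_support`, `mp_density` and `integrate` are hypothetical helpers, not part of the provided notebook.

```python
import math

def mp_support(y, sigma2=1.0):
    """Support edges sigma2 * (1 -/+ sqrt(y))^2 of the Marchenko-Pastur law (0 < y < 1)."""
    root = math.sqrt(y)
    return sigma2 * (1 - root) ** 2, sigma2 * (1 + root) ** 2

def mp_density(x, y, sigma2=1.0):
    """Marchenko-Pastur density with ratio y and scale sigma2 (assumed parameterisation)."""
    a, b = mp_support(y, sigma2)
    if x <= a or x >= b:
        return 0.0
    return math.sqrt((b - x) * (x - a)) / (2 * math.pi * sigma2 * y * x)

def integrate(f, lo, hi, steps=100_000):
    """Plain midpoint rule; accurate enough for a sanity check."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (k + 0.5) * h) for k in range(steps))

# Values taken from part (a): p = 250, n = 1000, covariance 2 * I_p.
p, n, sigma2 = 250, 1000, 2.0
y = p / n
a, b = mp_support(y, sigma2)
total_mass = integrate(lambda x: mp_density(x, y, sigma2), a, b)
mean = integrate(lambda x: x * mp_density(x, y, sigma2), a, b)
print(a, b, total_mass, mean)
```

The same density can be overlaid on a histogram of simulated eigenvalues to see how close the approximation is for finite p and n.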

(b) [1 mark] What does the Fisher limiting spectral distribution (LSD) $P_{s,t}(x)$ describe? And why does the left endpoint of this distribution converge to $\frac{1}{4}(1 - s)^2$ as $t \to 1$?
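The limit in part (b) can be checked numerically. This minimal Python sketch assumes the common parameterisation in which the left endpoint of the Fisher LSD is $(1-h)^2/(1-t)^2$ with $h^2 = s + t - st$; that parameterisation is an assumption here and should be checked against the course notes. The helper name `fisher_left_edge` is hypothetical.

```python
import math

def fisher_left_edge(s, t):
    """Left endpoint (1 - h)^2 / (1 - t)^2 with h^2 = s + t - s*t (assumed form)."""
    h = math.sqrt(s + t - s * t)
    return ((1 - h) / (1 - t)) ** 2

s = 0.5
limit = (1 - s) ** 2 / 4  # the claimed limit as t -> 1
edges = [(t, fisher_left_edge(s, t)) for t in (0.9, 0.99, 0.999)]
for t, edge in edges:
    print(f"t = {t}: left edge = {edge:.6f}, claimed limit = {limit:.6f}")
```

Writing $1 - h = (1 - h^2)/(1 + h)$ and noting $1 - h^2 = (1-s)(1-t)$ shows where the factor $(1-t)^2$ cancels, which is the algebra the numbers above illustrate.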

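(c) [2 marks] Consider a sequence of two-dimensional random vectors $(x_i)_{i \ge 0}$ with $x_i := (x_{i1}, x_{i2})'$ drawn from the bivariate normal distribution

$$x_i \sim N_2\left( \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix} \right).$$

What is the asymptotic distribution of $X := x_{n1} + x_{n2}^2$ as $n \to \infty$?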

(d) [1 mark] Why can we associate a distribution to the largest eigenvalue $\lambda_1$ of a sample covariance matrix $S$?

Question 2 [6 marks]

In this question, we are going to consider the topic of factor analysis where the aim is to describe the covariance relationships among many variables in terms of a few underlying, but unobservable, random quantities called factors. Consider $n$ daily returns from 1 Jan 2018 to 1 Jan 2019 for $p = 11$ stocks: BHP, RIO, ANZ, NAB, CBA, WBC, GXY, NUF, CGC, CGF, WSA. You can use the Rmd file I’ve provided to download this data.

Now implement the “Principal Component Method” (PCM) found in Section 9.3 of [A] using the correlation matrix $R$ of the daily returns. That is:

(a) [1 mark] Determine the number of factors $m$ using a screeplot and print out $\tilde{L}$ as a table showing stock names as row labels and factor numbers as column headers.

(b) [1 mark] Print out the communalities $h_i^2$, the estimated specific variances $\tilde{\psi}_i$, and the proportion of total sample variance due to the $j$th factor. Use these values to argue that you’ve made the correct choice of $m$.

Now implement the “Maximum Likelihood Method” (see p.495 in [A]) using the correlation matrix $R$ of the daily returns. For this part, you are allowed to use the inbuilt factanal command in R that implements the ML method by default.

(c) [1 mark] First, perform the Maximum Likelihood Method (MLM) without rotation using something like fit = factanal(daily.returns, factors = m, rotation="none", covmat=R) where m is the value of $m$ you found previously and R is the correlation matrix. Print out the loadings with print(fit) and compare them to those found using the principal component method. What do you notice? Can you give a label to one (or more) of these factors?


(d) [1 mark] We are now going to perform some factor rotations, see Section 9.4 in [A]. Perform a varimax orthogonal rotation using varimax(tilde.L) where tilde.L was derived using PCM. Now extract the MLM estimated loading matrix $\hat{L}$ from fit using hat.L = loadings(fit). Now perform a varimax orthogonal rotation using varimax(hat.L) and an oblique rotation using promax(hat.L). After rotation, do the loadings group the stocks in the same manner? Which rotation do you prefer? For your favourite rotation, can you give labels to the factors?

(e) [1 mark] Now perform a large sample test for the number of common factors by testing the hypothesis $H_0 : \Sigma = LL^T + \Psi$ with your choice of $m$ at level $\alpha = 0.05$. The test is given in Eq. (9-39) of [A] and, since we are using the correlation matrix $R$, it is based on the determinant of the matrix

$$R^{-1}\left(\hat{L}\hat{L}^T + \hat{\Psi}\right) \tag{1}$$

and uses a chi-square approximation to the sampling distribution. Implement this test in R. For your choice of $m$, do you accept or reject the null hypothesis?
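The arithmetic behind the test in part (e) can be sketched as follows, in Python for illustration only (the exam asks for R). The sketch assumes the Bartlett-corrected form of the statistic, $(n - 1 - (2p + 4m + 5)/6)\,\ln\det$ of matrix (1), compared against a chi-square distribution with $[(p-m)^2 - p - m]/2$ degrees of freedom. The function name `factor_lrt` and the log-determinant value used below are hypothetical placeholders, not results from the stock data.

```python
import math

def factor_lrt(log_det_ratio, n, p, m):
    """Bartlett-corrected likelihood ratio statistic and its degrees of freedom.

    log_det_ratio is ln det( R^{-1} (L L^T + Psi) ), i.e. the log-determinant
    of matrix (1); n is the number of observations, p the number of variables
    and m the number of factors.
    """
    bartlett = n - 1 - (2 * p + 4 * m + 5) / 6
    statistic = bartlett * log_det_ratio
    df = ((p - m) ** 2 - p - m) / 2
    return statistic, df

# Hypothetical inputs: 250 trading days, p = 11 stocks, m = 3 factors,
# and a made-up log-determinant of 0.1.
statistic, df = factor_lrt(log_det_ratio=0.1, n=250, p=11, m=3)
print(statistic, df)
```

In R, the critical value for comparison would come from qchisq(0.95, df). The degrees of freedom must be positive for the test to make sense, which limits how large $m$ can be for a given $p$.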

(f) [1 mark] Considering the theory we’ve learnt this semester, comment on the form of (1) and the use of the chi-square approximation for the sampling distribution in the situation where the number of stocks $p$ becomes large and $y_n := p/n = 0.5$. What might be a better alternative for this high-dimensional case?

Question 3 [9 marks]

We are now going to consider the theory of spiked Fisher matrices from the recent paper [B]. Consider two $p$-variate populations with covariance matrices $\Sigma_1$ and $\Sigma_2 = I_p$, and let $S_1$ and $S_2$ be the sample covariance matrices for samples of the two populations with degrees of freedom $m$ and $n$, respectively. We set $S := S_2^{-1} S_1$.

(a) [1 mark] Suppose we had $p$-dimensional random variables $x_1, \ldots, x_{m+1} \sim N_p(0, \Sigma_1)$ and $p$-dimensional random variables $z_1, \ldots, z_{n+1} \sim N_p(0, I_p)$. We stack these random variables to obtain the data matrices $X$ and $Z$ and sample covariance matrices

$$S_1 := \frac{1}{m} XX^T, \qquad S_2 := \frac{1}{n} ZZ^T, \qquad S := S_2^{-1} S_1.$$

Now assume $n, m, p \to \infty$ such that $y_p := p/n \to y \in (0, 1)$ and $c_p := p/m \to c > 0$. For $y = 1/2$ and $c = 1/4$, what is the upper bound of the limiting spectral distribution of $S$? [0.5 marks]. Plot the limiting spectral density of the eigenvalues of $S$ [0.5 marks].
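For the plotting half of part (a), a density function is needed. The Python sketch below (illustrative only; the exam expects R) assumes the standard form of the Fisher LSD density for $S_2^{-1} S_1$, namely $(1-y)\sqrt{(b-x)(x-a)} / (2\pi x (c + xy))$ on $[a, b]$ with $a, b = ((1 \mp h)/(1-y))^2$ and $h^2 = c + y - cy$. This form should be checked against [B] before relying on it; the function names are hypothetical.

```python
import math

def fisher_support(c, y):
    """Support edges ((1 -/+ h) / (1 - y))^2 with h^2 = c + y - c*y (assumed form)."""
    h = math.sqrt(c + y - c * y)
    return ((1 - h) / (1 - y)) ** 2, ((1 + h) / (1 - y)) ** 2

def fisher_density(x, c, y):
    """Assumed Fisher LSD density for S = S2^{-1} S1 with c = lim p/m, y = lim p/n."""
    a, b = fisher_support(c, y)
    if x <= a or x >= b:
        return 0.0
    return (1 - y) * math.sqrt((b - x) * (x - a)) / (2 * math.pi * x * (c + x * y))

c, y = 0.25, 0.5
a, b = fisher_support(c, y)

# Midpoint-rule check that the density carries total mass one on [a, b].
steps = 200_000
width = (b - a) / steps
total_mass = width * sum(fisher_density(a + (k + 0.5) * width, c, y) for k in range(steps))
print(a, b, total_mass)
```

Evaluating `fisher_density` on a grid between `a` and `b` gives the curve to overlay on a histogram of simulated eigenvalues.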

(b) [1 mark] Suppose that $\Sigma_1 = \Sigma_2 + \Delta$ where $\Delta = \operatorname{diag}(\overbrace{a_1, \ldots, a_1}^{n_1}, 0, \ldots, 0)$ and $a_1 > 0$, i.e., $\Sigma_2$ is perturbed by a rank $n_1$ diagonal matrix $\Delta$. What is the critical value $\kappa$ for which $a_1 > \kappa$ creates “outlier” sample eigenvalues? [0.5 marks]. Suppose that $a_1 = \kappa + 1$, $c = 2/3$ and $y = 1/3$, what value do you expect these outlier eigenvalues to cluster around? [0.5 marks].


(c) [1 mark] Continuing question (b), what would you expect to happen if $a_1$ was only slightly larger than 1 (and less than $\kappa$)?

(d) [1 mark] Perform a simulation experiment to illustrate the phenomena in (b) in the case $\Sigma_2 = I_p$. That is, sample data and plot a histogram of eigenvalues of $S$ and compare it to the density obtained in (a). Can you see outlier eigenvalues?

(e) [1 mark] Perform a simulation experiment to empirically calculate the power of the method proposed in Section 7.1 of [B].

(f) [1 mark] Compare the results of your simulation experiment to the closed-form formula given in Theorem 7.1 of [B].

(g) [2 marks] Consider the signal detection problem where we are trying to determine the number of signals in observations of the form

$$x_i = U s_i + \varepsilon_i, \qquad i = 1, \ldots, m, \tag{SD}$$

where the $x_i$’s are $p$-dimensional observations, $s_i$ is a $k \times 1$ low-dimensional signal ($k \ll p$) with covariance $I_k$, $U$ is a $p \times k$ mixing matrix, and $(\varepsilon_i)$ is an i.i.d. noise with covariance matrix $\Sigma_2$. None of the quantities on the right-hand side of (SD) are observed. In [B], they propose to estimate the number of signals $k$ by

$$\hat{k} := \max\{ i : \lambda_i \ge \beta + \log(p/p^{2/3}) \},$$

where $(\lambda_i)$ are the eigenvalues of $S$. Reproduce Table 1 in [B] for the Gaussian case for values $p = 25, 75, 125, 175, 225, 275$.
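The thresholding rule for $\hat{k}$ can be sketched in a few lines of Python (illustrative only; the exam expects R). The sketch takes the threshold exactly as printed above and treats $\beta$ as a user-supplied number, since its definition is given in [B] rather than here. The function name `estimate_num_signals` and the eigenvalues in the example are hypothetical.

```python
import math

def estimate_num_signals(eigenvalues, beta, p):
    """Largest index i (1-based, eigenvalues sorted in decreasing order)
    such that lambda_i >= beta + log(p / p^(2/3)); returns 0 if none qualifies."""
    threshold = beta + math.log(p / p ** (2 / 3))
    ordered = sorted(eigenvalues, reverse=True)
    k_hat = 0
    for i, lam in enumerate(ordered, start=1):
        if lam >= threshold:
            k_hat = i
    return k_hat

# Made-up eigenvalues and a made-up beta, purely to exercise the rule.
example_eigenvalues = [30.1, 18.4, 11.9, 6.2, 0.4]
k_hat = estimate_num_signals(example_eigenvalues, beta=12.8, p=25)
print(k_hat)
```

Because the eigenvalues are sorted in decreasing order, the largest qualifying index equals the number of eigenvalues at or above the threshold.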

(h) [1 mark] Comment on how the methods and theory considered in Question 3 might apply to Question 2.

Question 4 [5 marks]

(STAT7017 students only) We are now going to consider a high-dimensional dataset of hyperspectral data collected by the AVIRIS mission (https://aviris.jpl.nasa.gov/). It is a unique optical sensor that delivers calibrated images of spectral radiance in $p = 224$ contiguous spectral channels (bands) with wavelengths from 400 to 2500 nanometers. The main objective of the AVIRIS project is to identify, measure, and monitor constituents of the Earth’s surface based on molecular absorption and particle scattering signatures.

We are going to consider preprocessed data where the radiance values have been converted to surface reflectance (i.e., percentage of light reflected at that wavelength: 0 = 0% reflected and 1 = 100% reflected) and a few bands have been removed so that $p = 188$. A small spatial region of 250 × 190 pixels outside of Las Vegas is considered. Each pixel represents roughly a 5 m² area on the ground. This gives us a datacube of dimension (rows, cols, p) = (250, 190, 188).

(a) [1 mark] Consider the principal component analysis (PCA) of this data to reduce it down to dimension $p = 3$. Compare the four results obtained from all combinations of the following options: (1) data has been de-meaned and not, (2) the covariance vs. the correlation matrix is used.

Often real-world datasets (such as this one!) exhibit data with heavier tails than the Gaussian distribution and this motivated researchers to introduce alternative versions of classic statistics that are less sensitive to outliers. One of these is Kendall’s $\tau$ that replaces the covariance between two random variables with something more robust. This idea can be extended to the multivariate setting as follows: Let $x_1, \ldots, x_n$ be independent copies of a random vector $x \in \mathbb{R}^p$ with coordinates $x = (x^{(1)}, x^{(2)}, \ldots, x^{(p)})^T$. Kendall’s $\tau$ matrix $T := (\tau_{k\ell})$ has entries given by

$$\tau_{k\ell} := \frac{1}{\binom{n}{2}} \sum_{1 \le i < j \le n} \operatorname{sign}\left(x_i^{(k)} - x_j^{(k)}\right) \operatorname{sign}\left(x_i^{(\ell)} - x_j^{(\ell)}\right), \qquad 1 \le k, \ell \le p. \tag{2}$$
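To make definition (2) concrete, here is a direct, deliberately slow Python sketch of the pairwise-sign computation. It assumes the usual form of Kendall's $\tau$ with differences inside the sign functions and a normalisation by the number of pairs $\binom{n}{2}$. The function name `kendall_tau_matrix` and the toy data are hypothetical; for real data a fast implementation such as cor.fk is far preferable.

```python
from itertools import combinations

def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def kendall_tau_matrix(rows):
    """Kendall's tau matrix from n observations (rows) of p coordinates each.

    Each entry averages sign(x_i^(k) - x_j^(k)) * sign(x_i^(l) - x_j^(l))
    over all pairs i < j. Cost is O(n^2 p^2), so this is for illustration only.
    """
    n = len(rows)
    p = len(rows[0])
    pairs = n * (n - 1) / 2
    tau = [[0.0] * p for _ in range(p)]
    for xi, xj in combinations(rows, 2):
        signs = [sign(u - v) for u, v in zip(xi, xj)]
        for k in range(p):
            for l in range(p):
                tau[k][l] += signs[k] * signs[l]
    return [[value / pairs for value in row] for row in tau]

# Toy data: coordinates 1 and 2 move together, coordinate 3 mostly moves against them.
data = [(1.0, 2.0, 9.0), (2.0, 4.0, 7.0), (3.0, 5.0, 8.0), (4.0, 9.0, 1.0)]
T = kendall_tau_matrix(data)
for row in T:
    print(row)
```

With no ties in the data, the diagonal of `T` is exactly one, and perfectly concordant coordinates give an entry of one.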

The matrix $T$ is a popular replacement for the correlation matrix $R$. This begs the question: how do the eigenvalues of $T$ behave when $n, p \to \infty$ such that $p/n \to y \in (0, 1)$?

(b) [1 mark] Recently it was shown in [C] that the empirical spectral distribution of $T$ converges in probability to

$$\frac{2}{3} Y + \frac{1}{3}, \tag{3}$$

where $Y$ is distributed according to the standard Marchenko-Pastur distribution with parameter $y$. Perform a simulation experiment to plot a histogram of empirical eigenvalues of $T$ compared to the density of (3) in the case $y = 0.5$. Use the function cor.fk from the pcaPP package for a fast implementation of (2).
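The density of (3) follows from a change of variables: if $W = \frac{2}{3}Y + \frac{1}{3}$ then $Y = (3W - 1)/2$ and the density of $W$ at $x$ is $\frac{3}{2}$ times the Marchenko-Pastur density at $(3x - 1)/2$. The Python sketch below (illustrative only; the exam expects R) assumes the standard Marchenko-Pastur density with unit scale; the function names are hypothetical.

```python
import math

def mp_density(x, y):
    """Standard Marchenko-Pastur density with ratio y in (0, 1) and unit scale."""
    a, b = (1 - math.sqrt(y)) ** 2, (1 + math.sqrt(y)) ** 2
    if x <= a or x >= b:
        return 0.0
    return math.sqrt((b - x) * (x - a)) / (2 * math.pi * y * x)

def kendall_lsd_density(x, y):
    """Density of (2/3) Y + 1/3 obtained by a change of variables."""
    return 1.5 * mp_density((3 * x - 1) / 2, y)

y = 0.5
lower = 2 / 3 * (1 - math.sqrt(y)) ** 2 + 1 / 3
upper = 2 / 3 * (1 + math.sqrt(y)) ** 2 + 1 / 3

# Midpoint-rule check that the transformed density still integrates to one.
steps = 100_000
width = (upper - lower) / steps
total_mass = width * sum(
    kendall_lsd_density(lower + (k + 0.5) * width, y) for k in range(steps)
)
print(lower, upper, total_mass)
```

The interval from `lower` to `upper` is where the bulk of the simulated eigenvalues of $T$ would be expected to fall if the limit in (3) is a good approximation.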

(c) [1 mark] Perform a factor analysis with $m = 3$ factors on the AVIRIS data where Kendall’s $\tau$ matrix $T$ is used instead of the correlation matrix $R$. Use the fa function from the psych package to do this as fit1 = fa(rR, nfactors = 3, rotate="none") where rR is the $T$ matrix. Compare no rotation to the "geominT" and "geominQ" rotations.

(d) [1 mark] Do you think that $m = 3$ is the appropriate number of factors for this AVIRIS dataset? Can you label the 3 factors for your favourite choice of rotation?

(e) [1 mark] How would you test for the appropriate number of factors?

Notation

$I_p$: Identity matrix of size $p \times p$.

$N_p(0, \Sigma)$: $p$-dimensional multivariate Normal distribution with (vector) mean 0 and covariance $\Sigma$.

References

[A] Johnson, Wichern (2007). Applied Multivariate Statistical Analysis. Pearson Prentice Hall.

[B] Wang, Yao (2017). Extreme eigenvalues of large-dimensional spiked Fisher matrices with application. Annals of Statistics, Vol 45, No. 1.

[C] Bandeira, Lodhia, Rigollet (2017). Marchenko-Pastur law for Kendall’s tau. Electronic Communications in Probability, Vol 22, No. 32.

END OF EXAMINATION
