STAT3017/STAT7017 Big Data Statistics

Research School of Finance, Actuarial Studies & Statistics

EXAMINATION

Semester 2 - Final, 2020

Examination Duration: 48 Hours

Reading Time: 0 Minutes

Exam Conditions:

• This is a take home exam.

• All material is allowed.

• You should not communicate with other students about solutions.

Materials Permitted In The Exam Venue:

N/A

Materials to Be Supplied to Students:

• A RMarkdown notebook with some example code.

Instructions To Students:

This exam contains 5 pages in total:

• Cover page (1 page)

• Final Exam (4 pages)

This 'Final Exam' has 4 questions for a total of 25 marks for STAT7017 students and 3 questions for a total of 20 marks for STAT3017 students.

Please attempt each question in the RMarkdown notebook and (if possible) return a knitted version of your answers as a PDF uploaded to Wattle.

Semester 2 - Final Exam, 2020 STAT3017/STAT7017 Big Data Statistics


STAT3017/STAT7017 Semester 2 - Final Exam, 2020

Question 1 [5 marks]

We start with some short theory questions:

(a) [1 mark] Suppose we had 250-dimensional observations $x_1, x_2, \ldots, x_{1000}$ that have a multivariate normal distribution with mean zero and covariance $2I_p$. From these observations we construct the sample covariance matrix $S$ and plot a histogram of the eigenvalues of $S$. What distribution will approximate the density of the eigenvalues? And with what parameters?
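As a starting point for part (a), the setting can be simulated directly. The following is only a minimal sketch in base R: the seed, the number of histogram bins, and the choice to divide by n (treating the mean as known to be zero) are assumptions made for illustration, and the limiting density is deliberately not overlaid.

```r
# Sketch: simulate the setting of part (a) and inspect the eigenvalues of S.
set.seed(1)                       # arbitrary seed, for reproducibility only
p <- 250                          # dimension of each observation
n <- 1000                         # number of observations

# Rows are observations; every coordinate is N(0, 2), i.e. covariance 2 * I_p.
X <- matrix(rnorm(n * p, mean = 0, sd = sqrt(2)), nrow = n, ncol = p)

# p x p sample covariance matrix (assumption: mean known to be zero, divide by n).
S <- crossprod(X) / n

# Eigenvalues only; S is symmetric by construction.
ev <- eigen(S, symmetric = TRUE, only.values = TRUE)$values

hist(ev, breaks = 40, freq = FALSE,
     main = "Eigenvalues of S", xlab = "eigenvalue")
```

The histogram is drawn on the density scale (`freq = FALSE`) so that a candidate limiting density can be added on top with `curve(..., add = TRUE)`.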

(b) [1 mark] What does the Fisher limiting spectral distribution (LSD) $P_{s,t}(x)$ describe? And why does the left endpoint of this distribution converge to $\frac{1}{4}(1 - s)^2$ as $t \to 1$?

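(c) [...] $x_i := (x_{i1}, x_{i2})'$ drawn from the bivariate normal distribution

$$x_i \sim N_2\left( \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix} \right).$$

What is the asymptotic distribution of $X := x_{n1} + x_{n2}^{2}$ as $n \to \infty$?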

(d) [1 mark] Why can we associate a distribution to the largest eigenvalue $\lambda_1$ of a sample covariance matrix $S$?

Question 2 [6 marks]

In this question, we are going to consider the topic of factor analysis where the aim is to describe the covariance relationships among many variables in terms of a few underlying, but unobservable, random quantities called factors. Consider $n$ daily returns from 1 Jan 2018 to 1 Jan 2019 for $p = 11$ stocks: BHP, RIO, ANZ, NAB, CBA, WBC, GXY, NUF, CGC, CGF, WSA. You can use the Rmd file I've provided to download this data.

Now implement the "Principal Component Method" (PCM) found in Section 9.3 of [A] using the correlation matrix $R$ of the daily returns. That is:

(a) [1 mark] Determine the number of factors $m$ using a screeplot and print out $\tilde{L}$ as a table showing stock names as row labels and factor numbers as column headers.

(b) [1 mark] Print out the communalities $\tilde{h}_i^2$, the estimated specific variances $\tilde{\psi}_i$, and the proportion of total sample variance due to the $j$th factor. Use these values to argue that you've made the correct choice of $m$.
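The quantities in parts (a) and (b) all come from the eigendecomposition of the correlation matrix. The sketch below shows the mechanics in base R under stated assumptions: the data are simulated placeholders (not the stock returns), the variable names `V1`, ..., `V6` and factor labels `F1`, `F2` are made up, and `m = 2` is fixed by hand rather than read off a scree plot.

```r
# Sketch of the principal component method on a correlation matrix.
set.seed(2)                                   # arbitrary seed
n <- 250; p <- 6
f <- rnorm(n)                                 # one hidden common factor (placeholder)
X <- sapply(1:p, function(j) 0.7 * f + rnorm(n))
colnames(X) <- paste0("V", 1:p)               # hypothetical variable names

R <- cor(X)
e <- eigen(R, symmetric = TRUE)

m <- 2                                        # assumed number of factors
# Loadings: column j is sqrt(lambda_j) times the j-th eigenvector.
L.tilde <- e$vectors[, 1:m, drop = FALSE] %*% diag(sqrt(e$values[1:m]), m)
dimnames(L.tilde) <- list(colnames(X), paste0("F", 1:m))

h2   <- rowSums(L.tilde^2)                    # communalities
psi  <- 1 - h2                                # specific variances (diag(R) = 1)
prop <- e$values[1:m] / p                     # share of total variance per factor

print(round(cbind(L.tilde, h2 = h2, psi = psi), 3))
print(round(prop, 3))
```

With the real data, `X` would be the matrix of daily returns, and a call such as `plot(e$values, type = "b")` gives a basic scree plot for choosing `m`.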

Now implement the "Maximum Likelihood Method" (see p.495 in [A]) using the correlation matrix $R$ of the daily returns. For this part, you are allowed to use the inbuilt factanal command in R that implements the ML method by default.

(c) [...] rotation using something like fit = factanal(daily.returns, factors = m, rotation="none", covmat=R) where m is the value of $m$ you found previously and R is the correlation matrix. Print out the loadings with print(fit) and compare them to those found using the principal component method. What do you notice? Can you give a label to one (or more) of these factors?


(d) [1 mark] We are now going to perform some factor rotations, see Section 9.4 in [A]. Perform a varimax orthogonal rotation using varimax(tilde.L) where tilde.L was derived using PCM. Now extract the MLM estimated loading matrix $\hat{L}$ from fit using hat.L = loadings(fit). Now perform a varimax orthogonal rotation using varimax(hat.L) and an oblique rotation using promax(hat.L). After rotation, do the loadings group the stocks in the same manner? Which rotation do you prefer? For your favourite rotation, can you give labels to the factors?

(e) [1 mark] Now perform a large sample test for the number of common factors by testing the hypothesis $H_0 : \Sigma = LL^T + \Psi$ with your choice of $m$ at level $\alpha = 0.05$. The test is given in Eq. (9-39) of [A] and, since we are using the correlation matrix $R$, it is based on the determinant of the matrix

$$R^{-1}(\hat{L}\hat{L}^T + \hat{\Psi}) \tag{1}$$

and uses a chi-square approximation to the sampling distribution. Implement this test in R. For your choice of $m$, do you accept or reject the null hypothesis?
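One way to organise the computation is a small helper that takes the estimated loadings and specific variances and returns the statistic, its degrees of freedom, and a p-value. The sketch below assumes the Bartlett-corrected form of the statistic, $(n - 1 - (2p + 4m + 5)/6)\,\ln\det\big(R^{-1}(\hat{L}\hat{L}^T + \hat{\Psi})\big)$ with $[(p-m)^2 - p - m]/2$ degrees of freedom; check this against Eq. (9-39) in [A] before relying on it. The function name `lrt_factors` and the toy inputs are hypothetical.

```r
# Sketch: large-sample test for the number of common factors.
# Assumes the Bartlett-corrected likelihood ratio statistic (verify against [A]).
lrt_factors <- function(R, L, Psi, n, alpha = 0.05) {
  p <- nrow(R)
  m <- ncol(L)
  fitted <- L %*% t(L) + Psi
  # log det(R^{-1} (L L' + Psi)) = log|L L' + Psi| - log|R|
  log_ratio <- as.numeric(determinant(fitted, logarithm = TRUE)$modulus -
                          determinant(R, logarithm = TRUE)$modulus)
  stat <- (n - 1 - (2 * p + 4 * m + 5) / 6) * log_ratio
  df <- ((p - m)^2 - p - m) / 2
  list(statistic = stat,
       df = df,
       critical = qchisq(1 - alpha, df),
       p.value = pchisq(stat, df, lower.tail = FALSE))
}

# Toy check: a correlation matrix that satisfies the one-factor model exactly.
L   <- matrix(c(0.8, 0.7, 0.6, 0.5, 0.4, 0.3), ncol = 1)
Psi <- diag(1 - as.numeric(L^2))
R   <- L %*% t(L) + Psi
res <- lrt_factors(R, L, Psi, n = 250)
print(res)
```

With a factanal fit, `L` could be taken as `unclass(loadings(fit))` and `Psi` as `diag(fit$uniquenesses)`; the null hypothesis is rejected when `statistic` exceeds `critical`.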

(f) [1 mark] Considering the theory we've learnt this semester, comment on the form of (1) and the use of the chi-square approximation for the sampling distribution for the situation where the number of stocks $p$ became large and $y_n := p/n = 0.5$. What might be a better alternative for this high-dimensional case?

Question 3 [9 marks]

We are now going to consider the theory of spiked Fisher matrices from the recent paper [B]. Consider two $p$-variate populations with covariance matrices $\Sigma_1$ and $\Sigma_2 = I_p$, and let $S_1$ and $S_2$ be the sample covariance matrices for samples of the two populations with degrees of freedom $m$ and $n$, respectively. We set $S := S_2^{-1} S_1$.

(a) [1 mark] Suppose we had $p$-dimensional random variables $x_1, \ldots, x_{m+1} \sim N_p(0, \Sigma_1)$ and $p$-dimensional random variables $z_1, \ldots, z_{n+1} \sim N_p(0, I_p)$. We stack these random variables to obtain the data matrices $X$ and $Z$ and sample covariance matrices

$$S_1 := \frac{1}{m} XX^T, \qquad S_2 := \frac{1}{n} ZZ^T, \qquad S := S_2^{-1} S_1.$$

Now assume $n, m, p \to \infty$ such that $y_p := p/n \to y \in (0, 1)$ and $c_p := p/m \to c > 0$. For $y = 1/2$ and $c = 1/4$, what is the upper bound of the limiting spectral distribution of $S$? [0.5 marks]. Plot the limiting spectral density of the eigenvalues of $S$ [0.5 marks].
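For plotting, it helps to have the limiting density as an R function of $x$, $c$ and $y$. The sketch below assumes the standard form of the Fisher LSD density, $(1-y)\sqrt{(b-x)(x-a)}\,/\,\big(2\pi x (c + yx)\big)$ on $[a, b]$ with $h = \sqrt{c + y - cy}$, $a = \big((1-h)/(1-y)\big)^2$ and $b = \big((1+h)/(1-y)\big)^2$; confirm this against [B] or the lecture notes. The function names are hypothetical, and the values `c0 = 0.5`, `y0 = 0.2` are illustrative only (they are not the values asked for above).

```r
# Sketch: density and support of the Fisher limiting spectral distribution.
# c_lim = lim p/m (for S1), y_lim = lim p/n (for S2, the inverted matrix), y_lim < 1.
fisher_support <- function(c_lim, y_lim) {
  h <- sqrt(c_lim + y_lim - c_lim * y_lim)
  c(a = ((1 - h) / (1 - y_lim))^2, b = ((1 + h) / (1 - y_lim))^2)
}

fisher_lsd <- function(x, c_lim, y_lim) {
  ab <- fisher_support(c_lim, y_lim)
  dens <- numeric(length(x))                 # zero outside the support
  inside <- x > ab[["a"]] & x < ab[["b"]]
  xi <- x[inside]
  dens[inside] <- (1 - y_lim) * sqrt((ab[["b"]] - xi) * (xi - ab[["a"]])) /
    (2 * pi * xi * (c_lim + y_lim * xi))
  dens
}

# Illustrative parameters (assumed for the example).
c0 <- 0.5; y0 <- 0.2
ab <- fisher_support(c0, y0)

# Sanity check: for c_lim <= 1 the density should integrate to about 1.
total <- integrate(fisher_lsd, lower = ab[["a"]], upper = ab[["b"]],
                   c_lim = c0, y_lim = y0)$value
print(total)

curve(fisher_lsd(x, c0, y0), from = 0, to = ab[["b"]] + 0.5, n = 1001,
      xlab = "x", ylab = "density", main = "Fisher LSD (illustrative c, y)")
```

The same function can be overlaid on a histogram of simulated eigenvalues with `curve(..., add = TRUE)`.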

(b) [1 mark] Suppose that $\Sigma_1 = \Sigma_2 + \Delta$ where $\Delta = \operatorname{diag}(\overbrace{a_1, \ldots, a_1}^{n_1}, 0, \ldots, 0)$ and $a_1 > 0$, i.e., $\Sigma_2$ is perturbed by a rank $n_1$ diagonal matrix $\Delta$. What is the critical value $\kappa$ for which $a_1 > \kappa$ creates "outlier" sample eigenvalues? [0.5 marks]. Suppose that $a_1 = \kappa + 1$, $c = 2/3$ and $y = 1/3$, what value do you expect these outlier eigenvalues to cluster around? [0.5 marks].


(c) [...] slightly larger than 1 (and less than $\kappa$)?

(d) [1 mark] Perform a simulation experiment to illustrate the phenomena in (b) in the case $\Sigma_2 = I_p$. That is, sample data and plot a histogram of eigenvalues of $S$ and compare it to the density obtained in (a). Can you see outlier eigenvalues?

(e) [1 mark] Perform a simulation experiment to empirically calculate the power of the method proposed in Section 7.1 of [B].

(f) [1 mark] Compare the results of your simulation experiment to the closed-form formula given in Theorem 7.1 of [B].

(g) [2 marks] Consider the signal detection problem where we are trying to determine the number of signals in observations of the form

$$x_i = U s_i + \varepsilon_i, \qquad i = 1, \ldots, m, \tag{SD}$$

where the $x_i$'s are $p$-dimensional observations, $s_i$ is a $k \times 1$ low dimensional signal ($k \ll p$) with covariance $I_k$, $U$ is a $p \times k$ mixing matrix, and $(\varepsilon_i)$ is an i.i.d. noise with covariance matrix $\Sigma_2$. None of the quantities on the right hand side of (SD) are observed. In [B], they propose to estimate the number of signals $k$ by

$$\hat{k} := \max\{i : \lambda_i \ge \beta + \log(p/p^{2/3})\},$$

where $(\lambda_i)$ are the eigenvalues of $S$. Reproduce Table 1 in [B] for the Gaussian case for values $p = 25, 75, 125, 175, 225, 275$.

(h) [1 mark] Comment on how the methods and theory considered in Question 3 might apply to Question 2.

Question 4 [5 marks]

(STAT7017 students only) We are now going to consider a high-dimensional dataset of hyperspectral data collected by the AVIRIS mission (https://aviris.jpl.nasa.gov/). It is a unique optical sensor that delivers calibrated images of spectral radiance in $p = 224$ contiguous spectral channels (bands) with wavelengths from 400 to 2500 nanometers. The main objective of the AVIRIS project is to identify, measure, and monitor constituents of the Earth's surface based on molecular absorption and particle scattering signatures.

We are going to consider preprocessed data where the radiance values have been converted to surface reflectance (i.e., percentage of light reflected at that wavelength: 0 = 0% reflected and 1 = 100% reflected) and a few bands have been removed so that $p = 188$. A small spatial region of 250 × 190 pixels outside of Las Vegas is considered. Each pixel represents roughly a 5 m² area on the ground. This gives us a datacube of dimension (rows, cols, p) = (250, 190, 188).

(a) [1 mark] Consider the principal component analysis (PCA) of this data to reduce it down to dimension $p = 3$. Compare the four results obtained from all combinations of the following options: (1) data has been de-meaned and not, (2) the covariance vs. the correlation matrix is used.

Often real-world datasets (such as this one!) exhibit data with heavier tails than the Gaussian distribution and this motivated researchers to introduce alternative versions of classic statistics that are less sensitive to outliers. One of these is Kendall's τ that replaces the covariance between two random variables with something more robust. This idea can be extended to the multivariate setting as follows: Let $x_1, \ldots, x_n$ be independent copies of a random vector $x \in \mathbb{R}^p$ with coordinates $x = (x^{(1)}, x^{(2)}, \ldots, x^{(p)})^T$. Kendall's τ matrix $T := (\tau_{k\ell})$ has entries given by

$$\tau_{k\ell} := \frac{1}{\binom{n}{2}} \sum_{1 \le i < j \le n} \operatorname{sign}\big(x_i^{(k)} - x_j^{(k)}\big)\, \operatorname{sign}\big(x_i^{(\ell)} - x_j^{(\ell)}\big), \qquad 1 \le k, \ell \le p. \tag{2}$$

The matrix $T$ is a popular replacement for the correlation matrix $R$. This begs the question: how do the eigenvalues of $T$ behave when $n, p \to \infty$ such that $p/n \to y \in (0, 1)$?
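To make definition (2) concrete, the sketch below evaluates it by brute force over all pairs $i < j$ and compares the result with base R's `cor(..., method = "kendall")`, which should agree when the data have no ties (as with continuous simulated data). The helper name `kendall_T`, the seed, and the small sizes `n = 30`, `p = 4` are assumptions chosen so the double loop stays fast.

```r
# Sketch: Kendall's tau matrix computed directly from definition (2).
kendall_T <- function(X) {
  n <- nrow(X)
  p <- ncol(X)
  Tm <- matrix(0, p, p)
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      s <- sign(X[i, ] - X[j, ])   # coordinate-wise signs for the pair (i, j)
      Tm <- Tm + tcrossprod(s)     # adds sign(...)[k] * sign(...)[l] for all k, l
    }
  }
  Tm / choose(n, 2)                # average over the n-choose-2 pairs
}

set.seed(3)                        # arbitrary seed
n <- 30; p <- 4
X <- matrix(rnorm(n * p), n, p)    # continuous data, so no ties

T.brute <- kendall_T(X)
T.cor   <- cor(X, method = "kendall")
print(max(abs(T.brute - T.cor)))   # difference between the two computations
```

The double loop costs on the order of $n^2 p^2$ operations, which is why a fast implementation is preferable for the larger simulations.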

(b) [1 mark] Recently it was shown in [C] that the empirical spectral distribution of $T$ converges in probability to

$$\frac{2}{3} Y + \frac{1}{3}, \tag{3}$$

where $Y$ is distributed according to the standard Marchenko-Pastur distribution with parameter $y$. Perform a simulation experiment to plot a histogram of empirical eigenvalues of $T$ compared to the density of (3) in the case $y = 0.5$. Use the function cor.fk from the pcaPP package for a fast implementation of (2).

(c) [...] Kendall's τ matrix $T$ is used instead of the correlation matrix $R$. Use the fa function from the psych package to do this as fit1 = fa(rR, nfactors = 3, rotate='none') where rR is the $T$ matrix. Compare no rotation to the "geominT" and "geominQ" rotations.

(d) [1 mark] Do you think that $m = 3$ is the appropriate number of factors for this AVIRIS dataset? Can you label the 3 factors for your favourite choice of rotation?

(e) [1 mark] How would you test for the appropriate number of factors?

Notation

$I_p$: Identity matrix of size $p \times p$.

$N_p(0, \Sigma)$: $p$-dimensional multivariate Normal distribution with (vector) mean 0 and covariance $\Sigma$.

References

[A] Johnson, Wichern (2007). Applied Multivariate Statistical Analysis. Pearson Prentice Hall.

[B] Wang, Yao (2017). Extreme eigenvalues of large-dimensional spiked Fisher matrices with application. Annals of Statistics, Vol 45, No. 1.

[C] Bandeira, Lodhia, Rigollet (2017). Marchenko-Pastur law for Kendall's tau. Electronic Communications in Probability, Vol 22, No. 32.

END OF EXAMINATION
