STAT3017/STAT7017 Big Data Statistics

 

Research School of Finance, Actuarial Studies & Statistics
EXAMINATION
Semester 2 - Final, 2020
STAT3017/STAT7017 Big Data Statistics
Examination Duration: 48 Hours
Reading Time: 0 Minutes
Exam Conditions:
• This is a take home exam.
• All material is allowed.
• You should not communicate with other students about solutions.
Materials Permitted In The Exam Venue:
N/A
Materials to Be Supplied to Students:
• An RMarkdown notebook with some example code.
Instructions To Students:
This exam contains 5 pages in total:
• Cover page (1 page)
• Final Exam (4 pages)
This ‘Final Exam’ has 4 questions for a total of 25 marks for STAT7017 students and 3
questions for a total of 20 marks for STAT3017 students.
Please attempt each question in the RMarkdown notebook and (if possible) return a ‘Knit’ed
version of your answers as a PDF uploaded to Wattle.
Semester 2 - Final Exam, 2020 STAT3017/STAT7017 Big Data Statistics
Question 1 [5 marks]
We start with some short theory questions:
(a) [1 marks] Suppose we had 250-dimensional observations $x_1, x_2, \ldots, x_{1000}$ that have
a multivariate normal distribution with mean zero and covariance $2I_p$. From these
observations we construct the sample covariance matrix S and plot a histogram of the
eigenvalues of S. What distribution will approximate the density of the eigenvalues?
And with what parameters?
(b) [1 marks] What does the Fisher limiting spectral distribution (LSD) $P_{s,t}(x)$ describe?
And why does the left endpoint of this distribution converge to $\frac{1}{4}(1 - s)^2$ as $t \to 1$?
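(c) [2 marks] Consider a sequence of two-dimensional random vectors $(x_i)_{i \ge 0}$ with
$x_i := (x_{i1}, x_{i2})'$ drawn from the bivariate normal distribution
$$x_i \sim N_2\left( \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix} \right).$$
What is the asymptotic distribution of $X := x_{n1} + x_{n2}^2$ as $n \to \infty$?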
(d) [1 marks] Why can we associate a distribution to the largest eigenvalue λ1 of a
sample covariance matrix S?
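For part (a), the comparison between the eigenvalue histogram and a candidate limiting density can be sketched in R as follows. This is a minimal sketch, assuming the rows of the data matrix are the observations and that the known zero mean allows S to be formed without centring; the helper names mp_edges and mp_density are illustrative and are not part of the supplied notebook.

```r
# Support edges of the Marchenko-Pastur law with ratio y = p/n and scale sigma2.
mp_edges <- function(y, sigma2 = 1) {
  c(lower = sigma2 * (1 - sqrt(y))^2, upper = sigma2 * (1 + sqrt(y))^2)
}

# Marchenko-Pastur density for 0 < y <= 1 (zero outside the support).
mp_density <- function(x, y, sigma2 = 1) {
  e <- mp_edges(y, sigma2)
  a <- e[["lower"]]
  b <- e[["upper"]]
  d <- numeric(length(x))
  inside <- x > a & x < b
  xi <- x[inside]
  d[inside] <- sqrt((b - xi) * (xi - a)) / (2 * pi * sigma2 * y * xi)
  d
}

set.seed(1)
p <- 250; n <- 1000; sigma2 <- 2
X <- matrix(rnorm(n * p, sd = sqrt(sigma2)), nrow = n, ncol = p)  # one observation per row
S <- crossprod(X) / n   # assumption: mean known to be zero, so no centring
lambda <- eigen(S, symmetric = TRUE, only.values = TRUE)$values

hist(lambda, breaks = 40, freq = FALSE, main = "Eigenvalues of S", xlab = "eigenvalue")
curve(mp_density(x, y = p / n, sigma2 = sigma2), add = TRUE, lwd = 2)
```

Calling mp_edges(p / n, sigma2) returns the two endpoints of the support, which can be marked on the histogram with abline(v = ...).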
Question 2 [6 marks]
In this question, we are going to consider the topic of factor analysis where the aim is to
describe the covariance relationships among many variables in terms of a few underlying,
but unobservable, random quantities called factors. Consider n daily returns from 1 Jan
2018 to 1 Jan 2019 for p = 11 stocks: BHP, RIO, ANZ, NAB, CBA, WBC, GXY, NUF,
CGC, CGF, WSA. You can use the Rmd file I’ve provided to download this data.
Now implement the "Principal Component Method" (PCM) found in Section 9.3 of [A]
using the correlation matrix R of the daily returns. That is:
(a) [1 marks] Determine the number of factors m using a scree plot and print out $\tilde{L}$ as
a table showing stock names as row labels and factor numbers as column headers.
(b) [1 marks] Print out the communalities $h_i^2$, the estimated specific variances $\tilde{\psi}_i$, and
the proportion of total sample variance due to the jth factor. Use these values to
argue that you’ve made the correct choice of m.
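The quantities in parts (a) and (b) can be obtained from the eigendecomposition of R. The sketch below assumes synthetic two-factor data in place of the real returns (which come from the supplied Rmd file); the names L_true, communality, tilde.psi and prop.var are illustrative, and m = 2 is fixed only because the synthetic data were generated that way.

```r
set.seed(42)
n <- 250; p <- 6; m <- 2

# Synthetic stand-in for the daily returns: two latent factors plus noise.
f <- matrix(rnorm(n * m), n, m)
L_true <- matrix(c(0.9, 0.8, 0.7, 0.1, 0.1, 0.2,
                   0.1, 0.2, 0.1, 0.8, 0.9, 0.7), p, m)
daily.returns <- f %*% t(L_true) + matrix(rnorm(n * p, sd = 0.5), n, p)
colnames(daily.returns) <- paste0("STOCK", 1:p)

R <- cor(daily.returns)
eig <- eigen(R, symmetric = TRUE)

# Scree plot: look for the elbow to choose m.
plot(eig$values, type = "b", xlab = "component", ylab = "eigenvalue",
     main = "Scree plot")

# PCM loadings: column j is sqrt(lambda_j) times the j-th eigenvector.
tilde.L <- eig$vectors[, 1:m, drop = FALSE] %*% diag(sqrt(eig$values[1:m]), m)
dimnames(tilde.L) <- list(colnames(daily.returns), paste0("F", 1:m))

communality <- rowSums(tilde.L^2)   # h_i^2
tilde.psi <- 1 - communality        # specific variances, since diag(R) = 1
prop.var <- eig$values[1:m] / p     # share of total variance per factor

print(round(tilde.L, 3))
print(round(cbind(communality, tilde.psi), 3))
print(round(prop.var, 3))
```

Because R has unit diagonal, the total sample variance is p, which is why each proportion is an eigenvalue divided by p.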
Now implement the "Maximum Likelihood Method" (see p. 495 in [A]) using the correlation
matrix R of the daily returns. For this part, you are allowed to use the built-in factanal
command in R, which implements the ML method by default.
(c) [1 marks] First, perform the Maximum Likelihood Method (MLM) without
rotation using something like fit = factanal(daily.returns, factors = m, rotation = "none", covmat = R)
where m is the number of factors you found previously and R is the
correlation matrix. Print out the loadings with print(fit) and compare them to
those found using the principal component method. What do you notice? Can you
give a label to one (or more) of these factors?
(d) [1 marks] We are now going to perform some factor rotations, see Section 9.4 in
[A]. Perform a varimax orthogonal rotation using varimax(tilde.L) where tilde.L
was derived using PCM. Now extract the MLM estimated loading matrix $\hat{L}$ from
fit using hat.L = loadings(fit). Now perform a varimax orthogonal rotation using
varimax(hat.L) and an oblique rotation using promax(hat.L). After rotation, do
the loadings group the stocks in the same manner? Which rotation do you prefer?
For your favourite rotation, can you give labels to the factors?
(e) [1 marks] Now perform a large sample test for the number of common factors by
testing the hypothesis $H_0 : \Sigma = LL^T + \Psi$ with your choice of m at level α = 0.05.
The test is given in Eq. (9-39) of [A] and, since we are using the correlation matrix
R, it is based on the determinant of the matrix
$$R^{-1}(\hat{L}\hat{L}^T + \hat{\Psi}) \qquad (1)$$
and uses a chi-square approximation to the sampling distribution. Implement this
test in R. For your choice of m, do you accept or reject the null hypothesis?
(f) [1 marks] Considering the theory we’ve learnt this semester, comment on the form
of (1) and the use of the chi-square approximation for the sampling distribution
for the situation where the number of stocks p became large and yn := p/n = 0.5.
What might be a better alternative for this high-dimensional case?
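For part (e), one way the large-sample test might be coded is sketched below. It assumes the Bartlett-corrected statistic $(n - 1 - (2p + 4m + 5)/6)\,\ln\det\big(R^{-1}(\hat{L}\hat{L}^T + \hat{\Psi})\big)$ compared with a chi-square distribution on $[(p - m)^2 - p - m]/2$ degrees of freedom, which should be checked against Eq. (9-39) of [A]. The data are synthetic stand-ins for the real returns, and the names log.ratio, stat and reject are illustrative.

```r
set.seed(7)
n <- 250; p <- 6; m <- 2

# Synthetic stand-in for the daily returns: two latent factors plus noise.
f <- matrix(rnorm(n * m), n, m)
L_true <- matrix(c(0.9, 0.8, 0.7, 0.1, 0.1, 0.2,
                   0.1, 0.2, 0.1, 0.8, 0.9, 0.7), p, m)
daily.returns <- f %*% t(L_true) + matrix(rnorm(n * p, sd = 0.5), n, p)
R <- cor(daily.returns)

# ML fit on the correlation matrix; n.obs is needed for factanal's own test.
fit <- factanal(covmat = R, factors = m, n.obs = n, rotation = "none")
hat.L <- unclass(loadings(fit))
hat.Psi <- diag(fit$uniquenesses)
Sigma.hat <- hat.L %*% t(hat.L) + hat.Psi

# log det(R^{-1} (L L^T + Psi)) = log det(L L^T + Psi) - log det(R)
log.ratio <- as.numeric(determinant(Sigma.hat)$modulus - determinant(R)$modulus)

stat <- (n - 1 - (2 * p + 4 * m + 5) / 6) * log.ratio
df <- ((p - m)^2 - p - m) / 2
p.value <- pchisq(stat, df = df, lower.tail = FALSE)
reject <- p.value < 0.05

cat("statistic =", stat, " df =", df, " p-value =", p.value, "\n")
```

When n.obs is supplied, factanal also reports fit$STATISTIC, fit$dof and fit$PVAL, which can serve as a cross-check on the hand-coded values.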
Question 3 [9 marks]
We are now going to consider the theory of spiked Fisher matrices from the recent paper
[B]. Consider two p-variate populations with covariance matrices $\Sigma_1$ and $\Sigma_2 = I_p$, and
let $S_1$ and $S_2$ be the sample covariance matrices for samples of the two populations with
degrees of freedom m and n, respectively. We set $S := S_2^{-1} S_1$.
(a) [1 marks] Suppose we had p-dimensional random variables $x_1, \ldots, x_{m+1} \sim N_p(0, \Sigma_1)$
and p-dimensional random variables $z_1, \ldots, z_{n+1} \sim N_p(0, I_p)$. We stack these random
variables to obtain the data matrices X and Z and sample covariance matrices
$$S_1 := \frac{1}{m} XX^T, \quad S_2 := \frac{1}{n} ZZ^T, \quad S := S_2^{-1} S_1.$$
Now assume $n, m, p \to \infty$ such that $y_p := p/n \to y \in (0, 1)$ and $c_p := p/m \to c > 0$.
For y = 1/2 and c = 1/4, what is the upper bound of the limiting spectral distribution
of S? [0.5 marks]. Plot the limiting spectral density of the eigenvalues of S [0.5
marks].
(b) [1 marks] Suppose that $\Sigma_1 = \Sigma_2 + \Delta$ where $\Delta = \mathrm{diag}(\overbrace{a_1, \ldots, a_1}^{n_1}, 0, \ldots, 0)$ and
$a_1 > 0$, i.e., $\Sigma_2$ is perturbed by a rank $n_1$ diagonal matrix $\Delta$. What is the critical
value κ for which a1 > κ creates “outlier” sample eigenvalues? [0.5 marks]. Suppose
that a1 = κ + 1, c = 2/3 and y = 1/3, what value do you expect these outlier
eigenvalues cluster around? [0.5 marks].
(c) [1 marks] Continuing question (b), what would you expect to happen if a1 was only
slightly larger than 1 (and less than κ)?
(d) [1 marks] Perform a simulation experiment to illustrate the phenomena in (b) in the
case Σ2 = Ip. That is, sample data and plot a histogram of eigenvalues of S and
compare it to the density obtained in (a). Can you see outlier eigenvalues?
(e) [1 marks] Perform a simulation experiment to empirically calculate the power of the
method proposed in Section 7.1 of [B].
(f) [1 marks] Compare the results of your simulation experiment to the closed-form
formula given in Theorem 7.1 of [B].
(g) [2 marks] Consider the signal detection problem where we are trying to determine
the number of signals in observations of the form
$$x_i = U s_i + \varepsilon_i, \quad i = 1, \ldots, m, \qquad \text{(SD)}$$
where the $x_i$'s are p-dimensional observations, $s_i$ is a $k \times 1$ low-dimensional signal
($k \ll p$) with covariance $I_k$, U is a $p \times k$ mixing matrix, and $(\varepsilon_i)$ is an i.i.d. noise
with covariance matrix $\Sigma_2$. None of the quantities on the right-hand side of (SD)
are observed. In [B], they propose to estimate the number of signals k by
$$\hat{k} := \max\{i : \lambda_i \ge \beta + \log(p/p^{2/3})\},$$
where $(\lambda_i)$ are the eigenvalues of S. Reproduce Table 1 in [B] for the Gaussian case
for values p = 25, 75, 125, 175, 225, 275.
(h) [1 marks] Comment how the methods and theory considered in Question 3 might
apply to Question 2.
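For parts (a) and (d), the overlay of a limiting density on simulated eigenvalues of S might be sketched in R as below, here for the unperturbed case $\Sigma_1 = \Sigma_2 = I_p$. The endpoint and density formulas are the standard Fisher LSD expressions with $h^2 = c + y - cy$, stated as an assumption to be checked against [B]. The names fisher_edges, fisher_density, cm (for p/m) and yn (for p/n) are illustrative, and m and n columns are used in place of m + 1 and n + 1 for simplicity.

```r
# Support edges of the Fisher LSD; cm = p/m (numerator), yn = p/n (inverted matrix).
fisher_edges <- function(cm, yn) {
  h <- sqrt(cm + yn - cm * yn)
  c(lower = (1 - h)^2 / (1 - yn)^2, upper = (1 + h)^2 / (1 - yn)^2)
}

# Fisher LSD density on its support (zero elsewhere); assumes cm <= 1 and yn < 1.
fisher_density <- function(x, cm, yn) {
  e <- fisher_edges(cm, yn)
  a <- e[["lower"]]
  b <- e[["upper"]]
  d <- numeric(length(x))
  inside <- x > a & x < b
  xi <- x[inside]
  d[inside] <- (1 - yn) * sqrt((b - xi) * (xi - a)) /
    (2 * pi * xi * (cm + yn * xi))
  d
}

set.seed(3)
p <- 200; n <- 400; m <- 800          # y = p/n = 1/2, c = p/m = 1/4
X <- matrix(rnorm(p * m), p, m)       # columns are draws from N_p(0, I_p)
Z <- matrix(rnorm(p * n), p, n)
S1 <- tcrossprod(X) / m
S2 <- tcrossprod(Z) / n

# Eigenvalues of S = S2^{-1} S1; Re() drops any tiny imaginary round-off.
lambda <- Re(eigen(solve(S2, S1), only.values = TRUE)$values)

hist(lambda, breaks = 40, freq = FALSE, main = "Eigenvalues of S", xlab = "eigenvalue")
curve(fisher_density(x, cm = p / m, yn = p / n), add = TRUE, lwd = 2)
abline(v = fisher_edges(p / m, p / n), lty = 2)
```

To look for outliers as in part (d), the first few diagonal entries of the population covariance of X can be inflated (for example by scaling the first rows of X) before forming S1; eigenvalues landing to the right of the dashed upper edge are the candidates.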
Question 4 [5 marks]
(STAT7017 students only) We are now going to consider a high-dimensional dataset of
hyperspectral data collected by the AVIRIS mission (https://aviris.jpl.nasa.gov/).
It is a unique optical sensor that delivers calibrated images of spectral radiance in p = 224
contiguous spectral channels (bands) with wavelengths from 400 to 2500 nanometers.
The main objective of the AVIRIS project is to identify, measure, and monitor constituents
of the Earth’s surface based on molecular absorption and particle scattering signatures.
We are going to consider preprocessed data where the radiance values have been converted
to surface reflectance (i.e., percentage of light reflected at that wavelength: 0 = 0%
reflected and 1 = 100% reflected) and a few bands have been removed so that p = 188.
A small spatial region of 250 × 190 pixels outside of Las Vegas is considered. Each pixel
represents roughly a 5 m² area on the ground. This gives us a datacube of dimension
(rows, cols, p) = (250, 190, 188).
(a) [1 marks] Consider the principal component analysis (PCA) of this data to reduce it
down to dimension p = 3. Compare the four results obtained from all combinations
of the following options: (1) data has been de-meaned and not, (2) the covariance
vs. the correlation matrix is used.
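For part (a), the four PCA variants could be organised as in the sketch below. It assumes a small random array in place of the AVIRIS datacube and uses prcomp, whose center and scale. arguments correspond to the two options; the names cube, opts and scores are illustrative. Note that with center = FALSE and scale. = TRUE, prcomp divides each column by its root mean square rather than its standard deviation.

```r
set.seed(11)
rows <- 25; cols <- 19; p <- 12   # small synthetic stand-in for (250, 190, 188)
cube <- array(runif(rows * cols * p), dim = c(rows, cols, p))

# Flatten the two spatial dimensions: one row per pixel, one column per band.
X <- matrix(cube, nrow = rows * cols, ncol = p)

# All four combinations of de-meaning and covariance vs. correlation scaling.
opts <- expand.grid(center = c(TRUE, FALSE), scale = c(TRUE, FALSE))
scores <- lapply(seq_len(nrow(opts)), function(i) {
  fit <- prcomp(X, center = opts$center[i], scale. = opts$scale[i])
  fit$x[, 1:3]                      # keep the first three components
})
names(scores) <- paste0("center=", opts$center, ", scale=", opts$scale)

# Any score column can be reshaped back into a rows x cols image for plotting.
pc1.image <- matrix(scores[[1]][, 1], nrow = rows, ncol = cols)
image(pc1.image, main = names(scores)[1])
```

Reshaping with matrix() relies on R's column-major storage, so pixel (r, c) of the cube ends up in row r + (c - 1) * rows of X.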
Often real-world datasets (such as this one!) exhibit data with heavier tails than the
Gaussian distribution and this motivated researchers to introduce alternative versions
of classic statistics that are less sensitive to outliers. One of these is Kendall’s τ that
replaces the covariance between two random variables with something more robust. This
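idea can be extended to the multivariate setting as follows: Let $x_1, \ldots, x_n$ be independent
copies of a random vector $x \in \mathbb{R}^p$ with coordinates $x = (x^{(1)}, x^{(2)}, \ldots, x^{(p)})^T$. Kendall’s
τ matrix $T := (\tau_{k\ell})$ has entries given by
$$\tau_{k\ell} := \frac{1}{\binom{n}{2}} \sum_{1 \le i < j \le n} \mathrm{sign}\big(x_i^{(k)} - x_j^{(k)}\big)\, \mathrm{sign}\big(x_i^{(\ell)} - x_j^{(\ell)}\big), \quad 1 \le k, \ell \le p. \qquad (2)$$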
The matrix T is a popular replacement for the correlation matrix R. This begs the question:
how do the eigenvalues of T behave when n, p → ∞ such that p/n → y ∈ (0, 1)?
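To make the definition of T concrete, here is a direct (and deliberately naive) R sketch of the pairwise-sign formula, averaging products of signs of coordinate differences over all pairs i < j. The function name kendall_matrix is illustrative; the sketch assumes continuous data without ties and needs memory proportional to n² p, so it is only meant for small examples.

```r
# Kendall's tau matrix from the pairwise-sign definition.
# X has one observation per row and one coordinate per column.
kendall_matrix <- function(X) {
  n <- nrow(X)
  pairs <- combn(n, 2)   # 2 x choose(n, 2) matrix of index pairs with i < j
  # Row r of D holds sign(x_i - x_j) coordinate-wise for the r-th pair.
  D <- sign(X[pairs[1, ], , drop = FALSE] - X[pairs[2, ], , drop = FALSE])
  crossprod(D) / choose(n, 2)
}

set.seed(5)
X <- matrix(rt(40 * 4, df = 3), 40, 4)   # heavy-tailed toy data, no ties
T.naive <- kendall_matrix(X)
T.base <- cor(X, method = "kendall")     # base R reference implementation

print(round(T.naive, 3))
print(max(abs(T.naive - T.base)))
```

Without ties the two results agree; for larger n and p a faster routine is preferable to this quadratic-memory version.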
(b) [1 marks] Recently it was shown in [C] that the empirical spectral distribution of T
converges in probability to
$$\tfrac{2}{3} Y + \tfrac{1}{3}, \qquad (3)$$
where Y is distributed according to the standard Marchenko-Pastur distribution with
parameter y. Perform a simulation experiment to plot a histogram of empirical
eigenvalues of T compared to the density of (3) in the case y = 0.5. Use the
function cor.fk from the pcaPP package for a fast implementation of (2).
(c) [1 marks] Perform a factor analysis with m = 3 factors on the AVIRIS data where
Kendall’s τ matrix T is used instead of the correlation matrix R. Use the fa function
from the psych package to do this as fit1 = fa(rR, nfactors = 3, rotate = "none")
where rR is the T matrix. Compare no rotation to the "geominT" and "geominQ"
rotations.
(d) [1 marks] Do you think that m = 3 is the appropriate number of factors for this
AVIRIS dataset? Can you label the 3 factors for your favourite choice of rotation?
(e) [1 marks] How would you test for the appropriate number of factors?
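For part (b), the density of an affine transform of a Marchenko-Pastur variable follows from a change of variables: if W = (2/3)Y + 1/3 then W has density (3/2) f_Y((3w - 1)/2). The sketch below is a minimal illustration under that assumption, using independent Gaussian coordinates and base R's cor(method = "kendall") so that it runs without extra packages (cor.fk from pcaPP is the faster routine named in the question). The names mp_density, kendall_lsd_density and kendall_lsd_edges are illustrative.

```r
# Standard Marchenko-Pastur density with ratio 0 < y <= 1 (unit scale).
mp_density <- function(x, y) {
  a <- (1 - sqrt(y))^2
  b <- (1 + sqrt(y))^2
  d <- numeric(length(x))
  inside <- x > a & x < b
  xi <- x[inside]
  d[inside] <- sqrt((b - xi) * (xi - a)) / (2 * pi * y * xi)
  d
}

# Density and support of W = (2/3) Y + 1/3 by change of variables.
kendall_lsd_density <- function(w, y) 1.5 * mp_density((3 * w - 1) / 2, y)
kendall_lsd_edges <- function(y) {
  (2 / 3) * c((1 - sqrt(y))^2, (1 + sqrt(y))^2) + 1 / 3
}

set.seed(9)
n <- 160; p <- 80                      # y = p/n = 0.5
X <- matrix(rnorm(n * p), n, p)        # independent continuous coordinates
Tmat <- cor(X, method = "kendall")     # slower than pcaPP::cor.fk, same matrix
lambda <- eigen(Tmat, symmetric = TRUE, only.values = TRUE)$values

hist(lambda, breaks = 30, freq = FALSE, main = "Eigenvalues of T", xlab = "eigenvalue")
curve(kendall_lsd_density(x, y = p / n), add = TRUE, lwd = 2)
abline(v = kendall_lsd_edges(p / n), lty = 2)
```

Larger n and p give a closer match between histogram and curve, at the cost of a slower Kendall computation in base R.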
Notation
$I_p$: Identity matrix of size p × p.
$N_p(0, \Sigma)$: p-dimensional multivariate Normal distribution with (vector) mean 0 and covariance Σ.
References
[A] Johnson, Wichern (2007). Applied Multivariate Statistical Analysis. Pearson Prentice Hall.
[B] Wang, Yao (2017). Extreme eigenvalues of large-dimensional spiked Fisher matrices with application.
Annals of Statistics, Vol 45, No. 1.
[C] Bandeira, Lodhia, Rigollet (2017). Marchenko-Pastur law for Kendall’s tau. Electronic Communications
in Probability, Vol 22, No. 32.
END OF EXAMINATION