Assessment 5 (Due by 23:59 on 31 Oct 2024)
The following three questions relate to the four references listed at the end of this question sheet. Please read the four papers and work on these questions. Note that you do not need to go through the full contents of these papers; extract the information that is useful for working out the assignment questions.
Question 1 [40 marks]
Consider two p-dimensional populations with covariance matrices Ip and (Ip + ∆), where
(1)
with δ1, δ2 ∈ R. Suppose we have p-dimensional random samples x1, x2, . . . , xm+1 from the normal distribution N (0p, Ip) and p-dimensional random samples z1, z2, . . . , zn+1 from the normal distribution N (0p, Ip + ∆). We stack these random samples to obtain the data matrices X and Z and sample covariance matrices
(2)
(a). [5 marks]. Assume n, m, p → ∞ such that yn := p/n → y ∈ (0, 1) and cm := p/m → c > 0. Take δ1 = δ2 = 0, y = 1/4, and c = 3/4. What are the lower bound a and the upper bound b of the support of the limiting spectral distribution of S? For each, give a formula in terms of c and y.
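A minimal Python sketch of how such bounds can be evaluated numerically. It assumes, since equation (2) is not reproduced above, that S is the Fisher matrix in which the sample covariance matrix of Z (ratio y) is inverted and multiplied by that of X (ratio c), and it uses the standard support edges of the limiting spectral distribution of a Fisher matrix. The function name `fisher_support_edges` is hypothetical; check the formula against the references before relying on it.

```python
import math


def fisher_support_edges(c, y):
    """Return (a, b), the assumed support edges of the limiting spectral
    distribution of a Fisher matrix.

    Assumption: c = lim p/m belongs to the non-inverted sample covariance
    matrix and y = lim p/n, with 0 < y < 1, to the inverted one.
    """
    if not (0.0 < y < 1.0) or c <= 0.0:
        raise ValueError("need c > 0 and 0 < y < 1")
    h = math.sqrt(c + y - c * y)
    a = ((1.0 - h) / (1.0 - y)) ** 2
    b = ((1.0 + h) / (1.0 - y)) ** 2
    return a, b


# Values of c and y taken from part (a).
a, b = fisher_support_edges(c=0.75, y=0.25)
print(f"a = {a:.4f}, b = {b:.4f}")
```

Under this assumption the product a·b equals ((1 − c)/(1 − y))², which gives a quick sanity check on any implementation.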
(b). [5 marks]. Suppose that δ1 = −ε and δ2 = ε with ε = 1/10. Would you expect S to have eigenvalues smaller than a and larger than b in this case? Please provide your reason.
(c). [10 marks]. In the paper Han et al. (2016), it is suggested that the largest eigenvalue λ1 of S, scaled as (λ1 − b)/sp, where b is from Question 1(a) and , behaves like a Tracy-Widom distribution of order 1. Show this using a simulation in the case n = 400, yn = 1/4 and cm = 3/4. Plot the histogram and compare it against the Tracy-Widom distribution of order 1.
(d). [10 marks]. The paper Wang and Yao (2017) also studies the extreme eigenvalues of Fisher matrices. Suppose that δ1 < ℓ and δ2 > κ for some choice of ℓ and κ. What are the critical values of ℓ and κ that ensure there is a large fundamental spike and a small fundamental spike? Give a formula for ℓ and κ. Also provide a simulation to give a numerical value for the formula in the case y = 1/4 and c = 3/4.
(e). [10 marks]. Suppose that δ1 = ℓ − 1/100 and δ2 = κ + 1/100 for the critical values of κ and ℓ you found in (d). Give a formula for each of the two locations around which you think the spike eigenvalues will cluster, and also a numerical value for each. Also, perform a simulation experiment to illustrate this phenomenon. That is, sample data and plot a histogram of the eigenvalues of S, compare it to the theoretical density expected if δ1 = δ2 = 0, and mark the locations around which you expect the spike eigenvalues to cluster. Take n = 400, yn = 1/4, and cm = 3/4.
Question 2 [40 marks]
In this question, we consider high-dimensional sample covariance matrices of data sampled from an elliptical distribution. We say that a random vector x with zero mean follows an elliptical distribution if and only if it has the stochastic representation
x = ξAu, (3)
where the matrix A ∈ R^{p×p} is nonrandom with rank(A) = p, ξ ≥ 0 is a random variable representing the radius of x, and u ∈ R^p is the random direction, which is independent of ξ and uniformly distributed on the unit sphere S^{p−1} in R^p, denoted by u ∼ Unif(S^{p−1}). The class of elliptical distributions is a natural generalization of the multivariate normal distribution, and contains many widely used distributions as special cases, including the multivariate t-distribution, the symmetric multivariate Laplace distribution and the symmetric multivariate stable distribution.
(a). [20 marks]. Write a function runifsphere(n, p) that samples n observations from the distribution Unif(S^{p−1}) using the fact that if z ∼ N (0p, Ip), then z/||z|| ∼ Unif(S^{p−1}). Check your results by: (1) setting p = 25, n = 50 and showing that the Euclidean norm of each observation is equal to 1; (2) generating a scatter plot in the case p = 2, n = 500 to show that the samples lie on a circle.
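The normalisation fact above can be sketched as follows. This is an illustrative Python version using only the standard library, not the R function the question asks for; the optional `rng` argument and the variable names are assumptions made for reproducibility.

```python
import math
import random


def runifsphere(n, p, rng=None):
    """Draw n points uniformly from the unit sphere in R^p.

    Each point is a standard normal vector z divided by its Euclidean
    norm. A zero norm has probability zero and is not handled here.
    """
    rng = rng or random.Random()
    samples = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]
        norm = math.sqrt(sum(v * v for v in z))
        samples.append([v / norm for v in z])
    return samples


# Check (1): with p = 25 and n = 50, every row should have norm 1
# up to floating-point rounding.
rng = random.Random(2024)
obs = runifsphere(50, 25, rng)
norms = [math.sqrt(sum(v * v for v in row)) for row in obs]
max_dev = max(abs(r - 1.0) for r in norms)
print(f"largest deviation of a norm from 1: {max_dev:.2e}")
```

For check (2), the same function called with p = 2 and n = 500 returns pairs that can be passed to any scatter-plot routine.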
Show that you can simulate a multivariate t-distribution tν(0p, Ip) by setting in (3) with A = Ip and Do this by sampling observations x1, x2, . . . , xn and comparing the two marginal histograms of the observations against the density of the univariate tν distribution.
(b). [20 marks]. Suppose that x1, x2, . . . , xn are p-dimensional observations sampled from an elliptical distribution as in (3). We stack these observations into the data matrix X and calculate the sample covariance matrix Sn. Theorem 2.2 of the paper Hu et al. (2019) is a central limit theorem for linear spectral statistics (LSS) of Sn. For example, equation (2.10) in Hu et al. (2019) gives the joint distribution of the LSS for ϕ1(x) = x and ϕ2(x) = x2. Following the notation used there (for all the following terms in this equation), perform a simulation experiment to examine the fluctuations of and . In the experiment, take with δ(x) being the Dirac delta function, and choose the distribution of ξ ∼ k1 · Gamma(p, 1) with . Set the dimensions to be p = 200 and n = 400. Choose the number of simulations based on the computational power of your machine. Similar to Figure 1 in the paper Hu et al. (2019), use a QQ-plot to show normality.
Question 3 (only for STAT6017 students) [20 marks]
The results of the paper Hu et al. (2019) do not cover all elliptical distributions because of a moment condition on the population distribution; see Table 1 in Hu et al. (2019). The paper Zhang et al. (2022) extends these results to more general elliptical distributions, such as multivariate Gaussian mixtures. A p-dimensional vector x ∈ R^p is a multivariate Gaussian mixture with k sub-populations if its density function has the form
f(x) = Σ_{j=1}^{k} pj ϕ(x; µj, Σj), (4)
where pj, j = 1, . . . , k, are the k mixing weights and ϕ(·; µj, Σj) denotes the density function of the j-th sub-population with mean vector µj and covariance matrix Σj. Consider the case where µ1 = µ2 = · · · = µk = 0 ∈ R^p and Σj = vj · Σ for some vj > 0, j = 1, . . . , k. Write an R function to sample from such a distribution using the representation from equation (11) in the paper Zhang et al. (2022).
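One way to see why such a mixture is elliptical is the scale-mixture sketch below, written in Python with the standard library only. It is not taken from equation (11) of Zhang et al. (2022), whose exact form should be checked in the paper. It assumes Σ = AAᵀ for a supplied square matrix A, picks a sub-population J with probability pJ, and sets x = ξAu with ξ² = vJ · χ²p and u uniform on the unit sphere. The function name `rscalemix` and its arguments are hypothetical.

```python
import math
import random


def rscalemix(n, weights, scales, A, rng=None):
    """Draw n observations from sum_j p_j N(0, v_j * A A^T).

    Assumption: Sigma = A A^T, with A given as a list of p rows.
    """
    rng = rng or random.Random()
    p = len(A)
    draws = []
    for _ in range(n):
        # Pick the sub-population index J with probabilities p_j.
        j = rng.choices(range(len(weights)), weights=weights)[0]
        # Direction u, uniform on the unit sphere in R^p.
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]
        norm = math.sqrt(sum(v * v for v in z))
        u = [v / norm for v in z]
        # Radius xi, independent of u: xi^2 = v_J * chi-square(p).
        chi2 = rng.gammavariate(p / 2.0, 2.0)
        xi = math.sqrt(scales[j] * chi2)
        # x = xi * A u
        draws.append([xi * sum(A[r][s] * u[s] for s in range(p))
                      for r in range(p)])
    return draws


# Two equally weighted components with v = (1, 4) and Sigma = I_2,
# so each coordinate has variance 0.5 * 1 + 0.5 * 4 = 2.5.
rng = random.Random(6017)
draws = rscalemix(20000, [0.5, 0.5], [1.0, 4.0],
                  [[1.0, 0.0], [0.0, 1.0]], rng)
second_moment = sum(x[0] ** 2 for x in draws) / len(draws)
print(f"sample second moment of the first coordinate: {second_moment:.3f}")
```

The radius is drawn independently of the direction, which matches the structure of representation (3); conditional on J, the product of the square root of a χ²p variable and u is a standard normal vector.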
Note: This homework is to be submitted through Wattle in digital form only, as per ANU policy. The R code for any computational question must be supplied.
References
Han, X., G. Pan, and B. Zhang (2016). The Tracy-Widom law for the largest eigenvalue of F-type matrices. The Annals of Statistics 44 (4), 1564–1592.
Hu, J., W. Li, Z. Liu, and W. Zhou (2019). High-dimensional covariance matrices in elliptical distributions with application to spherical test. The Annals of Statistics 47 (1), 527–555.
Wang, Q. and J. Yao (2017). Extreme eigenvalues of large-dimensional spiked Fisher matrices with application. The Annals of Statistics 45 (1), 415–460.
Zhang, Y., J. Hu, and W. Li (2022). CLT for linear spectral statistics of high-dimensional sample covariance matrices in elliptical distributions. Journal of Multivariate Analysis 191, 105007.