Mid-term Project of 6289 (Due: Oct 29, 2019) Name:

• Submit your answers before 2:00 pm on Oct 29.

• You may talk with one another about the project, but the work you turn in should be your own.

• Use a word processor that can handle mathematics (like LaTeX or Word) and can include graphics. Handwritten work is not accepted.

• Finalize your code (with file name: yourname-6289.*) and email me a copy for verification.

1. (80) Duchenne Muscular Dystrophy (DMD) is a sex-linked genetic disease. Boys with the disease usually die at a young age, while affected girls usually do not suffer symptoms and may unknowingly carry the disease and pass it to their offspring. It is desirable to have some kind of test to detect whether or not a woman is a carrier of the disease. The dataset dystrophy.txt contains information from a 1981 study attempting to develop such a test based on two serum enzymes, creatine kinase (CK) and hemopexin (H), for 38 known DMD carriers (Case) and 82 women who are not carriers (Control). (Note: In the last 30 years, advances in DNA sequencing technology have made it possible to obtain definitive answers; however, tests based on the above proteins are still used as rapid and inexpensive alternatives.)

(a) Use logistic regression to model the way in which case/control status depends on creatine kinase and hemopexin. Construct (using the Wald approach) a table containing the estimated odds ratios and p-values for the two enzymes. Provide confidence intervals for the odds ratios, and give some thought as to what would constitute a meaningful difference (δ_j) for the two enzymes when calculating the odds ratios.

(b) Can you calculate confidence intervals for the odds ratios in part (a) using the likelihood ratio approach? If so, calculate them. If not, explain why you can’t do so.

(c) Can you carry out the hypothesis testing in part (a) using the likelihood ratio approach? If so, perform the tests. If not, explain why you can’t do so.

(d) Describe (quantitatively) the relationship between creatine kinase levels and the likelihood that a woman is a carrier without using the phrase “odds ratio” (you can use “odds”, just not “odds ratio”).

(e) Suppose a woman randomly selected from the population has a hemopexin level of 100 and a creatine kinase level of 150. Can you estimate the probability that she is a carrier? If so, estimate it. If not, explain why you can’t do so.

(f) It is estimated that 1 in 3,300 women are carriers. Treating this as a known constant, calculate the sampling ratio τ1/τ0.

(g) Based on your answer to (f), calculate the probability from part (e).

(h) Compare the following three numbers: (i) the probability you calculated in (g), and (ii) the marginal probability of being a carrier (i.e., if you don’t know a woman’s hemopexin/creatine kinase levels).
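
As a hedged illustration of the quantities that parts (a), (f), and (g) refer to, the following Python sketch shows one common way to compute a Wald odds ratio for a δ-unit change, a case-control sampling ratio, and a prevalence-adjusted probability. It is a sketch under stated assumptions, not the required solution: the function names are hypothetical, the coefficient and standard error in the usage lines are made-up placeholders rather than estimates from dystrophy.txt, and the sampling ratio assumes that τ1 and τ0 denote the probabilities of a carrier and a non-carrier being sampled, with cases and controls each drawn at random from their own subpopulations.

```python
import math
from statistics import NormalDist


def wald_odds_ratio(beta, se, delta=1.0, level=0.95):
    """Odds ratio, Wald interval, and two-sided Wald p-value.

    beta, se: a fitted logistic-regression coefficient and its standard error.
    delta: size of the "meaningful difference" in the predictor (assumed > 0).
    """
    z_crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    odds_ratio = math.exp(delta * beta)
    lower = math.exp(delta * (beta - z_crit * se))
    upper = math.exp(delta * (beta + z_crit * se))
    z = beta / se  # Wald statistic for H0: beta = 0
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return odds_ratio, (lower, upper), p_value


def sampling_ratio(n_cases, n_controls, prevalence):
    """tau1/tau0, assuming tau1 = n_cases / (N * prevalence) and
    tau0 = n_controls / (N * (1 - prevalence)); the population size N cancels."""
    return (n_cases / n_controls) * (1.0 - prevalence) / prevalence


def population_probability(linear_predictor, ratio):
    """Shift the case-control linear predictor by -log(tau1/tau0),
    then map it to a probability with the inverse logit."""
    eta = linear_predictor - math.log(ratio)
    return 1.0 / (1.0 + math.exp(-eta))


if __name__ == "__main__":
    # Hypothetical coefficient and standard error, for illustration only.
    or_, ci, p = wald_odds_ratio(beta=0.05, se=0.02, delta=10)
    print("OR per 10 units:", or_, "CI:", ci, "p-value:", p)

    # Sample sizes and prevalence as stated in the problem.
    ratio = sampling_ratio(38, 82, 1 / 3300)
    print("sampling ratio:", ratio)
```

Only the intercept is shifted in this adjustment, so the odds ratios from the case-control fit are left unchanged; that is why the correction is applied to the linear predictor rather than to each coefficient.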

2. (20) Consider a binary response variable Y and logistic regression. We focus on the group Lasso with loss function given by the negative log-likelihood as (see also equation (3.3) in the HDDA book)

Write the block coordinate gradient descent algorithm (Algorithm 3 in the HDDA book) with explicit formulae (see equations (4.20) and (4.21) in the HDDA book, where 0 < δ < 1, 0 < σ < 1, and ∆^[m] is the improvement in the objective function Q_λ(·) when using a linear approximation for the objective function, i.e.,

