讲解数据结构、讲解 Simulation Studies, Integration

MAT 4374 Assignment 3 - Simulation Studies, Integration
1. Please see the guidelines on the class website for formatting your assignment. If the assign-
ment is not formatted adequately, you may lose marks. In particular, the marker won’t be
searching your assignment for the answer/ gure/code. You need to clearly indicate each
of these.
2. Feel free to work in pairs on any of the programming portions of the question. Just be sure
to write up your solution separately.
Questions:
1. Robustness of the two sample t-test. Recall the two sample t-test with equal variance
that is used to test the null hypothesis that X = Y :
Let X1;X2;:::;Xn N( x; 2) and Y1;Y2;:::;Ym N( Y; 2). Then
where X and Y are the sample means; S2X and S2Y are the sample variances and
Note that this statistical test is valid if the following two assumptions are met: (1) the
underlying distributions are Normal and (2) the variances 2 are equal.
Use a simulation study to investigate the e ects of violation of these assumptions on the
type 1 error rate and power of the test. Also examine how unequal sample sizes from the
two groups a ect the results. Write a short report summarizing your ndings. Be sure to
clearly describe your simulation procedure and summarize results with the help of summary
statistics, gures and/or tables.
2. Let X U(1;a), a> 1, and let Y = f(X) = a 1X .
(a) Show that E[f(X)] = ln(a).
(b) Use the R function integrate to approximate E[Y] for a = (1:1;1:5;2:5;4). Create
a table that shows the true value, the value approximated with the R function, and
the error.
1
MAT4374/5182 Assignment 3
(c) Use the Trapezoidal rule and Simpson’s rule to to approximate E[Y] for the same
values of a as part (b). Create a table for each approximation showing the error
in estimates for di erent numbers of subintervals. Compare the accuracy of the two
methods at equivalent number of subintervals. Compare run times of the two methods
at equivalent number of subintervals.
3. Estimate 2 = E[X2] when X has density proportional to g(x) = exp( jxj3=3) using:
(a) Monte Carlo integration using rejection sampling to sample from g(x).
(b) Importance sampling (weights must be standardized as we don’t know normalization
constant)
4. (Adapted from question 7.5 of Givens/Hoeting). A Bayesian analysis of a real dataset
(breastcancer.dat; information provided in breastcancer.txt)
(a) Read the summary information on the model provided starting on the next page.
(b) Program a Gibbs sampler to sample and from the full conditionals that are given
in the summary information. Note that the dataset has the variable censoring equal
to 1 if the observation is censored. This is slightly di erent from the summary below.
(c) Evaluate convergence and mixing of your sampler using the R package coda.
(d) Provide estimates of the mean, variance and a 95% probability interval for both and
(e) Are the recurrence times for the hormone group di erent from those of the control
group? How would you advise the drug company?
MAT4374/5182 Assignment 3
Survival analysis of the Breast Cancer data
Question 7.5 in Givens/Hoeting:
A clinical trial was conducted to determine whether a hormone treatment bene ts
women who were treated previously for breast cancer. Each subject entered the clin-
ical trial when she was diagnosed with cancer. She was then treated by irradiation
and assigned to either a hormone therapy group or a control group. The obser-
vation of interest is the time until a second recurrence, which may be assumed to
follow an exponential distribution with parameter (hormone group) or (control
group). Many of the women did not have a second recurrence before the clinical trial
concluded, so that their recurrence times are censored.
The rest of the question asks you to complete a Bayesian analysis of the dataset. The following
provides a little bit of background for completing the question, and I give you the full conditionals
for the Gibbs sampler.
The variable of interest is T i , the time to recurrence of breast cancer. Assume that these
times are exponentially distributed with a rate i that depends on whether the woman has been
treated with the hormone or not. That is
T i Exp( i)
where
i =
if treated with hormone
if a control :
Therefore, if there is a di erence between the treated and control groups, we would expect a
6= 1.
Let the number of women in the hormone group be nh and the number in the control group
be nc. Let n = nh+nc be the total sample size. Also, let the setHindex women in the hormone
group and the set C index women in the control group. If all recurrence times were observed, we
could write the likelihood as follows:
The challenge with this dataset is that some of the times are censored. For censored times,
we don’t know when the recurrence time is { all we know is that it is greater than the censoring
time (time when our observation of the patient stops).
Let the data be represented by (Ti; i) where Ti is the observed time and i is an indicator
for whether the observed time is a recurrence time or a censoring time. Let i = 1 if the time is
a recurrence time and 0 if it is a censoring time.
MAT4374/5182 Assignment 3
If an observation is censored at time ti all we know is that the unobserved recurrence time,
T i , is greater than ti. So, the likelihood (probability of the data) for an individual whose time
was censored, is:
P(T i ti) = 1 P(T i = 1 (1 exp( iti))= exp( iti)
and therefore the likelihood for a given individual is
Li =exp( ti) i = 0i exp( iti) i = 1= ii exp( iti)
Pulling everything together, we have
(The last line is the same as the likelihood given on page 232 of GH).
The data will be analysed using a Bayesian approach, which means that inference is based
on the posterior distribution of and . Recall (from class) that the posterior distribution is
proportional to the likelihood multiplied by the prior distribution. The prior distribution will be:
f( ; )/ a b exp( c d ) (2)
with (a;b;c;d) = (3;1;60;120) and the posterior distribution is eqn (1) multiplied by eqn (2):
We are interested in the posterior mean for and , so we use Gibbs sampling to estimate these
values. For Gibbs sampling, we need to nd the two full conditional distributions: f( j ;T; )
and f( j ;T; ).
Their parametric form. can be easily found by looking at the posterior distribution (eqn 3).
For f( j ;T; ), collect terms in eqn (3) that include . Terms that don’t involve are constant
in this conditional distribution and they can be ignored (part of normalizing constant).
MAT4374/5182 Assignment 3
Note that eqn (4), the full conditional for , has the form. of a gamma distribution with
parameters b + Pi2H i + 1 and d + Pi2Hti (inverse scale parameterization).Similarly, the
full conditional for is a gamma distribution with parameters a + Pni=1 i + 1 and c + d +
Pi2Hti +Pi2Cti.
These are the full conditionals to be used when programming the Gibbs sampler.