MATH6147 { R and Spreadsheet Modelling Coursework
This coursework will count for 70% of the assessment for MATH6147.
Your coursework should be submitted electronically via blackboard.
The deadline for the coursework is 3pm on Friday 17 November 2017. The time of
the electronic submission is taken from blackboard.
The deadline is strict. Any work handed in after this time will be subject to the following
penalties: 10% of your marks lost per working day up to 5 working days.
Remember to back up your work regularly. No additional time will be given for lost
computer les.
This work must be your own. Any references used must be cited and any help re-
ceived from others should be duly acknowledged. You are reminded that cheating and
plagiarism are treated very seriously by the Department of Mathematics and by the
University.
Electronic submissions
You are expected to submit one Excel le, one R script. and one Word (or pdf) document
electronically.
Please name your les using the nal eight digits of your student ID number, e.g.
MATH6147_12345678.xlsx. Do not include your name.
Your le should be uploaded to the MATH6147 Data Analytics Blackboard site.
Please also keep a copy of your Excel model and R script. in case there is a problem with
your electronic submission.
1
R and Excel Assignment: Estimating the
Duration of HIV Infection
1 General Restriction
Files with extension .xlsm will not be opened by the markers. (These are Excel les that
include macros.)
2 Background
Human immunode ciency virus (HIV) is a particularly deadly virus due to its high mortality
rate and long duration. Without treatment, the virus progresses through three main stages:
primary HIV infection, which covers the period from initial infection until a detectable an-
tibody response, and lasts 2{6 weeks; chronic HIV infection, during which the infectee may
have few symptoms of HIV infection, and has a mean duration of 10 years; and nally clin-
ical AIDS, when symptom{onset occurs and which lasts on average 10{12 months without
treatment.
The aim of this coursework is to perform. some analysis on the time between HIV infection
and death for untreated patients. This begins with an analysis of symptom onset{to{death
data. A dataset can be found in the .csv le on Blackboard.
Let’s assume that the event describing (a full blown) symptom{onset to death can be described
by a log{normal distribution. The probability density function (pdf) of the log{normal dis-
tribution is
f(x; ; ) = 1x p2 e (log(x) )
2
2 2 ; (1)
with support x2R++ and parameters 2R; 2R++.
If we assume that the observed data xi;i = 1;:::;n were generated using a log{normal
distribution, we can estimate and in two ways:
1. Use a numerical solver to nd the maximum likelihood estimate using the pdf of the
log{normal distribution given above in Equation (1).
2. Recognising its relationship with the normal distribution, use the following equations to
estimate its value:
^ =Pni=1 log(xi)n and ^ 2 =Pni=1(log(xi) ^ )2n : (2)
We are then able to incorporate these results into a full model of the time between HIV
infection and death. As described above, this consists of three periods, each with their own
probability distribution.
1. Primary HIV infection: Uniform. distribution with minimum 2 weeks, maximum 6 weeks.
2. Chronic HIV infection: Weibull distribution with shape parameter 2 and scale parameter
3. Symptom{onset to death: log{normal distribution with parameters as tted from the
data.
The uniform. distribution with minimum at a and maximum at b has the following pdf:
f(x) =1
(b a) a 0, where k is the shape parameter and is the scale parameter.
3 Tasks
The rst set of tasks are to describe the data given in the .csv le on Blackboard and nd
estimates of and using both Equations (1) and (2).
1. Produce an Excel worksheet that is carefully laid out to allow a user to easily navigate
their way around it. The spreadsheet should be designed to work with a new set of data
if this becomes available, where the new data may have a di erent number of entries.
It should include the following features (20 marks):
A table showing key summary statistics for the data;
One or more charts displaying the distribution of the data;
Estimates of and using Equation (1) and Equation (2) above.
2. Produce an R script. that does the following (15 marks):
Calculates key summary statistics for the data;
Outputs one or more charts displaying the distribution of the data;
Fits a log{normal distribution to the data in the .csv le.
In the second part of the work you need to simulate the total time from infection to death.
To do this, you will need to make use of the following results about how to sample log{normal
and Weibull random variables.
Assume that Z is sampled from a standard normal distribution with mean 0 and variance 1.
Then, X will be a sample from a log{normal distribution with parameters and if we set
X toX = e + Z:3
Assume that U is a sample from a uniform. distribution between 0 and 1 (this is what the
RAND() function outputs in Excel). Then, Y will be a sample from a Weibull distribution
with parameters k and if we set Y to
Y = [ ln(U)]1=k:
(Explanations for why these results are true will be covered in MATH6004 simulation)
3. Add extra features to the spreadsheet developed in part 1 to complete the following
tasks (25 marks).
Generation of a user{de ned number of random samples of the time between in-
fection and death from HIV, where this is equal to the time spent in primary HIV
+ the time spent in chronic HIV + the time spent between symptom-onset and
death;
Automatic calculation of summary statistics for the random sample;
Any charts that may be useful in explaining the output.
4 What You Should Submit
Spreadsheet model, as described above.
R script. that generates the visualisation. Graphics generated in this script. can but are
not required to be used in the report.
Report detailing your ndings and explaining any special steps you have taken that
are hard to present/explain elsewhere. The report should be concise and should not
exceed 5 pages (including graphics). It should be aimed at someone who understands
the problem and is familiar with R and Excel.
5 Marking
The assignment will be marked out of 100, of which 25 marks will be given for the report, 60
marks are allocated to the spreadsheet and R script, as indicated by brackets after each item
above. The remaining 15 marks are awarded for the following:Clarity;Presentation;
Initiative/creativity: these marks are for including functionality that goes beyond the
speci cation.