THE UNIVERSITY OF MELBOURNE
Centre for Actuarial Studies. Department of Economics
ACTL30004 Actuarial Statistics
Assignment, COVER SHEET
Due by 12:00 PM on Friday 25 October 2019. Submission via LMS.
This assignment contributes 10% of the total university assessment of this subject.
Please attach this cover sheet to the front of your submission. Include the name
and student ID number of every student in the group, and write your group number.
Declaration by Group
We declare that this assignment is our own work and does not involve plagiarism or
collusion. We understand that penalties will be imposed if the instructions accompanying
the assignment are not followed.
Student Number | Name in full | Signature | Group Number
Plagiarism and Collusion
Plagiarism is the presentation by a student of an assignment which has been copied in whole or
in part from another student's work, or from any other source without due acknowledgement.
Collusion is the presentation by a student of an assignment as his or her own which is the result,
in whole or in part, of unauthorised collaboration with another person or persons. Allowing your work
to be seen or used by other students outside of your group is also collusion, as is any form of
discussion, before submission, with any other student outside your group. A student who assists
another student outside their group in any way is also colluding.
ACTL30004 Actuarial Statistics
Assignment, 2019
Instructions:
1. Complete the ACTL30004 assignment cover sheet and include it in your submission.
2. Write your answers to the questions below. You must show full working in each
question.
3. Up to five marks can be deducted if your solutions are poorly presented.
4. You may include part of your spreadsheets and/or code in your submission. Submit
sufficient working so that the process you followed in each question is clear.
5. Your submission should be no longer than 15 pages, including the appendix and cover
sheet.
You are reminded that heavy penalties apply to students who plagiarise or collude. These
terms are defined on the assignment cover sheet.
Install the R package CASdatasets on your computer. This package provides a large
collection of actuarial datasets, originally assembled for the book “Computational
Actuarial Science with R” edited by Arthur Charpentier (CAS with R). It can be
downloaded from the website:
http://dutangc.free.fr/pub/RRepos/web/CASdatasets-index.html.
Third-party insurance is compulsory for vehicle owners in Australia. It insures
vehicle owners against injury caused to other drivers, passengers or pedestrians as a result
of an accident. Download the dataset ausprivauto0405. This dataset is based on one-year
vehicle insurance policies taken out in 2004 or 2005. There are 67,856 policies, of
which 4,624 (6.8%) had at least one claim. First, consider the variable ClaimNb,
which is the number of claims made in this period.
(a) Fit Poisson, negative binomial and zero-inflated Poisson distributions to the
empirical distribution of this discrete variable. Select the best-fitting model based on
at least two model selection criteria.
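As a minimal illustration of the model selection step in (a), the Python sketch below fits a Poisson distribution by maximum likelihood and computes two common criteria, AIC and BIC. The toy counts and the function names are hypothetical and are not taken from ausprivauto0405. The same criteria would apply to the negative binomial and zero-inflated Poisson fits, each with its own log-likelihood and parameter count.

```python
import math

def poisson_loglik(counts, lam):
    """Log-likelihood of i.i.d. Poisson(lam) observations."""
    return sum(y * math.log(lam) - lam - math.lgamma(y + 1) for y in counts)

def aic(loglik, k):
    """Akaike information criterion for a model with k parameters."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """Bayesian information criterion for k parameters and n observations."""
    return k * math.log(n) - 2 * loglik

# Hypothetical toy counts (not the assignment data).
counts = [0] * 90 + [1] * 8 + [2] * 2

# The Poisson maximum likelihood estimate is the sample mean.
lam_hat = sum(counts) / len(counts)
loglik = poisson_loglik(counts, lam_hat)

print("lambda_hat:", lam_hat)
print("log-likelihood:", loglik)
print("AIC:", aic(loglik, k=1))
print("BIC:", bic(loglik, k=1, n=len(counts)))
```

Smaller AIC or BIC values indicate a better trade-off between fit and number of parameters.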
Now consider the following explanatory variables. The variable vehValue represents the
vehicle value in $10,000s. In addition, you must create an Intercept column and two
indicator variables: the age of the vehicle, vehAge (1 = old cars or oldest cars, 0 =
otherwise), and the age of the driver, drivAge (1 = old people, older work. people and
oldest people, 0 = otherwise). Exposure refers to the time each policyholder was
exposed to risk during this period. Detailed information about this dataset can be
found in the book:
found in the book:
• De Jong, Piet, and Gillian Z. Heller. 2008. Generalized Linear Models for Insurance Data.
International Series on Actuarial Science. Cambridge: Cambridge University Press.
(b) Justify the use of the logarithmic link function and fit a Poisson GLM to this
dataset to explain the number of claims in terms of the covariates. Include an
intercept, a linear term for the vehicle value, and indicators for the age of the vehicle
and the age of the driver. Also include the exposure term in the linear predictor. Use
suitable starting values. Comment on the parameter estimates. Is the use of the
Poisson GLM justified for this dataset?
(c) For the negative binomial model, show that Var(Y) = rβ(1 + β).
(d) Writing µ = E(Y), find a new parametrization of the pmf of the negative
binomial distribution of the form f(y | r, µ). Show that this new parametrization can
be expressed in exponential family form. Interpret the new parameters.
(e) By choosing appropriate initial values, use the Fisher-Scoring algorithm to fit a
negative binomial GLM in the form found in part (d) to this set of data. Include
an intercept, linear term for the vehicle value, indicator for the age of the vehicle
and age of the driver. Also include the exposure term in the linear predictor. Use
the same link function as in part (b). You should provide the maximum likelihood
estimates and their corresponding standard errors for each iteration. Also give the
maximum value of the log-likelihood function. Stop iterating when the absolute
value of each component of the score vector is smaller than 1×10^(-10). Write out the
fitted model, showing the estimated regression coefficients. Calculate the
variance–covariance matrix associated with the estimates.
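The iteration described in (e) can be sketched in Python as follows. This is an illustration under strong simplifying assumptions, not the required solution: the parameter r is held fixed (a full fit would estimate it as well), the mean is taken as µ_i = ξ_i exp(x_i'β) with variance µ_i + µ_i²/r, and the toy data, starting values and function names are all hypothetical.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda k: abs(m[k][col]))
        m[col], m[piv] = m[piv], m[col]
        for k in range(col + 1, n):
            f = m[k][col] / m[col][col]
            for c in range(col, n + 1):
                m[k][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def score_and_info(beta, y, X, exposure, r):
    """Score vector and expected (Fisher) information for beta, with r fixed."""
    p = len(beta)
    score = [0.0] * p
    info = [[0.0] * p for _ in range(p)]
    for yi, xi, ei in zip(y, X, exposure):
        mu = ei * math.exp(sum(b * v for b, v in zip(beta, xi)))
        u = (yi - mu) / (1.0 + mu / r)   # score weight under the log link
        w = mu / (1.0 + mu / r)          # Fisher information weight
        for j in range(p):
            score[j] += u * xi[j]
            for k in range(p):
                info[j][k] += w * xi[j] * xi[k]
    return score, info

def fisher_scoring(y, X, exposure, r, beta0, tol=1e-10, max_iter=200):
    """Iterate beta <- beta + I(beta)^(-1) U(beta) until every |U_j| < tol."""
    beta = list(beta0)
    for iteration in range(max_iter):
        score, info = score_and_info(beta, y, X, exposure, r)
        if max(abs(s) for s in score) < tol:
            return beta, score, info, iteration
        step = solve(info, score)
        beta = [b + s for b, s in zip(beta, step)]
    raise RuntimeError("Fisher scoring did not converge")

# Hypothetical toy data: intercept plus one covariate.
y = [0, 1, 0, 2, 0, 1, 0, 3]
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0],
     [1.0, 2.5], [1.0, 3.0], [1.0, 3.5], [1.0, 4.0]]
exposure = [0.5, 1.0, 0.8, 1.0, 0.3, 0.9, 0.6, 1.0]
r = 5.0                                            # assumed known here
beta0 = [math.log(sum(y) / sum(exposure)), 0.0]    # crude starting values

beta_hat, score, info, n_iter = fisher_scoring(y, X, exposure, r, beta0)

# Variance-covariance matrix: inverse of the Fisher information at beta_hat.
p = len(beta_hat)
cov = [solve(info, [1.0 if i == j else 0.0 for i in range(p)]) for j in range(p)]
std_errors = [math.sqrt(cov[j][j]) for j in range(p)]
print(beta_hat, std_errors, n_iter)
```

The stopping rule mirrors the one in the question: iteration ends once every component of the score vector is below the tolerance in absolute value.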
(f) Using the estimates derived in (e), test the statistical significance of adding an
indicator variable for the age of the driver when an intercept, a linear term for the
vehicle value and an indicator for the age of the vehicle are already included in the model.
Conduct the test at the 5% significance level.
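One way to test a single coefficient from its estimate and standard error is a Wald test; the sketch below shows the arithmetic in Python. The choice of a Wald test is an assumption (a likelihood ratio test is another option), and the numerical inputs are hypothetical placeholders rather than results for this dataset.

```python
from statistics import NormalDist

def wald_test(estimate, std_error, alpha=0.05):
    """Two-sided Wald test of H0: coefficient = 0 against H1: coefficient != 0."""
    z = estimate / std_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha

# Hypothetical estimate and standard error for the driver-age indicator.
z, p_value, reject = wald_test(estimate=0.20, std_error=0.05)
print("z =", z, "p-value =", p_value, "reject H0:", reject)
```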
(g) Compare the fit of the Poisson GLM and negative binomial GLM in terms of two
measures of model selection.
(h) Derive the expressions for the Pearson residuals and deviance residuals for both models.
Use graphical diagnostic tools to assess the accuracy of the fit of both models in
terms of these residuals (i.e. Quantile-Quantile (QQ) plots). Comment on
these graphs.
Pearson and deviance residuals are far from normal when the response variable is
discrete and includes a high number of zero responses, and they fail to provide useful
information about the inadequacy of the model. For that reason, we consider the randomized
quantile residuals as defined in:
Dunn, P. and Smyth, G. (1996). Randomized quantile residuals. Journal of Computational and Graphical Statistics, 5(3):236–244.
(i) Give the analytical expression of the ith randomized quantile residual for the negative binomial GLM.
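For a discrete response, Dunn and Smyth define the randomized quantile residual by drawing u uniformly between F(y - 1) and F(y) and applying the standard normal quantile function. The Python sketch below shows this mechanism for a negative binomial distribution written in terms of r and its mean µ. The parametrization used in the code, the parameter values and the function names are assumptions made for illustration only.

```python
import math
import random
from statistics import NormalDist

def nb_pmf(y, r, mu):
    """Negative binomial pmf with size r and mean mu (assumed parametrization)."""
    log_p = (math.lgamma(r + y) - math.lgamma(r) - math.lgamma(y + 1)
             + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))
    return math.exp(log_p)

def nb_cdf(y, r, mu):
    """Negative binomial cdf, computed by summing the pmf."""
    if y < 0:
        return 0.0
    return min(1.0, sum(nb_pmf(k, r, mu) for k in range(y + 1)))

def randomized_quantile_residual(y, r, mu, rng):
    """Draw u ~ Uniform(F(y - 1), F(y)) and map it through the normal quantile."""
    lower = nb_cdf(y - 1, r, mu)
    upper = nb_cdf(y, r, mu)
    u = rng.uniform(lower, upper)
    u = min(max(u, 1e-12), 1.0 - 1e-12)   # keep u strictly inside (0, 1)
    return NormalDist().inv_cdf(u)

# Hypothetical fitted values for three policies.
rng = random.Random(1)
observed = [0, 1, 0]
fitted_mu = [0.10, 0.15, 0.05]
residuals = [randomized_quantile_residual(y, 1.5, mu, rng)
             for y, mu in zip(observed, fitted_mu)]
print(residuals)
```

If the model is correct, these residuals are standard normal apart from sampling variability in the estimated parameters, which is what makes a normal QQ plot informative.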
Now demonstrate the ability of the negative binomial GLM to predict the number
of claims out-of-sample using QQ plots. The dataset with original response variables
y_1, y_2, ..., y_{n-1}, y_n must be partitioned into two halves, i.e.
A = {(y_1, ξ_1, x_{1,0}, x_{1,1}, x_{1,2}, x_{1,3}), (y_3, ξ_3, x_{3,0}, x_{3,1}, x_{3,2}, x_{3,3}), ..., (y_{n-1}, ξ_{n-1}, x_{n-1,0}, x_{n-1,1}, x_{n-1,2}, x_{n-1,3})}
and
B = {(y_2, ξ_2, x_{2,0}, x_{2,1}, x_{2,2}, x_{2,3}), (y_4, ξ_4, x_{4,0}, x_{4,1}, x_{4,2}, x_{4,3}), ..., (y_n, ξ_n, x_{n,0}, x_{n,1}, x_{n,2}, x_{n,3})};
here n = 67856, y_i is the response variable ClaimNb, and ξ_i, x_{i,0}, x_{i,1}, x_{i,2}
and x_{i,3}, for i = 1, ..., n, are the values associated with Exposure, Intercept,
vehValue, vehAge and drivAge respectively in the design matrix.
(j) Use dataset A for fitting the models (training dataset) and then use
dataset B (validation dataset) for graphing the QQ plots of the randomized quantile
residuals for the negative binomial GLM. Also sketch QQ plots of these residuals
for the complete dataset. Comment on these graphs.
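The partition into A (observations 1, 3, 5, ...) and B (observations 2, 4, 6, ...) amounts to taking alternate rows. A minimal Python sketch, using a hypothetical list of row labels in place of the real records:

```python
def split_odd_even(rows):
    """Split rows into A (1st, 3rd, 5th, ...) and B (2nd, 4th, 6th, ...)."""
    set_a = rows[0::2]   # 1-based odd positions
    set_b = rows[1::2]   # 1-based even positions
    return set_a, set_b

# Hypothetical stand-in for the dataset rows, labelled by 1-based index.
rows = list(range(1, 11))
set_a, set_b = split_odd_even(rows)
print(set_a)
print(set_b)
```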
The variable ClaimAmount contains the sum of the claim payments for each policyholder
(0 if no claim). In the following, we only consider claim amounts larger than zero.
(k) Plot the histogram of the empirical distribution of ClaimAmount for claim
sizes less than $50,000. Then fit the lognormal and inverse Gaussian distributions
to this dataset by the method of maximum likelihood. Superimpose
the graphs of their densities on the histogram of the empirical distribution.
Comment on the application of the likelihood ratio test to these distributions.
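Both distributions in (k) have closed-form maximum likelihood estimates, sketched in Python below. The inverse Gaussian is assumed to be parametrized by its mean µ and shape λ, with density sqrt(λ / (2πx³)) exp(-λ(x - µ)² / (2µ²x)); the function names and the toy claim amounts are hypothetical.

```python
import math

def fit_lognormal(claims):
    """MLE of (mu, sigma) for a lognormal sample: moments of the log-claims."""
    logs = [math.log(x) for x in claims]
    n = len(logs)
    mu = sum(logs) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / n)
    return mu, sigma

def fit_inverse_gaussian(claims):
    """MLE of (mean, shape) for an inverse Gaussian sample."""
    n = len(claims)
    mean = sum(claims) / n
    lam = n / sum(1.0 / x - 1.0 / mean for x in claims)
    return mean, lam

def inverse_gaussian_pdf(x, mean, lam):
    """Inverse Gaussian density under the assumed (mean, shape) parametrization."""
    return (math.sqrt(lam / (2.0 * math.pi * x ** 3))
            * math.exp(-lam * (x - mean) ** 2 / (2.0 * mean ** 2 * x)))

# Hypothetical positive claim amounts.
claims = [250.0, 400.0, 1200.0, 3100.0, 800.0, 560.0]
print(fit_lognormal(claims))
print(fit_inverse_gaussian(claims))
```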
(l) The Value–at–Risk (VaR) is a standard risk measure that is used to calculate
exposure to risk. In general, the VaR is the amount of capital required to ensure
that the company does not become technically insolvent. The VaR of a random
variable X at security level p is the 100p-th percentile of the distribution of X.
Calculate the VaR at the 90%, 95% and 99% security levels. Compare the models.
Comment on the results.
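For the lognormal model the percentile has a closed form, VaR_p = exp(µ + σ z_p), where z_p is the standard normal quantile. The Python sketch below evaluates it at the three security levels. The parameter values are hypothetical placeholders, not fitted estimates; the inverse Gaussian quantile has no closed form and would need numerical inversion of its cdf.

```python
import math
from statistics import NormalDist

def lognormal_var(p, mu, sigma):
    """100p-th percentile of a lognormal(mu, sigma) distribution."""
    return math.exp(mu + sigma * NormalDist().inv_cdf(p))

# Hypothetical lognormal parameters.
mu, sigma = 7.0, 1.2
levels = [0.90, 0.95, 0.99]
var_values = [lognormal_var(p, mu, sigma) for p in levels]
for p, v in zip(levels, var_values):
    print(p, v)
```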
(m) The Kolmogorov-Smirnov (K-S) test is useful in testing the null hypothesis H0
that a sample x comes from a probability distribution function F(x). The K-S
test rejects the null hypothesis if the maximum absolute difference between F(x)
and the empirical cumulative distribution function F̂_n(x) is large. Assume that
the parameters for each model are specified by the maximum likelihood estimates.
For each continuous model, state the null and alternative hypotheses. State and
calculate the value of the test statistic using the analytical expression for the test
statistic (i.e. do not use built-in functions), calculate the p-value and give the
conclusion of the test.
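Because the empirical cdf is a step function, the maximum absolute difference is attained at an order statistic, so D_n = max over i of max(i/n - F(x_(i)), F(x_(i)) - (i - 1)/n). A short Python sketch of this calculation follows; the function name and the small uniform example are hypothetical.

```python
def ks_statistic(sample, cdf):
    """K-S statistic: largest gap between the empirical cdf and cdf."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # Check the gap just after and just before the jump at x.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

# Hypothetical check against the Uniform(0, 1) cdf.
uniform_cdf = lambda x: min(max(x, 0.0), 1.0)
d_n = ks_statistic([0.7, 0.1, 0.4], uniform_cdf)
print(d_n)
```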
(n) The Anderson-Darling (A-D) test is a modification of the K-S test. Comment on the
advantages of the A-D test over the K-S test. Assume that the parameters for each model
are again specified by the maximum likelihood estimates. For each continuous model,
state the null and alternative hypotheses. State and calculate the value of the test
statistic using the analytical expression for the test statistic (i.e. do not use built-in
functions), calculate the p-value [1] and give the conclusion of the test.
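The usual computing formula for the A-D statistic is A² = -n - (1/n) Σ (2i - 1) [ln F(x_(i)) + ln(1 - F(x_(n+1-i)))], summed over i = 1, ..., n. The Python sketch below implements that formula; the function name and the small uniform example are hypothetical.

```python
import math

def ad_statistic(sample, cdf):
    """Anderson-Darling statistic A^2 for a fully specified cdf."""
    xs = sorted(sample)
    n = len(xs)
    total = 0.0
    for i in range(1, n + 1):
        f_low = cdf(xs[i - 1])    # F at the i-th order statistic
        f_high = cdf(xs[n - i])   # F at the (n + 1 - i)-th order statistic
        total += (2 * i - 1) * (math.log(f_low) + math.log(1.0 - f_high))
    return -n - total / n

# Hypothetical check against the Uniform(0, 1) cdf.
uniform_cdf = lambda x: x
a2 = ad_statistic([0.7, 0.1, 0.4], uniform_cdf)
print(a2)
```

The logarithms weight discrepancies in both tails more heavily than the K-S statistic does.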
[1] To calculate the p-value, use Monte Carlo simulation under the null hypothesis. One
simulation involves first simulating 4,623 observations (i.e. the sample size) from each
model and calculating the K-S test statistic. Then use 10,000 simulations to estimate the
p-value. The estimated p-value is the proportion of simulations for which the simulated
test statistic exceeds the observed K-S test statistic.
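The Monte Carlo procedure in the footnote can be sketched in Python as below, shown for the K-S statistic and a lognormal null model. The parameter values, the observed statistic, the function names and the reduced sample and simulation sizes are hypothetical choices that keep the example fast; the footnote itself specifies 4,623 observations and 10,000 simulations.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def ks_statistic(sample, cdf):
    """K-S statistic evaluated at the order statistics."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

def monte_carlo_p_value(observed_stat, simulate, cdf, n_obs, n_sims, rng):
    """Share of simulated statistics that exceed the observed statistic."""
    exceed = 0
    for _ in range(n_sims):
        sample = [simulate(rng) for _ in range(n_obs)]
        if ks_statistic(sample, cdf) > observed_stat:
            exceed += 1
    return exceed / n_sims

# Hypothetical lognormal null model with parameters treated as known.
mu, sigma = 7.0, 1.2
lognormal_cdf = lambda x: STD_NORMAL.cdf((math.log(x) - mu) / sigma)
simulate_claim = lambda rng: rng.lognormvariate(mu, sigma)

rng = random.Random(2019)
p_value = monte_carlo_p_value(0.05, simulate_claim, lognormal_cdf,
                              n_obs=200, n_sims=500, rng=rng)
print(p_value)
```

The same function works for the A-D test if the statistic computed inside the loop is swapped accordingly.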