Math 490: Statistics for Risk Modeling
Name:
February 6, 2018 Midterm 1
Deadline: February 20, 2018 Teacher
1. This is a take-home midterm.
2. All answers should be written using a LaTeX editor and submitted in Compass as a PDF
document.
3. Coding should be done using the statistical software R.
4. Clearly indicate the different questions.
5. Clearly explain your solutions. Any R code you use should be added to the report as an
appendix.
6. Upload a separate R script with the R code used in your report. Make sure we can run the
R code line by line to create the numbers, tables and figures in your report.
7. You may discuss the questions with your colleagues, but the report and the R code
should be your individual work. The report should contain sufficient elements which make
the report unique. Note that plainly copying R code, text or ideas without adding your own
interpretation is considered cheating and may lead to a reduced grade or even a zero.
8. GOOD LUCK!
© University of Illinois at Urbana-Champaign, Department of Mathematics
Question               Points   Max
Question 1                       10
Question 2                       15
Question 3                       15
Layout of the report              5
Writing style                    10
R coding                          5
Total                            60
1. Consider the data set containing observations $(x_i, y_i)$, $i = 1, 2, \ldots, n$. Prove the following
relation:
$$R^2 = r^2,$$
where $R^2$ is given by:
$$R^2 = \frac{\text{Regression SS}}{\text{Total SS}},$$
and $r$ is the Pearson correlation between $x$ and $y$.
Keep the following in mind when you make your solution:
avoid using only formulas or using long derivations.
Add su cient text in between your formulas to clarify what you are doing.
You can use formulas from the lecture slides which you need to use in your solution.
Make the solution as much as possible self-contained: start with the model and its as-
sumptions, refer the formulas we already proved in the lecture and then derive the result.
Grades will be rescaled to 10 (without rounding, without curving) and reported via Com-
pass.
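A numerical check can be a useful sanity test next to the written proof, although it does not replace it. The sketch below is only an illustration: the data are simulated (they are not course data), and the names `regression_ss`, `total_ss` and `r_squared` are chosen here for readability.

```r
# Hypothetical data: these values are simulated, not taken from the course.
set.seed(490)
n <- 50
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n)

# Simple linear regression of y on x (with intercept).
fit <- lm(y ~ x)

# Regression SS: variation of the fitted values around the mean of y.
regression_ss <- sum((fitted(fit) - mean(y))^2)
# Total SS: variation of the observations around the mean of y.
total_ss <- sum((y - mean(y))^2)

r_squared <- regression_ss / total_ss
r <- cor(x, y)  # Pearson correlation is the default method

# TRUE if both quantities agree up to floating-point tolerance.
all.equal(r_squared, r^2)
```

The same value is also reported by `summary(fit)$r.squared`, which gives a second point of comparison.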
2. In this question you have to use the data set DataQuestion2.csv. This data set contains 130
independent observations. Each observation represents a claim amount paid by a car insurer.
We assume that each claim is an independent realization from a random variable X.
(a) Determine the mean, standard deviation, minimum and maximum of the data set.
(b) Transform the claim sizes by taking the natural log of each claim size. Denote these log
claim sizes by $y$.
(c) Make a histogram and estimate the density of the log claim sizes using the empirical
density $\hat{f}$. You will have to decide how to choose the grid for the $y$ values. Explain your
choice.
(d) Use the function density to construct a smooth density function $f_s$ of the log claims. Make
a plot of $f_{\text{normal}}$, $\hat{f}$ and $f_s$. Here, $f_{\text{normal}}$ denotes the density of a normal distribution
with mean $\mu$ and standard deviation $\sigma$. Comment on your findings.
If you type ?density(), you find information about the function.
(e) Is it reasonable to assume that the log claims are normally distributed? That is:
$$\log X \overset{d}{=} N(\mu, \sigma^2),$$
with appropriate choices for $\mu$ and $\sigma$? Explain your answer. You can add additional tests
to defend or to reject the hypothesis that the data is normal.
(f) Describe the potential risks for the insurance company if a log normal distribution is
employed to describe future claims.
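As a rough illustration of how steps (a) to (e) might be organized in R, the following sketch runs on simulated claims. Everything specific in it is an assumption: the simulated data stand in for DataQuestion2.csv (whose column layout is not described here), the 12-bin grid is an arbitrary choice, and the Shapiro-Wilk test is just one of several possible normality checks.

```r
# Hypothetical stand-in for DataQuestion2.csv: 130 simulated claim amounts.
# With the real file, read.csv("DataQuestion2.csv") would be the starting
# point instead; its column names are not known here.
set.seed(2018)
claims <- rlnorm(130, meanlog = 7, sdlog = 1)

# (a) Summary statistics of the raw claims.
stats <- c(mean = mean(claims), sd = sd(claims),
           min = min(claims), max = max(claims))
stats

# (b) Log claim sizes.
y <- log(claims)

# (c) Empirical density: histogram on an explicit, equally spaced grid.
# The number of bins (12) is an arbitrary choice for illustration.
grid <- seq(min(y), max(y), length.out = 13)
h <- hist(y, breaks = grid, freq = FALSE,
          main = "Log claim sizes", xlab = "y")

# (d) Kernel density estimate and fitted normal density on the same plot.
fs <- density(y)
lines(fs, lty = 2)
curve(dnorm(x, mean = mean(y), sd = sd(y)), add = TRUE, lty = 3)
legend("topright", legend = c("kernel density", "normal"), lty = c(2, 3))

# (e) One possible formal check: the Shapiro-Wilk normality test.
sw <- shapiro.test(y)
sw$p.value
```

Because the simulated claims are log-normal by construction, the test will usually not reject normality here; with the real data the outcome may differ, and a normal Q-Q plot (`qqnorm(y); qqline(y)`) could be a useful visual complement.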
3. For this question you will use the data set DataQuestion3.csv. This data set contains the sales
of lottery tickets in different towns. The first column of the data set is the population size in
each town. The second column corresponds to the total sales in each town.
(a) Use a linear regression model to investigate the effect of the population on the sales.
Determine the parameters of the model and comment on the quality of the fit. Add
additional plots and tables to support your findings.
(b) Consider a new observation where $x = 10000$. Determine $\hat{y}$.
(c) The random variable $\hat{Y}$ is defined as:
$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\epsilon}.$$
Give an interpretation of the random variable $\hat{Y}$. What is the difference between $Y$ and
$\hat{Y}$?
(d) Use simulation¹ to determine a set of realizations from $\hat{Y}$. Determine the mean of $\hat{Y}$ and
construct a histogram. Compare this histogram with a histogram of the response variables
$y_i$ of the data set. Give an interpretation of the results.
(e) Determine a confidence bound for the predicted value $\hat{y}$.
(f) Make a residual plot and comment on your findings.
(g) Do you think using this regression model for estimating future sales is a good idea?
¹ You can use rnorm to simulate from a normal distribution.
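The regression workflow in parts (a) to (f) could be sketched in R roughly as follows. This is a minimal sketch under stated assumptions: the data are simulated as a stand-in for DataQuestion3.csv, the column names `population` and `sales` are made up here, the noise term is simulated as normal with the estimated residual standard error (ignoring the uncertainty in the estimated coefficients), and a 95% prediction interval is used as one possible reading of the bound asked for in (e).

```r
# Hypothetical stand-in for DataQuestion3.csv: simulated population sizes
# and ticket sales. The column names are assumptions.
set.seed(3)
lottery <- data.frame(population = round(runif(40, 1000, 50000)))
lottery$sales <- 500 + 0.8 * lottery$population + rnorm(40, sd = 3000)

# (a) Simple linear regression of sales on population.
fit <- lm(sales ~ population, data = lottery)
coef(fit)

# (b) Point prediction for a new observation with x = 10000.
new_town <- data.frame(population = 10000)
y_hat <- predict(fit, newdata = new_town)

# (d) Simulated realizations: point prediction plus normal noise whose
# standard deviation is the estimated residual standard error.
sigma_hat <- summary(fit)$sigma
y_sim <- y_hat + rnorm(10000, mean = 0, sd = sigma_hat)
mean(y_sim)
hist(y_sim, main = "Simulated values at x = 10000", xlab = "sales")

# (e) 95% prediction interval computed by predict().
pred_int <- predict(fit, newdata = new_town,
                    interval = "prediction", level = 0.95)
pred_int

# (f) Residuals against fitted values.
plot(fitted(fit), resid(fit), xlab = "fitted values", ylab = "residuals")
abline(h = 0, lty = 2)
```

Replacing `interval = "prediction"` with `interval = "confidence"` gives the narrower interval for the mean response, which is the other common interpretation of a bound on the predicted value.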