Ludwig-Maximilians-Universität München – Institute of Statistics
Take-home exam for the lecture Statistische Software (R)
Summer Term 2020
Notes:
1. The exam problems are provided in English. You are free to hand in verbal solutions in English or
German. However, you need to stay consistent within one problem solution. For language clarification,
use the Moodle forum.
2. The take-home exam features 7 problems. A total of 90 marks can be achieved by solving all 7 problems.
Additionally, you can receive up to 10 marks for compliance with the style guide of Hadley Wickham. All
problems need to be solved either using R code or with verbal solutions.
3. By submitting the exam you register for the exam at the same time.
4. Teamwork is prohibited. Your work must be your own. We expect you to attach a signed declaration of
originality to the exam. If there is any suspicion of cheating, we will report any associated person to
the examination board (Prüfungsamt für Statistik). This may ultimately result in failing the exam
and/or additional disciplinary measures.
5. The exam submission will take place via the Moodle page of the lecture. We expect you to hand
in a ZIP file containing the following:
- an .Rmd file with your solutions. Your solutions will consist of both text and code answers.
Use the template on the Moodle page.
- the compiled version (PDF) of your .Rmd
- all data sets used in the .Rmd to make sure that your code actually runs.
Please only supply external data sets that are not replicable with R code.
- a declaration of originality (PDF).
If one of the items is missing, the exam will be graded as failed. This also holds if the .Rmd cannot be compiled.
6. Enter your complete name and your student ID into the header of the supplied template. The template
only covers the first few problems. You need to continue it individually.
7. For your files, use the following naming scheme: statsoft_firstname_lastname.Rmd, statsoft_firstname_lastname.pdf,
statsoft_firstname_lastname_declaration.pdf. Name your ZIP
file statsoft_firstname_lastname.zip. The naming of the data sets needs to be consistent with your R
code. When reading an external data set (typically .csv), you have to assume that the data set is in
your current directory. This means that you need to read it via, e.g., read_csv("data.csv").
8. The exam period will be from 05.08.2020 to 02.09.2020.
9. Final deadline: 02.09.2020 - 18:00 CEST (Central European Summer Time). Any exams submitted
after the deadline will not be considered. This means they will also not be counted as an attempt.
10. You are only allowed to use the base packages of R unless the problem explicitly asks for other packages. The base
packages are:
(.packages())
## [1] "stats" "graphics" "grDevices" "utils" "datasets" "methods"
## [7] "base"
11. Make sure that in the PDF output all of your solutions are displayed (echo = TRUE).
12. Limit yourself to only providing the requested solution. If you supply unnecessary code and answers,
this may result in mark deduction.
13. Make sure that your code is easily correctable, precise, efficient, and easily readable and understandable.
14. Make use of comments in your code if necessary. Document your functions when requested. Be precise
and short in answer and code.
15. By submitting the exam you consent that we may electronically check your exam for plagiarism.
16. Follow the instructions on the Moodle page regarding what needs to be in your declaration of originality.
17. Questions on the exam are answered in the Forum only to ensure that all students receive the same
information. You find the Forum on top of the Moodle page of the lecture.
Problem 1 – Tidyverse (25P)
This problem is a straightforward data science problem solved by using R effectively. It is not too complex
but most likely something you will see over and over again when working with data. Use the tidyverse
package(s) for this task.
a) Download the following dataset from openml https://www.openml.org/d/187 as a .csv and read it
into R such that it is a tibble. Create a new column wine_type from the existing column class with
3 levels: Let 1 denote “red_wine”, 2 “white_wine” and 3 “other_wines”. Create boxplots using ggplot
of the alcohol content for the three different wine types using your new column wine_type. Rearrange
the data set so that wine_type is the first column. (3P)
b) You want to investigate a few key features of each wine type. Create a tibble with the name
wine_summary that contains the min and max alcohol content, the mean of Flavanoids, and the median
of color_intensity. Which wine type contains the wine with the maximum alcohol content? (1P)
c) Look for the three wines (one per wine type) that have the minimum alcohol content within their group in the original wine
tibble from a). Report on their magnesium and Total_phenols. Make sure that for each wine_type
you only select the row from the wine tibble that contains the wine with the minimum alcohol content
in its respective group. (2P)
d) Is the wine data tidy? Explain your answer. (1P)
e) From a different data source, you know that wines with flavanoids between 1 and 2 (both inclusive) are
from France and wines with flavanoids greater than 3 or smaller than 1 are from Germany. Create a
new column containing the regions. Create another column that contains 1 if the region is Germany
and 0 if the region is France. How many wines are from each region? How many NAs are there? Where
do most white wines come from? (3P)
f) Our information on the regions was insufficient as there are still NAs in the data. Fit a linear model on
the part of the wine data without NAs using the R function lm. As the dependent variable use one of
the region columns you have constructed in e) and as independent variables use Alcohol, Malic_acid,
and Ash. Use the linear model to predict the missing region for the other part of the wine data that
contains NAs.
Hints: Make sure that your dependent variable is numeric during fitting. Also make sure that the final
prediction fits the region variable (you may need some post processing after the model prediction). (5P)
If you did not succeed in predicting the missing values, you may use the RDS region_01.RDS on the Moodle
page for the subsequent problems.
g) Combine the part of the wine data on which you fitted the model in f) and the part of the wine data
with the predicted regions back into one wine dataset. For this, all columns need to be identical in
both tibbles, so make sure they are. Check if there are really no NAs left. (4P)
h) Now that we have a region for each wine in the dataset, we want to look at some summary statistics for
each region. Create a new data set that contains the mean and median observation of each numeric
column in the wine data for each region and wine_type. Name the new columns such that they have
the suffix "_mean" and "_median". (3P)
i) Create a scatter plot using ggplot of Color_intensity on the x-axis and Alcohol on the y-axis from
your wine data set from g). For each wine_type, choose a different color of points. Add a smooth line
including its confidence interval that fits all the data points in the plot. Color the smooth line in black.
Add another three smooth lines, one for each wine_type, and color them in the same color as the corresponding
points. Do not display the confidence interval for these lines. How are Alcohol and Color_intensity
related to each other in this plot for the three wine types? (3P)
Problem 2 – Basic R (11P)
In this problem, we want you to demonstrate that you understood how R operates in the backend. Make sure
to invest enough time to actually give a good answer using the material from this lecture.
a) Explain the following outputs. Focus your answer on how a single function can be so flexible w.r.t.
its input. Note: We only want a vague verbal explanation; technical details are far beyond what we
taught you. (1P)
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
summary(lm(Petal.Width ~., data = iris))
##
## Call:
## lm(formula = Petal.Width ~ ., data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.59239 -0.08288 -0.01349 0.08773 0.45239
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.47314 0.17659 -2.679 0.00824 **
## Sepal.Length -0.09293 0.04458 -2.084 0.03889 *
## Sepal.Width 0.24220 0.04776 5.072 1.20e-06 ***
## Petal.Length 0.24220 0.04884 4.959 1.97e-06 ***
## Speciesversicolor 0.64811 0.12314 5.263 5.04e-07 ***
## Speciesvirginica 1.04637 0.16548 6.323 3.03e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1666 on 144 degrees of freedom
## Multiple R-squared: 0.9538, Adjusted R-squared: 0.9522
## F-statistic: 594.9 on 5 and 144 DF, p-value: < 2.2e-16
b) Explain the following outputs. Note: For this answer, it matters which R version you have. We typically
expect you to have R >= 4.0.0. Thus, please indicate if you do not (e.g. via sessionInfo() at the
end of the document). (1P)
class(state.x77)
## [1] "matrix" "array"
typeof(state.x77)
## [1] "double"
class(as.data.frame(state.x77))
## [1] "data.frame"
typeof(as.data.frame(state.x77))
## [1] "list"
c) What do you find odd about the following outputs? Write code to make sure that the integer values
are identical to the factor values. (This means that the factors actually display ones and zeros.) (1P)
fct <- as.factor(c(0, 1, 0, 1, 1))
int <- as.integer(fct)
fct
## [1] 0 1 0 1 1
## Levels: 0 1
int
## [1] 1 2 1 2 2
d) Explain the following behaviour of R. Hint: infix. (1P)
`+`(3, 4)
## [1] 7
e) Based on d), write your own infix function which creates an integer sequence. The function should
take two inputs (from, to) and return an integer sequence. Name the function %to%. (2.5P)
3 %to% 11
## [1] 3 4 5 6 7 8 9 10 11
f) Define a positive semidefinite matrix in R. (1P)
g) Find an explanation for the following behaviour. (1P)
(0.1 + 0.2) == 0.3
## [1] FALSE
h) Compute the element-wise, matrix and cross product of the following matrices. Both X and Y should
be on the right-hand and left-hand side of the operation if meaningful. Also, invert X. (2.5P)
X <- matrix(c(3, 4, 0, 5, 6, 1, 0, -1, 7), nrow = 3)
Y <- matrix(c(7, 44, 0, -5, 16, -1, -4, -1, 0), nrow = 3)
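As background for a), the following is a minimal sketch of how one function name can behave differently depending on the class of its input. The generic `describe()` and its methods are hypothetical names made up for this illustration. They are not part of base R and do not reproduce what `summary()` does internally.

```r
# Hypothetical generic: UseMethod() picks a method based on the class of x.
describe <- function(x, ...) {
  UseMethod("describe")
}

# Fallback method, used when no class-specific method exists.
describe.default <- function(x, ...) {
  paste("object of class", class(x)[1])
}

# Method for numeric vectors.
describe.numeric <- function(x, ...) {
  paste("numeric vector with mean", mean(x))
}

# Method for data frames.
describe.data.frame <- function(x, ...) {
  paste("data frame with", nrow(x), "rows and", ncol(x), "columns")
}

describe(c(1, 2, 3))
describe(iris)
describe("text")
```

The three calls use the same function name, yet each one is routed to a different method.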
Problem 3 – Functions from your daily life (8P + 1P)
We want you to implement some functions that you may know from your daily life. If you need some
additional input, feel free to consult the internet. Please cite the URL which you used here. Wikipedia
will be accepted as a reference. We only expect you to write code that works for scalar input. If your code
accepts vectors (and of course works) you will receive an extra mark. That means, for example, if your
function computes your blood alcohol concentration, it must work for one individual, we however value if
you manage to return a vector of blood alcohol levels for multi-dimensional input (i.e. multiple persons).
Some of your functions will need only one input, some will need multiple ones. Make a short documentation
(description, arguments, output) of your function. (See a).) Cite the resource you used for the function
in the documentation. We are only interested in your resource if your solution is wrong. Then, however, we
will only accept freely accessible German or English resources. Give your functions proper names. Also, make
sure that variable names are somewhat meaningful. In general, we only expect you to supply reasonable
input. Thus, your functions do not need to perform input checking. That means it is okay if unreasonable
input results in unreasonable output – as long as your function is correct.
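The requirements above (short documentation, meaningful names, optional vector support) can be sketched on an unrelated toy conversion. The function name `km_to_miles` and its argument name are assumptions chosen for this illustration. The factor 1.609344 km per mile is the definition of the international mile.

```r
#' Distance conversion from kilometres to miles
#' This function converts kilometres to miles using 1 mile = 1.609344 km
#' (definition of the international mile).
#' Arguments:
#'   distance_km: distance in kilometres. A numeric vector.
#' Returns: a numeric vector indicating the distance in miles.
km_to_miles <- function(distance_km) {
  # Division is vectorised in R, so scalars and vectors both work.
  distance_km / 1.609344
}

km_to_miles(1.609344)
km_to_miles(c(0, 5, 42.195))
```

Because arithmetic operators in R are vectorised, the same function body handles one value or many without a loop.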
a) Temperature conversion: Write a function that converts degrees Fahrenheit into Celsius. Also, write a
function converting Celsius into Fahrenheit. (0.5P)
#' Temperature conversion from Fahrenheit to Celsius
#' This function converts degrees F to C using the formula provided here:
#' https://www.metric-conversions.org/temperature/fahrenheit-to-celsius.htm
#' Arguments:
#' temp: temperature in Fahrenheit. A numeric vector.
#' Returns: a numeric vector indicating the temperature in Celsius.
b) Body-mass-index: Write a function that computes the BMI. (0.5P)
c) Risk assessment: Write a function that computes the Sharpe ratio (a financial market metric). Hints:
Make sure that you cast percentages as floats. The risk-free rate is the same for all portfolios. (0.5P)
d) Energy conservation of kinetic energy: Write a function that predicts the speed of impact (either in
km/h or m/s) of a physical body when being dropped from any given height. Ignore frictions etc. just
like in high school physics. Hint: Energy is conserved! (0.5P)
e) Speeding fines: Write a function that computes your speeding fine (in Euros) with a car (PKW, no
trailer!) in Germany. Neglect the tolerance of 3kph which is typically applied. (This resource may be
helpful. We are only interested in the fine and not other penalties. Despite the current legal issues, use
the normal 2020 penalties. Reported speeds are always rounded.) (2.5P)
f) Tax in 2020: Write a function that computes the income tax to be paid by a single individual in
Germany in 2020. (This resource may be helpful. We are only interested in singles. Thus the zVe is
identical to the real income. Only rounded incomes are considered by the tax.) (1.5P)
g) Infant language: Write a function that adds syllables to an existing word (a character): Each vowel
should be replaced by an “ellu”, e.g. “alle” becomes “ellullellu”. You only need to consider lower case
input. (2P)
Problem 4 – Linear regression & simulation (23P)
A linear model is typically estimated by the least-squares method. Using least-squares the Normal equation
from which the estimator $\hat{\beta}$ for the parameter vector $\beta$ can be derived looks as follows:
$$X^T X \beta = X^T y$$
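To make the Normal equation concrete, here is a minimal sketch on the built-in `cars` data, not on the exam data. The object names `X_toy`, `y_toy`, `beta_hat` and `beta_lm` are assumptions for this illustration. The system is solved directly with `solve()` rather than by forming an explicit inverse.

```r
# Toy design matrix with an intercept column and one covariate.
X_toy <- cbind(intercept = 1, speed = cars$speed)
y_toy <- cars$dist

# crossprod(X_toy) is t(X_toy) %*% X_toy; crossprod(X_toy, y_toy) is t(X_toy) %*% y_toy.
# solve(A, b) returns the solution of A %*% beta = b.
beta_hat <- solve(crossprod(X_toy), crossprod(X_toy, y_toy))

# Coefficients from lm() on the same data, for comparison.
beta_lm <- coef(lm(dist ~ speed, data = cars))

all.equal(as.numeric(beta_hat), as.numeric(beta_lm))
```

Both routes give the same least-squares estimate up to numerical tolerance.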
You are interested in the association of children heights to other physical features. You observe the following
features:
• Bodyweight in KG
• Age in years
• The combined height of both biological parents in cm
• Sex (male / female)
We are going to work with simulated data in this problem. Note that the data is completely simulated and
hence there is no claim that the underlying relationships actually behave in the described manner. Make
yourself familiar with simulation functions in R such as runif() and sample().
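As a first orientation, the two simulation functions mentioned above might be used as in the following sketch. The quantities simulated here (waiting times, die rolls, weather labels) and all object names are made up for this illustration and are unrelated to the exam data.

```r
n_draws <- 10

# Continuous uniform draws between 0 and 60, e.g. waiting times in minutes.
waiting_time <- runif(n_draws, min = 0, max = 60)

# Discrete draws with replacement, e.g. rolls of a fair six-sided die.
die_roll <- sample(1:6, size = n_draws, replace = TRUE)

# Draws from a set of labels with unequal probabilities.
weather <- sample(c("sun", "rain"), size = n_draws, replace = TRUE,
                  prob = c(0.7, 0.3))

round(waiting_time, 1)
table(die_roll)
table(weather)
```

The drawn values differ from run to run unless the random number generator is put into a fixed state beforehand.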
a) Find a way to make sure that your results are reproducible. Which function in R should always be used
before simulating data (as long as results are supposed to be reproducible)? We expect the associated
code which makes your code reproducible as your answer below. Note: Only make use of the base
package. (0.5P)
b) We assume that there is (little or) no correlation between age, sex, and the combined parents’ heights.
Simulate 250 observations for each of the three covariates. Store them in separate vectors with proper
names. Make use of meaningful rounding. Assume that age is uniformly distributed in our sample. The
minimum age is – obviously – 0 and children are excluded on the day of their 15th birthday. Sex is
Bernoulli distributed with a 0.5 probability of being male. The combined height of both biological
parents is the sum of two independent normally distributed variables. The male part can be described
by a mean of 177 cm and a standard deviation of 10 cm. The female part can be described by a mean
of 164 cm and a standard deviation of 9 cm. We expect the code which simulates all three vectors as
your answer below. Make sure to write your code for the vectors so that it is easy to interpret and
understandable. (1.5P)
c) Summarise your data very briefly. Make use of one meaningful descriptive function for each vector.
(1P)
d) Of course, the bodyweight will likely depend – causally – on the child’s sex and the child’s age. There
will also be a correlation between the bodyweight of a child and its parents’ combined height. However,
this will most likely be because both are equally connected to the child’s height. Thus, we neglect this
correlation for now. Simulate the vector weight assuming that bodyweight is normally distributed.
The mean depends on the sex and age. For boys the mean is described by 9 + 6 * log(age + 1) +
(age > 5) * (-4 + 2.5 * age) + 2 * (age > 11) * sqrt(max(age - 5, 0)). For girls the mean
is described by 6 + (age <= 5) * 5 * log(age + 1) + (age > 5) * (-3 + 3 * age) + (age >=
11) * sqrt(age). The variance for both sexes is described by 2 * log(0.2 * m) where m is the
respective mean for each sex. Explain the three functions mathematically and intuitively (i) and
simulate the weight for all 250 observations (ii). (2P)
e) The height of the children can be deterministically described by a linear function. This function has
an intercept of -80 cm. For each additional cm of combined parents’ height, a child is on average 0.4
cm taller. Boys are on average 4 cm taller and children grow on average 9 cm per year. Next to this
deterministic function, the child’s height is subject to an additive random error which is normally
distributed with zero mean and a standard deviation of 7 cm. Simulate the variable height. (1P)
If you did not succeed in simulating, you can use the RDS objects height, weight, sex, and age supplied
on the Moodle page instead for the subsequent problems. You can also use these vectors for the previous
problems (e.g. e)) if you only failed in simulating some of the vectors.
f) Compute the correlation between height and weight. Also, compute a suitable measure of the
correlation between height and sex. (This might require some online search! Keyword: Point-biserial
correlation.) (4P)
g) Construct a data.frame from all vectors. Name it df. Explain the following output. (1P)
is.list(df)
## [1] TRUE
h) Construct a design matrix X from df. The design matrix is supposed to be used to model y, the
height. Name the matrix X. (1P)
i) Compute the right-hand side of the Normal Equation. Find a mathematical expression for $\beta$ when
solving for it. Also compute $\beta$ (or $\hat{\beta}$ to be precise). Explain how $\beta$ from this question is related to
question e). (3P)
j) Fit a linear model using df to explain height. Report your results and compare them to i). (1P)
k) Re-simulate your data set but now with 1000 observations. Again, fit a linear model and compare the
results to j). Hint: You are allowed to solve this problem together with l). (2P)
l) Use your work from k) to write a function that simulates this specific data situation. Your function
should have two inputs: n, the number of observations to be simulated, and seed, a seed for random
number generation. Your function should return a data frame with all simulated columns. Note: If
you did not succeed in simulating the data previously, we also accept pseudo-code. (2P)
m) A linear model typically models the mean of a normally distributed random variable ($\mu_i$). In this case,
we model a link which is equal to the resulting mean (identity / linear link). However, imagine that a
linear model is used to model count data or the mean of a random variable from a Poisson distribution.
The mean of this variable is equal to its rate parameter ($\lambda_i$). We can simulate data following a Poisson
distribution like before in this problem. Here we use a log-link instead of an identity link. Explain why
the code below throws a message and produces wrong results. Correct it so that it works. To check
whether you did the right thing, you can compare your simulated values with the histogram shown
below. Note: This problem can be solved independently from the previous ones. (3P)
set.seed(11)
beta0 <- 1
beta1 <- 2
beta2 <- -0.25
x1 <- runif(100, -1, 2)
x2 <- runif(100, 3, 7)
link <- beta0 + beta1 * x1 + beta2 * x2
y <- rpois(100, link)
## Warning in rpois(100, link): NAs produced
Figure: Histogram of y (x-axis: y; y-axis: Frequency)
Problem 5 – String wrangling (6P)
Regular expressions are often used in programming languages when working with text data. In R there are
two dominant ways to work with regex: The base and the stringr package. You are allowed to use both in
this problem. Before you start working on this problem, make yourself familiar with regular expressions.
On the Moodle page, there is a newspaper article (text.RDS) which we aim to prepare for the use in a
machine learning algorithm. So far, the article is just as it was scraped from the web page.
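As a warm-up for regular expressions in base R, the following sketch applies three common functions to a made-up string. The string `toy_text` and the patterns are assumptions for this illustration and are not tailored to the article on Moodle.

```r
toy_text <- "Order #123: 2 Apples & 3 Pears!"

# gsub() replaces every match of a pattern, here runs of digits.
no_digits <- gsub("[0-9]+", "", toy_text)

# grepl() tests whether a pattern occurs; fixed = TRUE matches it literally.
has_hash <- grepl("#", toy_text, fixed = TRUE)

# gregexpr() locates all matches and regmatches() extracts them,
# here runs of ASCII letters.
words <- regmatches(toy_text, gregexpr("[A-Za-z]+", toy_text))[[1]]

no_digits
has_hash
toupper(words)
```

The stringr package offers counterparts with a more uniform interface. This sketch stays with base R so that it runs without additional packages.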
a) The algorithm cannot deal with special characters except “.”, “,” and " ". Remove all other special
characters. Additionally, the text should be converted to lower cases only. (3P)
Your solution should look like this:
substr(text, 1, 50)
## [1] "der von der griechischen justiz verfolgte frühere "
b) Change the German Umlaute (e.g. ä) into the international equivalent (e.g. ae). (1P)
c) On Moodle, we also provide a vector (top50.Rds) with the 50 most frequently used words in German.
Remove all words occurring in this vector from the text. Only remove the words reported in the vector
and no variants. (As you may have seen the vector has more than 50 entries. We already added some
variations of the most frequent German words.) (2P)
Problem 6 – Gradient descent (11P)
You are hiking and you want to start your descent. You are hiking with your stats friends and one of them
mentions that descending a mountain can be seen – simplified – as if you follow the slope of a quadratic
function from your point into the valley.
Using base R/graphics one can visualise the description as follows:
grid <- seq(-4, 4, length.out = 100L)
altitude <- 0.3 * grid ^ 2
plot(grid, altitude, type = "l", xlim = c(-5, 5), ylim = c(0, 5),
xlab = "Distance in Km", ylab = "Altitude in Km")
points(grid[95], altitude[95], col = "blue")
Figure: Altitude in Km (y-axis) plotted against Distance in Km (x-axis)
You as a hiker are the blue dot. A smart strategy would now be to go down the hill by its slope step by step.
The local slope changes. The local slope of a function is described by its derivative.
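The statement that the local slope is given by the derivative can be checked numerically on an unrelated function. The sketch below compares a central finite-difference approximation with the known derivative of sin(x), which is cos(x). The helper `numeric_slope` and its step width `h` are assumptions for this illustration.

```r
# Central finite-difference approximation of the slope of f at x.
numeric_slope <- function(f, x, h = 1e-6) {
  (f(x + h) - f(x - h)) / (2 * h)
}

x_values <- c(0, pi / 4, pi / 2)

# Approximate slope versus the analytical derivative cos(x).
slope_numeric <- numeric_slope(sin, x_values)
slope_exact <- cos(x_values)

cbind(x = x_values, numeric = slope_numeric, exact = slope_exact)
```

The two columns agree closely, and the slope differs from point to point, which is the sense in which the local slope changes along the curve.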
a) Compute the local slope for all values of grid. Visualise the local slope using graphics. (1P)
b) Define the quadratic function as a function in R itself. It should have one input only. (1P)
c) Also, define the (analytical) derivative as a function in R. (1P)
d) Go down the hill by one step. Your step length is exactly one meter. The scale of grid is in kilometers,
though. Report the new location. Plot your new position in the previous graph. Hint: Use your
functions from b) and c). (2P)
e) This is a very slow descent. You decide to increase your step length to 1.2 m. However, this is still very
slow coding (if we have to update every step manually). You can also simulate your descent using a
for-loop. Go down the hill 180 steps with a step length of 1.2 m. Where are you now? Plot your new
position, too. (1P)
f) You are close to the valley! You feel very confident about your descending strategy. Thus, you do 1000
steps in the same direction. This means you do not adapt your slope at every step anymore. Where are
you now? Plot your new position, too. Explain what happened. (1P)
g) As your last idea failed, you go back to the previous descending strategy and do 4500 steps adapting
for your slope again at every single one. Again, plot your new location. Now it seems that you are in
the valley. (Right?) (1P)
h) Rewrite the descending strategy as a function using either the repeat function or a do-while loop. As
seen in g) you will not be able to arrive in y == 0 (exactly). Make sure that your loop stops when you
are in the valley with some tolerance. Name the function descent_hill. Your tolerance should refer
to the slope – a slope sufficiently close to zero indicates that you are in the valley (as you are basically
not descending anymore.) We suggest a tolerance of 0.01. Your function should return the location of
the valley (x and y coordinates). Your function should accept the following inputs:
• Start point.
• Step size.
• The function defining the hill (here the quadratic function).
• Slope / derivative of this function.
• Tolerance in the valley. (3P)
Problem 7 – Outlier detection in the linear model (6P)
Now, we use a variation of the code from the end of Problem 4 to model a normally distributed dependent
variable:
set.seed(11)
beta0 <- 1
beta1 <- 2
beta2 <- -0.25
x1 <- runif(100, -1, 2)
x2 <- runif(100, 3, 7)
y <- beta0 + beta1 * x1 + beta2 * x2 + rnorm(100, 0, 1)
hist(y)
Figure: Histogram of y (x-axis: y; y-axis: Frequency)
a) Create a data.frame with all independent and dependent variables. Fit a linear model using all
covariates explaining y. (0.5P)
We are interested in finding the most influential or outlying observations. There are many different ways to
do this. Today we focus on two specific ones.
b) Fit 100 linear models leaving out one distinct observation in each. For each model compute the relative
change in the adjusted R-squared compared to the baseline model estimated in a). Remodel the linear
model leaving out the five observations which decrease adjusted R-squared the most. (3P)
c) Alternatively, identify the five observations (out of all 100 observations) with the largest residuals. Refit
the model leaving these five observations out. (1.5P)
d) Argue which procedure worked better in the underlying case. (1P)