Homework 11
MATH 504
1. • Reading from Elements of Statistical Learning (available in Course Documents)
– Sec 5.1, 5.2 discuss spline regression and the general idea of using basis
functions to compute regressions. In lecture I roughly followed the notation and development of these sections.
• Reading from Sauer
– Sections 5.1.1, 5.1.2. Differentiation using finite difference approximations.
– Sections 5.2.1, 5.2.2, 5.2.3. Integration using the trapezoid rule and Simpson's rule. Newton-Cotes formulas generalize Riemann and
trapezoid integration by using higher-order interpolating polynomials.
• Here is a nice webpage on spline regression in R, see section 4.5 therein.
http://data.princeton.edu/R/linearModels.html
The main point is that the R function lm is used in conjunction with
an R function, e.g. bs, that selects a spline basis and computes
the associated model/design matrix. In other words, spline regression is
viewed as a linear regression, as we discussed in class, and this view is
made explicit by using lm.
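For example, a minimal sketch of such a fit (the data frame dat and the variables x and y are placeholders, and the knots and degree are illustrative choices, not values prescribed by the reading):

    library(splines)   # provides bs(), which builds the spline design matrix
    # bs() produces the model matrix of spline basis functions; lm() then runs
    # an ordinary linear regression on those columns.
    fit <- lm(y ~ bs(x, knots = c(15, 20), degree = 3), data = dat)
    summary(fit)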
2. The file BoneMassData.txt contains bone mass data for men and women at
a variety of ages. This data comes from the book "The Elements of Statistical
Learning". In this problem, we will apply a spline regression to the dataset
using predetermined knots. Specifically, our regression model will be y ∼ S(x),
where x is the age, y is the bone mass, and S(x) is a cubic spline. For brevity,
consider only the samples taken from women and for simplicity consider two
knots with ζ1 = 15 and ζ2 = 20 (the case of more knots is no different). Then
our splines are determined by the coefficients ai of the cubic polynomial pieces,
with the requirement that S(x), S'(x), and S''(x) be continuous at the two knots.
Our goal is to find the S(x) that minimizes the sum of squared residuals
∑i (yi − S(xi))^2, where the sum runs over the N data points (xi, yi).
(a) Write S(x) = α1 h1(x) + · · · + αD hD(x), where h1(x), . . . , hD(x) form a basis
for the space of cubic splines with these knots. Show that the spline S(x) that
minimizes the sum of squared residuals is given by the coefficient vector
α = (B^T B)^{-1} B^T y, where y is the vector of yi values and B is an N × D
matrix given by Bkℓ = hℓ(xk). (We did this in class, I want you to go through the
details.)
(b) Now show that D from (a) has value D = 6. (Again, we did this in class.)
(c) Show that the six functions h1(x) = 1, h2(x) = x, h3(x) = x^2, h4(x) = x^3,
h5(x) = (x − ζ1)^3_+, and h6(x) = (x − ζ2)^3_+ (where u^3_+ means u^3 for u ≥ 0
and 0 for u < 0) form a basis for all cubic splines with knots
at ζ = 15, 20. To do this you must show (i) each hi(x) is a spline for
the two knots, (ii) the hi(x) cannot be linearly combined to give the
zero function, and (iii) each spline must be a linear combination of these
functions. (Hint: show (i) and (ii) and then use the dimensionality of the
spline space to show (iii))
(d) Now compute the regression spline S(x) and plot it along with the data
points to show the fit. (Typically it is better to do the actual fitting
using a call to lm or using a spline regression package, but it’s good to
go through the process yourself at least once.)
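A possible starting point for part (d), sketched in R; the column names age, spnbmd, and gender are assumptions about BoneMassData.txt (they match the ESL version of the data set) and may need adjusting:

    # Read the data and keep only the female samples, per the problem statement.
    d <- read.table("BoneMassData.txt", header = TRUE)
    d <- d[d$gender == "female", ]
    x <- d$age
    y <- d$spnbmd          # assumed name of the bone mass column

    # Design matrix B from the truncated power basis of part (c), knots at 15 and 20.
    pos <- function(u) pmax(u, 0)
    B <- cbind(1, x, x^2, x^3, pos(x - 15)^3, pos(x - 20)^3)

    # Least squares coefficients alpha = (B^T B)^{-1} B^T y via the normal equations.
    alpha <- solve(t(B) %*% B, t(B) %*% y)

    # Plot the data and the fitted spline S(x).
    xg <- seq(min(x), max(x), length.out = 200)
    Bg <- cbind(1, xg, xg^2, xg^3, pos(xg - 15)^3, pos(xg - 20)^3)
    plot(x, y, xlab = "age", ylab = "bone mass")
    lines(xg, Bg %*% alpha, col = "blue", lwd = 2)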
3. In this problem the true value of the derivative being approximated is f'(0) = 1. Consider the two finite
difference approximations to f'(0) discussed in Sauer Sections 5.1.1 and 5.1.2.
For h = 10^i with i = −20, −19, −18, . . . , −1, 0, calculate both finite differences.
For each h, determine how many digits in the finite difference estimate are correct (you know the true value of the derivative is 1). Note that .99991 is correct
up to 4 digits since 0.999999... = 1. Explain your results in terms of finite difference
(truncation) error and floating point error. DON'T FORGET TO SET options(digits=16).
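A possible R sketch of this computation; since part of the problem statement was garbled above, f(x) = exp(x) is used here purely as an illustrative stand-in satisfying f'(0) = 1, and the two approximations shown are the forward and centered differences from Sauer:

    options(digits = 16)

    # Illustrative stand-in for f; any smooth f with f'(0) = 1 shows the same behavior.
    f <- function(x) exp(x)

    i <- -20:0
    h <- 10^i
    forward  <- (f(h) - f(0)) / h          # two-point forward difference
    centered <- (f(h) - f(-h)) / (2 * h)   # three-point centered difference

    # Compare to the true derivative f'(0) = 1.
    data.frame(h, forward, centered,
               err.forward = abs(forward - 1), err.centered = abs(centered - 1))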
4. F(x) is an important function (essentially the cdf of a normal random variable)
that has no analytic formula and must be evaluated numerically. Write a
function, Fapprox(n, method), that approximates F(∞) by numerical integration. If method is set to the character string "riemann" or "trapezoid", use a
Riemann sum and trapezoid rule method, respectively, with n grid points. If
method is set to "useR", use the R integrate function with subdivisions set
to n. (R’s integrate function uses an adaptive grid method. In these methods, the grid is made dense in regions where the function varies, see Sauer for
further details on such methods.) Since you cannot integrate to ∞, you must
pick some reasonable cutoff. We know that the true value is F(∞) = 1/2.
Consider approximating the integral using each of the three methods; try n = 10, 100, 1000, 10000 and compare the accuracy.
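A minimal sketch of Fapprox, under the assumption (the definition of F(x) is missing from this copy) that F(x) is the integral of the standard normal density from 0 to x, which is consistent with F(∞) = 1/2; the cutoff of 10 used in place of ∞ is an arbitrary but reasonable choice:

    Fapprox <- function(n, method) {
      cutoff <- 10                             # stand-in for infinity; dnorm is negligible beyond this
      x <- seq(0, cutoff, length.out = n)      # n grid points
      dx <- x[2] - x[1]
      if (method == "riemann") {
        sum(dnorm(x[-n]) * dx)                         # left-endpoint Riemann sum
      } else if (method == "trapezoid") {
        sum((dnorm(x[-n]) + dnorm(x[-1])) / 2 * dx)    # trapezoid rule
      } else if (method == "useR") {
        integrate(dnorm, 0, cutoff, subdivisions = n)$value
      }
    }

    # Compare accuracy against the true value F(Inf) = 1/2.
    for (n in c(10, 100, 1000, 10000))
      print(c(n, Fapprox(n, "riemann"), Fapprox(n, "trapezoid"), Fapprox(n, "useR")))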