Homework 10

1. • Required: View https://www.youtube.com/watch?v=k3AiUhwHQ28. This is a lecture by Suvrit Sra, who is guest lecturing in Gil Strang’s MIT course on computational methods in machine learning. View the whole lecture, as it connects to many themes we have been discussing.

• Optional: Sections 3.1 and 3.2 of Sauer discuss polynomial interpolation. Section 3.4 discusses spline interpolation.

2. This problem provides an example of how interpolation can be used. The attached spreadsheet provides life expectancy data for the US population. The second column gives the probability of death within one year at the given age. So, for example, the probability that a person who has reached age 20 dies before reaching age 21 is 0.000894.

Suppose a 40 year old decides to buy life insurance and will make payments of $200 every month until death. In this problem we will consider the worth of these payments, a quantity of interest to the insurance company. The payoff upon death will not be considered. If we assume a (continuous time) interest rate of 5% and let m be the number of months past age 40 that the person lives, then the present value of the payments (how much future payments are worth in today’s dollars) is

$$PV = \sum_{i=1}^{m} 200\, e^{-0.05\, i/12}.$$

Our goal is to determine the average of PV, in other words E[PV]. For the insurance company, this is one way to measure the revenue brought in by the policy. The difficulty is that our data is yearly, while payments are made monthly and people do not always die at the beginning of the month.
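As a quick numerical check of the present value idea, the sketch below sums discounted payments for a fixed lifetime of m months. It is written in Python rather than R purely for illustration, and it assumes that payment i is made i months after age 40 and is discounted by exp(-0.05 i / 12). The function name present_value and its parameters are illustrative and not part of the assignment.

```python
import math

def present_value(m, payment=200.0, rate=0.05):
    """Present value of m monthly payments under continuous discounting.

    Assumption: payment i is made i months after the start (i = 1, ..., m),
    so it is discounted by exp(-rate * i / 12).
    """
    return sum(payment * math.exp(-rate * i / 12) for i in range(1, m + 1))

# One year of payments; every term is discounted, so the total is below 2400.
pv_one_year = present_value(12)
```

Because every payment is discounted, present_value(12) comes out somewhat below 12 × 200 = 2400.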

(a) Let L(t) be the probability that the 40 year old lives past age 40 + t, where t is any nonnegative real number. Estimate L(t) by first considering t = 0, 1, 2, .... These values of L(t) can be computed using the spreadsheet data. (For example, for the 40 year old to live to 42, they must not die between the ages of 40 and 41 or between the ages of 41 and 42.) For other values of t, interpolate using a cubic spline. In R you can use the spline and splinefun commands to construct cubic splines; see the help documentation. Graph the interpolating cubic spline of L(t) and include the data points, i.e. L(t) for t = 0, 1, ....
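The computation in (a) can be sketched as follows: multiply yearly survival probabilities to get L(t) at integer t, then interpolate. The sketch is in Python rather than R and builds a natural cubic spline by hand. That boundary condition is an assumption (R's splinefun defaults to method "fmm", so its curve can differ slightly near the ends). The death probabilities in q_example are made-up placeholders, not values from the spreadsheet, and all function names are illustrative.

```python
from bisect import bisect_right

def survival_at_integers(q):
    """L(0), ..., L(len(q)) from yearly death probabilities q[j] for age 40 + j."""
    L = [1.0]
    for qj in q:
        L.append(L[-1] * (1.0 - qj))  # must also survive year j
    return L

def natural_cubic_spline(xs, ys):
    """Return a function evaluating the natural cubic spline through (xs, ys)."""
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    M = [0.0] * (n + 1)  # second derivatives at the nodes; M[0] = M[n] = 0
    if n >= 2:
        diag = [2.0 * (h[i - 1] + h[i]) for i in range(1, n)]
        rhs = [6.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
               for i in range(1, n)]
        # Forward elimination on the symmetric tridiagonal system.
        for k in range(1, n - 1):
            w = h[k] / diag[k - 1]
            diag[k] -= w * h[k]
            rhs[k] -= w * rhs[k - 1]
        # Back substitution.
        M[n - 1] = rhs[-1] / diag[-1]
        for k in range(n - 3, -1, -1):
            M[k + 1] = (rhs[k] - h[k + 1] * M[k + 2]) / diag[k]

    def S(x):
        # Interval [xs[i], xs[i+1]] containing x, clamped to the end intervals.
        i = min(max(bisect_right(xs, x) - 1, 0), n - 1)
        a, b = xs[i + 1] - x, x - xs[i]
        return ((M[i] * a ** 3 + M[i + 1] * b ** 3) / (6.0 * h[i])
                + (ys[i] / h[i] - M[i] * h[i] / 6.0) * a
                + (ys[i + 1] / h[i] - M[i + 1] * h[i] / 6.0) * b)

    return S

# Hypothetical yearly death probabilities for ages 40, 41, ... (placeholders only).
q_example = [0.0020, 0.0022, 0.0024, 0.0027, 0.0030]
L_nodes = survival_at_integers(q_example)               # L(0), ..., L(5)
L = natural_cubic_spline(list(range(len(L_nodes))), L_nodes)
L_mid = L(2.5)  # interpolated survival probability at age 42.5
```

With the real spreadsheet column in place of q_example, plotting L on a fine grid together with the points (t, L_nodes[t]) gives the graph requested in (a).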

(b) Explain why the expected (average) present value of the payments is given by

$$E[PV] = \sum_{i=1}^{\infty} 200\, L(i/12)\, e^{-0.05\, i/12}.$$

In practice we cannot sum to ∞, so choose an appropriate cutoff and calculate E[PV].
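One way to organize the truncated sum is sketched below, in Python rather than R. It assumes the expected present value is a sum over months i of 200 L(i/12) exp(-0.05 i/12), and it uses toy_L, a made-up exponential survival curve, as a stand-in for the spline of L(t). The helper tail_bound is also hypothetical: since L(t) ≤ 1, the discarded terms are bounded by a geometric series, which gives one way to judge whether a cutoff is large enough.

```python
import math

def expected_present_value(L, cutoff_months, payment=200.0, rate=0.05):
    """Truncated sum over i = 1..cutoff of payment * L(i/12) * exp(-rate * i/12)."""
    return sum(payment * L(i / 12) * math.exp(-rate * i / 12)
               for i in range(1, cutoff_months + 1))

def tail_bound(cutoff_months, payment=200.0, rate=0.05):
    """Upper bound on the discarded terms, using only L(t) <= 1."""
    r = math.exp(-rate / 12)
    return payment * r ** (cutoff_months + 1) / (1.0 - r)

def toy_L(t):
    # Hypothetical survival curve (constant 2% yearly hazard), not the real data.
    return math.exp(-0.02 * t)

epv_80_years = expected_present_value(toy_L, 80 * 12)
```

The bound from tail_bound ignores mortality entirely, so it is conservative. With real data, L(t) is essentially zero beyond the last age in the table, which suggests a natural cutoff.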

3. Consider the MNIST dataset from homework 6. Recall that in that homework we used a logistic regression in 784 dimensions to build a classifier for the number 3. Here, we will use PCA to visualize the dataset and reduce its dimension.

(a) In order to visualize the dataset, apply a two-dimensional PCA to the dataset and plot the coefficients for the first two principal components. Use orthogonalized power iteration to compute the two principal components yourself. (Don’t forget to subtract off the mean!) Color the points according to the number represented by the image in the sample, i.e. the value given in the first column of mtrain.csv. (You can use the first 1000 rows, since plotting 60,000 points takes a while.)
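A minimal sketch of orthogonalized power iteration follows, in plain Python on toy-sized data (for 784-dimensional MNIST rows you would want matrix operations in R or a numerical library). All function names here are illustrative and not part of mnist_intro. The iteration repeatedly multiplies k vectors by the covariance matrix and re-orthonormalizes them with Gram-Schmidt.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def center(X):
    """Subtract the column means from every row; return (centered rows, mean)."""
    n, d = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(d)]
    return [[row[j] - mean[j] for j in range(d)] for row in X], mean

def covariance(Xc):
    """Sample covariance matrix (d x d) of already-centered rows."""
    n, d = len(Xc), len(Xc[0])
    return [[sum(r[a] * r[b] for r in Xc) / (n - 1) for b in range(d)]
            for a in range(d)]

def top_components(C, k=2, iters=500, seed=0):
    """Orthogonalized power iteration: k orthonormal leading eigenvectors of C."""
    rng = random.Random(seed)
    V = [[rng.gauss(0.0, 1.0) for _ in range(len(C))] for _ in range(k)]
    for _ in range(iters):
        W = [[dot(row, v) for row in C] for v in V]  # multiply each vector by C
        V = []
        for w in W:  # Gram-Schmidt: remove earlier directions, then normalize
            for q in V:
                c = dot(w, q)
                w = [wi - c * qi for wi, qi in zip(w, q)]
            norm = math.sqrt(dot(w, w))
            V.append([wi / norm for wi in w])
    return V

def scores(Xc, V):
    """PCA coefficients of each centered row: one coordinate per component."""
    return [[dot(row, v) for v in V] for row in Xc]
```

The two columns returned by scores are the coordinates to plot, colored by the digit label. The fixed iteration count is a simplification; a convergence check on the change in V would be a reasonable refinement.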

(b) Apply the PCA to reduce the dimensionality of the dataset from 784 to a dimension k. (Don’t forget to subtract off the mean!) For several different values of k, do the following:

i. Determine the fraction of the total variance captured by the k-dimensional PCA.

ii. In the file mnist_intro, the function show_image displays the image given a vector of pixels. (For example, if the vector v contains the 784 pixels of a particular image, then show_image(v) will display the image.) For each value of k, compute the projection of the image (i.e. the 784-dimensional vector) onto the first k principal components. (The projected image will still be a 784-dimensional vector, but it will be determined by k PCA coefficients, one for each principal component.) Then use show_image to compare the original image to the projected image. For what k can you begin to discern the number in the projected image?

What value of k do you think captures the dataset well?
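The two computations in (b) can be sketched as below, in Python rather than R and with illustrative function names. The sketch assumes the principal directions in V are orthonormal and that C is the covariance matrix of the centered data, so the variance captured along a direction v is v·Cv and the total variance is the trace of C.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project(x, mean, V):
    """Return (coefficients, reconstruction) for one image vector x.

    V holds k orthonormal principal directions. The reconstruction has the
    same length as x but is determined by only k coefficients.
    """
    xc = [xi - mi for xi, mi in zip(x, mean)]  # subtract off the mean
    coeffs = [dot(xc, v) for v in V]
    recon = list(mean)
    for c, v in zip(coeffs, V):
        recon = [ri + c * vi for ri, vi in zip(recon, v)]
    return coeffs, recon

def variance_fraction(C, V):
    """Share of the total variance (trace of C) captured by the directions in V."""
    captured = sum(dot(v, [dot(row, v) for row in C]) for v in V)
    total = sum(C[j][j] for j in range(len(C)))
    return captured / total
```

In the assignment's setting, the reconstruction returned by project is the vector to pass to show_image next to the original.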

(c) Given your results in (b), choose a dimension k and reduce the dataset from 784 dimensions to k dimensions. Then build a classifier based on the k-dimensional dataset. Fit the logistic regression to the whole dataset using a stochastic gradient approach as discussed in the YouTube video mentioned above. Use the mtest.csv dataset to test the accuracy of your classifier. Comment on the time needed to compute the logistic regression and on its accuracy relative to what you found in homework 6.
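The stochastic gradient fit in (c) might look like the following sketch, in Python rather than R. It uses single-sample updates with a fixed step size, which is one common variant and not necessarily the one presented in the lecture. In the assignment, the rows of X would be the k PCA coefficients and y would be 1 when the digit is a 3, as in homework 6. The toy data, function names, and hyperparameters are illustrative.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def sgd_logistic(X, y, epochs=50, step=0.1, seed=0):
    """Fit weights w and intercept b with single-sample stochastic gradient steps."""
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a fresh random order
        for i in order:
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            g = p - y[i]  # derivative of the log-loss w.r.t. the linear score
            w = [wj - step * g * xj for wj, xj in zip(w, X[i])]
            b -= step * g
    return w, b

def accuracy(w, b, X, y):
    """Fraction of rows whose predicted label (score > 0) matches y."""
    preds = [1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0 for x in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)

# Toy one-dimensional, linearly separable data (hypothetical).
X_toy = [[-2.0], [-1.0], [1.0], [2.0]]
y_toy = [0, 0, 1, 1]
w_toy, b_toy = sgd_logistic(X_toy, y_toy)
```

Each update touches only one row, so the cost per epoch grows with the number of rows times k rather than times 784, which is the source of the timing difference the problem asks about.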
