讲解 STAT GR5241 Spring 2023 Final讲解 Python语言

STAT GR5241 Spring 2023 Final

May 5, 2023

Students are allowed two pages of handwritten notes (front and back). Blank pages at the end can be used for scrap work. Calculators are allowed and will be useful. Tablets, phones, computers and other equivalent forms of technology are strictly prohibited. Students are not allowed to communicate with their neighbors during the exam. Please have your student ID available. Cheating will result in a score of zero. Good luck!

Section

Problem

Points

Score

Section I: Decision Trees and Boosting

AdaBoost

P1.i

P1.ii

Binary Decision Tree

P2.i

P2.ii

P2.iii

P2.iv

P2.v

Section II: Neural Networks

Fully-connected Neural Networks

P3.i

P3.ii

P3.iii

Section III: Support Vector Machine

Support Vector Machine

P4.i

P4.ii

Section IV: Clustering

Hierarchical Clustering

P5.i

P5.ii

P5.iii

P5.iv

Section V: Theory
Penalized Logistic Regression	P6	10
EM Algorithm and Clustering	P7	10
Section VI: Quiz
TRUE/FALSE	P8	10

Total

100

Problem 1: AdaBoost [15 points]

In this problem we perform. the adaboost classification algorithm described in the lecture notes and homework 5. The dataset of interest consists of n = 100 training cases, p = 2 continuous features x1 , x2 , and dichotomous response Y labeled Y = −1 and Y = 1. The training data is displayed in Figure 1, which shows x2 versus x1 split by Y. Figure 1 also displays the location of a single test case xtest and training case x17 .

Figure 1: Problem 1 training data

The weak learner is a decision stump, i.e., for the bth boosting iteration:

The decision stump is trained by minimizing weighted 0-1 loss:

Our adaboost classifier is constructed from B = 8 weak learners. For boosting iterations b = 1, 2, . . . , 8, the trained decision stumps c1 , c2 ,..., c8 are:

The weighted errors ϵb are:

Solve problems (1.i)-(1.ii):

1.i [10 points] Classify the test case xtest = (0.600 0.395, based on the trained adaboost model. Use all B = 8 weak learners to estimate Y(ˆ)test. Show the details of your calculation for full credit.

1.ii [5 points] Does the structure of the training data (Figure 1) provide an advantage or disadvantage when using adaboost with decision stumps? Describe your answer in a few sentences.

Problem 2: Binary Decision Tree [20 points]

Consider the decision tree shown below in Figure 2 for a binary classification problem. Assume the classes are denoted as + and -, respectively. Suppose there are four binary attributes in the data, A, B , C, and D. The counts shown in the leaf nodes of the tree correspond to the number of training examples assigned to the nodes. Assume that the decision tree classifier assigns the majority class of training examples as the class label of each leaf node.

Figure 2: Unpruned decision tree for Problem 2

Solve problems (2.i)-(2.v):

2.i [3 points] Calculate the training error rate of the decision tree shown in Figure 2.

2.ii [5 points] Calculate the generalization error rate of the decision tree using the validation set given below in Figure 3. Note that the wildcard * shown in the table means the value could be either 0 or 1.

Figure 3: Validation set for Problem 2

2.iii [6 points] Calculate the training and generalization error estimate of the pruned decision tree shown in Figure 4.

Figure 4: Pruned decision tree for Problem 2

2.iv [3 points] Based on your estimate of generalization error, which tree should be preferred (the unpruned tree given in Figure 2 or the pruned tree given in Figure 4). Justify your choice.

2.v [3 points] Draw the confusion matrix for the tree on the training data. A confusion matrix is a table that summarizes the number of examples correctly or incorrectly predicted by the model. For example:

Figure 5: An example confusion matrix for Problem 2

In the above table, n+− is the number of positive examples incorrectly predicted as negative class

Problem 3: Fully-connected Neural Networks [20 points]

In this problem we fit a neural network to perform classification on the same n = 100 training data from Problem 1. The dichotomous response Y is labeled Y = 0 and Y = 1, instead of −1 and 1. Our neural network consists of a single hidden layer with d = 2 derived features, sigmoid activation function and softmax output function. The schematic of this model is displayed below:

The objective or total cost Q(θ) is cross-entropy:

The neural network is trained by minimizing Q(θ) with respect to θ, where parameter θ is the collection of weights and biases W[1] , b[1] , W [2] and b[2] . The quantity Qi (θ) represents the ith training case’scontribu- tion to total cross-entropy Q(θ). The neural network is trained via gradient descent, yielding the estimated weights:

Solve problems (3.i)-(3.iii):

The following problems can be simplified using matrix multiplication. Show the details of your calcu- lation for full credit.

3.i [5 points] Classify the test case xtest = (0.600 0.395, based on the trained neural network.

3.ii [5 points] The 17th case of the training data is x17 = (0.4474 0.8764, with label y17 = 0. Compute the cost of case 17, i.e., compute Q17 (θ(ˆ)).

3.iii [10 points] Note that the gradient of Q(θ) can be expressed

Use the backpropagation algorithm to compute ΔQ17 (θ(ˆ)). This quantity represents the 17th case’s

contribution to the full gradient of Q(θ), evaluated at the point θ = θ(ˆ) . Show the details of your calculation for full credit.