
Introduction to Machine Learning, COMP0088 (A7P)

Practice Paper, 2023/24

Suitable for Cohorts: 2021/22, 2022/23, 2023/24

This paper consists of SIX questions. Answer ALL SIX questions.

Marks are distributed as follows:

•  Q1: 24 marks

•  Q2: 12 marks

•  Q3: 12 marks

•  Q4: 12 marks

•  Q5: 16 marks

•  Q6: 24 marks

Answers may be handwritten or typed.  Diagrams and plots may be sketched by hand or generated computationally.  You may use software such as NumPy, Matlab or R for linear algebraic calculations, getting PDF values, etc., but you must explain what you are doing using text, mathematics and diagrams. Do not include code in your answers.

Include all relevant working and/or reasoning for your answers—unjustified answers will receive no marks even if they are numerically correct.  Keep answers concise and to the point. Prioritise clarity—answers that are illegible or incomprehensible will also receive no marks.

Marks for each part of each question are indicated in square brackets.

Standard calculators are permitted.

Table 1: XOR training data for Question 1

x1  x2  y
0   0   0
0   1   1
1   0   1
1   1   0

1.  Table 1 shows inputs and outputs for the logical XOR operation. Consider this as training data for a supervised learning model with d = 2 input features x1 and x2, and a single binary output, y.

a.  Consider using the perceptron algorithm to learn a linear decision boundary of the form wᵀx = 0,

where the weight vector w ∈ R^(d+1) is to be learned and the input vectors x ∈ R^d are augmented with a dummy feature x0 = 1 to capture an intercept via the weight element w0.

Show that the algorithm will not converge for the training data in Table 1. [6 marks]
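
Although the paper forbids code in answers, the following NumPy sketch is an editorial illustration of the behaviour part a asks you to prove: running the standard perceptron update (w ← w + y·x with labels remapped to ±1, an assumed but common formulation) on the four XOR points never drives the per-epoch mistake count to zero.

```python
# Minimal sketch (not an exam answer): the perceptron cycles on XOR and
# the number of mistakes per epoch never reaches zero, so it cannot converge.
import numpy as np

X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])  # x0 = 1 prepended
y = np.array([-1, +1, +1, -1])                               # XOR, remapped to +/-1

w = np.zeros(3)
for epoch in range(20):
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * (w @ xi) <= 0:      # misclassified (or on the boundary)
            w += yi * xi            # perceptron update
            mistakes += 1
    print(epoch, mistakes, w)       # mistakes never settles at 0
```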

b.  One way to address the problem in part a is to apply some transformation of the input features into a new basis, φ : x ↦ x′, and find the boundary for x′.

i.  Suggest a suitable transformation φ such that the problem becomes solvable.    [3 marks]

ii.  Sketch the effective decision boundary produced under this transformation as it would appear in the original feature space. [3 marks]
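
As an editorial illustration of one workable choice (many transformations work), the sketch below adds the product feature x1·x2 and checks that the transformed XOR data becomes linearly separable; the weights shown are hand-picked and purely hypothetical.

```python
# Sketch: adding the product feature x1*x2 makes XOR linearly separable.
# The weight vector below is just one separating choice, not "the" answer.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

phi = np.column_stack([np.ones(4), X, X[:, 0] * X[:, 1]])  # (1, x1, x2, x1*x2)
w = np.array([-0.5, 1.0, 1.0, -2.0])                       # hand-picked, hypothetical

pred = (phi @ w > 0).astype(int)
print(pred, (pred == y).all())                             # True: all four separated
```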

Figure 1: Simple multilayer perceptron

c.  An alternative to basis expansion is to use a more complex model to fit the data. Figure 1 shows one such model, a simple multi-layer perceptron with a single hidden layer of two neurons.  As in part a, the input is augmented with a dummy feature x0 = 1 to capture bias. The hidden neurons apply ReLU activation to their outputs. The output layer applies the same decision function as the linear model in part a, i.e. it classifies a positive output as class 1 and a zero or negative output as class 0. Note that there is no bias term at the output layer.

i. Find weight matrices W1 ∈ R^(3×2) and W2 ∈ R^(2×1) such that this model separates the Table 1 data.  [9 marks]

ii.  Sketch the resulting decision boundary.  [3 marks]
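
The sketch below verifies one candidate weight assignment against all four training points. The matrices are an illustrative choice that happens to satisfy the constraints, not a unique answer.

```python
# Sketch: one candidate solution where hidden pre-activations are
# x1 + x2 and x1 + x2 - 1, and the output is h1 - 2*h2.
import numpy as np

X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)  # [x0, x1, x2]
y = np.array([0, 1, 1, 0])

W1 = np.array([[0.0, -1.0],   # biases of the two hidden units (row for x0)
               [1.0,  1.0],   # weights on x1
               [1.0,  1.0]])  # weights on x2
W2 = np.array([[1.0], [-2.0]])

H = np.maximum(X @ W1, 0.0)              # ReLU hidden layer
out = (H @ W2).ravel()
print(out, (out > 0).astype(int) == y)   # predictions match all four labels
```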

[Total for Question 1: 24 marks]

Figure 2: Training data for Question 2

2.      a.  Panel a of Figure 2 shows a training set for a binary classification problem to be fitted by a linear support vector machine.

(You may use any convenient method to reproduce the figure for your answers: print, photograph, screen capture or sketch it by hand.)

i.  Sketch the decision boundary found by a hard margin linear SVM and identify the support vectors. [2 marks]

ii.  Sketch a decision boundary that might be found by a soft margin linear SVM with a reasonable value for the cost parameter C, and identify the support vectors. Briefly explain how you have interpreted ‘reasonable’.  [2 marks]

iii. How would removal of the point marked with an arrow affect the fit in each case? [2 marks]
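
Since Figure 2 is not reproduced here, the sketch below uses stand-in blob data purely to show how hard- and soft-margin linear SVM fits and their support vectors can be inspected with scikit-learn; a very large C approximates the hard margin.

```python
# Sketch with made-up stand-in data (Figure 2 is not reproduced here):
# fit hard- and soft-margin linear SVMs and read off the support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.5, (20, 2)),
               rng.normal([3, 3], 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

hard = SVC(kernel="linear", C=1e6).fit(X, y)   # ~hard margin
soft = SVC(kernel="linear", C=1.0).fit(X, y)   # soft margin, moderate cost

for name, m in [("hard", hard), ("soft", soft)]:
    w, b = m.coef_[0], m.intercept_[0]
    print(name, "boundary: %.2f x1 + %.2f x2 + %.2f = 0" % (w[0], w[1], b),
          "| #support vectors:", len(m.support_vectors_))
```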

b.  The dataset in panel b of Figure 2 defines three classes, but SVMs are inherently binary classifiers.

i. Briefly explain two ways in which SVMs could be used to classify this data. Which of these approaches would you expect to be more efficient in this case?  [3 marks]

ii.  Sketch boundaries for SVMs trained with this more efficient method. (Again, you may use any convenient method to reproduce the figure for your answer.)  [2 marks]

iii. Estimate how the point marked with a purple cross would be classified under this scheme.  [1 mark]
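
For reference, the sketch below contrasts the two standard multi-class reductions for SVMs, one-vs-rest and one-vs-one, again on stand-in data since Figure 2(b) is not reproduced here.

```python
# Sketch: the two standard multi-class reductions for a binary classifier.
# One-vs-rest trains k machines; one-vs-one trains k(k-1)/2 machines.
import numpy as np
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.4, (15, 2)) for c in ([0, 0], [3, 0], [1.5, 2.5])])
y = np.repeat([0, 1, 2], 15)

ovr = OneVsRestClassifier(SVC(kernel="linear", C=1.0)).fit(X, y)  # 3 machines
ovo = OneVsOneClassifier(SVC(kernel="linear", C=1.0)).fit(X, y)   # also 3 when k = 3
print(ovr.predict([[1.5, 1.0]]), ovo.predict([[1.5, 1.0]]))
```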

[Total for Question 2: 12 marks]

Table 2: Unlabelled sample data for Question 3

3.  Table 2 shows a small dataset of unlabelled data with a single feature dimension.

a.  Assign this data to two clusters using the k-Means algorithm.

i.  Choose initial cluster centroid values, c1 and c2. Briefly justify your choice.  [1 mark]

ii.  Iterate k-Means to convergence and give the final centroid estimates.  [2 marks]
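
By way of illustration, the sketch below runs Lloyd's k-means iteration to convergence on placeholder 1D values; the Table 2 data, which is not reproduced here, would be substituted for `data`.

```python
# Sketch of Lloyd's k-means in 1D. `data` is a placeholder for Table 2.
import numpy as np

data = np.array([1.0, 2.0, 3.0, 8.0, 9.0, 10.0])  # placeholder values
c = np.array([data.min(), data.max()])            # one simple initialisation

while True:
    assign = np.argmin(np.abs(data[:, None] - c[None, :]), axis=1)  # nearest centroid
    new_c = np.array([data[assign == j].mean() for j in range(2)])  # recompute means
    if np.allclose(new_c, c):
        break
    c = new_c
print("final centroids:", c)
```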

b. Do a single fitting iteration of a two-component Gaussian mixture model.

i.  Use the results from part a to initialise the model. Use the centroid values c1, c2 as the initial means μ1, μ2. Choose initial values for the component probabilities α1, α2 and variances σ1², σ2² in some way. Explain your choices.  [2 marks]

ii. Estimate the responsibilities γi,j of each component j for each sample i.  Remember to show your working. [4 marks]

iii. Use the responsibilities to update your estimates of μ1, μ2, σ1², σ2², α1 and α2.  [3 marks]
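
The sketch below performs exactly one EM iteration (E-step responsibilities, then M-step updates) for a two-component 1D mixture. All numeric values are placeholders standing in for the Table 2 data and the part-a initialisation.

```python
# Sketch: a single EM iteration for a two-component 1D Gaussian mixture,
# initialised from k-means output. All values below are placeholders.
import numpy as np
from scipy.stats import norm

data = np.array([1.0, 2.0, 3.0, 8.0, 9.0, 10.0])   # placeholder for Table 2
mu = np.array([2.0, 9.0])                           # k-means centroids (placeholder)
var = np.array([1.0, 1.0])                          # one simple initial variance
alpha = np.array([0.5, 0.5])                        # equal mixing weights

# E-step: gamma[i, j] proportional to alpha_j * N(x_i; mu_j, var_j)
pdf = norm.pdf(data[:, None], loc=mu, scale=np.sqrt(var))
gamma = alpha * pdf
gamma /= gamma.sum(axis=1, keepdims=True)

# M-step: re-estimate means, variances and mixing weights from gamma
Nj = gamma.sum(axis=0)
mu = (gamma * data[:, None]).sum(axis=0) / Nj
var = (gamma * (data[:, None] - mu) ** 2).sum(axis=0) / Nj
alpha = Nj / len(data)
print(mu, var, alpha)
```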

[Total for Question 3:  12  marks]

Table 3: Training data for Question 4.

4.  Table 3 lists a simple dataset with two numeric input features, x1 and x2, and categorical class labels, y. You wish to train a decision stump classifier on this data.

a. Calculate the misclassification error, cross entropy and Gini impurity for the complete data, with no split.  (Use natural, base-e logarithms for the entropy calculation.) [1 mark]

b.  The decision stump will split the data by a simple inequality test on a single feature value.  Identify every possible such split, and for each one evaluate the total loss according to the three metrics above. [7 marks]

c.  Choose an optimal split point and comment briefly on your choice.  [2 marks]

d. What class would be assigned to the point (4, 2) according to this split?  [1 mark]

e.  How, if at all, would your split choice have been affected if a minimum node size of 3 were imposed? [1 mark]
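
To see how such a tabulation can be mechanised, the sketch below enumerates every threshold split on each feature and scores it under all three metrics; `X` and `y` are placeholders, since Table 3 is not reproduced here.

```python
# Sketch: score every candidate stump split with misclassification error,
# cross entropy (natural log) and Gini impurity. X, y are placeholders.
import numpy as np

X = np.array([[1, 1], [2, 3], [3, 2], [4, 4], [5, 1]], dtype=float)  # placeholder
y = np.array([0, 0, 1, 1, 1])                                        # placeholder

def impurities(labels):
    p = np.bincount(labels, minlength=2) / len(labels)
    miscls = 1 - p.max()
    entropy = -sum(pi * np.log(pi) for pi in p if pi > 0)
    gini = 1 - (p ** 2).sum()
    return miscls, entropy, gini

for f in range(X.shape[1]):
    for t in np.unique(X[:, f])[:-1]:          # thresholds between observed values
        left, right = y[X[:, f] <= t], y[X[:, f] > t]
        # weighted total loss over the two child nodes, for each metric
        total = [(len(left) * l + len(right) * r) / len(y)
                 for l, r in zip(impurities(left), impurities(right))]
        print(f"x{f+1} <= {t}:", np.round(total, 3))
```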

[Total for Question 4: 12  marks]

Table 4: 1D training data for Question 5

5.  Table 4 gives a very simple 1D data set, with a single input feature x sampled at unit intervals and corresponding output labels y. Consider this as training data for a regression model that includes an intercept term.

a. Manually fit the training data using the ordinary least squares procedure.

i.  Augment the data with a dummy intercept feature x0 = 1 and construct the design matrix X.  [1 mark]

ii. Use your design matrix to write down a pair of simultaneous equations in w = [w0, w1] that define the least squares fit. [2 marks]

iii.  Solve these equations and use the results to sketch the OLS regression line. [2 marks]
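
For checking purposes, the simultaneous equations in part a are the normal equations (XᵀX)w = Xᵀy, which can be verified numerically as in the sketch below; the data values are placeholders for Table 4.

```python
# Sketch: solve the normal equations (X^T X) w = X^T y for the OLS fit.
# The x/y values are placeholders standing in for the Table 4 data.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])         # placeholder inputs at unit intervals
y = np.array([1.0, 2.2, 2.9, 4.1])         # placeholder labels

X = np.column_stack([np.ones_like(x), x])  # design matrix with dummy x0 = 1
w = np.linalg.solve(X.T @ X, X.T @ y)      # solve the two simultaneous equations
print("w0, w1 =", w)
```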

b.  The unregularised least squares fit in the previous part may have overfitted. Try adding L2 regularisation to the fit.

i. Adapt the previous simultaneous equations to perform ridge regression on the same data, with a regularisation coefficient λ = 1. [2 marks]

ii.  Solve these equations and sketch the new regression line. [2 marks]

c.  It is usually recommended not to regularise the intercept term in linear and other models. Here you will investigate the effects of doing so.

i.  Construct a new set of labels, ỹ = y + 100. [1 mark]

ii.  Define and solve two pairs of simultaneous equations for this new regression problem.  In one set, apply the ridge penalty to both w0  and w1 .  In the other apply it only to w1 . [4 marks]

iii.  Sketch both fit lines and comment on the difference. [2 marks]
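
The contrast in part c can be checked numerically with the sketch below, which solves the ridge system once with the penalty applied to both weights and once with the intercept exempted; the data values are again placeholders for Table 4.

```python
# Sketch: ridge solutions with and without an intercept penalty. With the
# shifted labels y + 100, penalising w0 drags the fitted line down, while
# exempting w0 leaves the shift intact. Placeholder data stands in for Table 4.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.2, 2.9, 4.1]) + 100.0    # shifted labels y~ = y + 100
X = np.column_stack([np.ones_like(x), x])
lam = 1.0

R_full = lam * np.eye(2)                      # penalise both w0 and w1
R_slope = lam * np.diag([0.0, 1.0])           # penalise w1 only

w_full = np.linalg.solve(X.T @ X + R_full, X.T @ y)
w_slope = np.linalg.solve(X.T @ X + R_slope, X.T @ y)
print("penalise both:", w_full, "| penalise w1 only:", w_slope)
```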

[Total for Question 5:  16  marks]

Figure 3: Operation graph for Question 6

6. Figure 3 shows an operation graph representing the following function:

a.  Calculate the gradients of this operation graph via backpropagation, filling in the appropriate boxes on the figure.

(You may use any convenient method to reproduce the figure for your answers: print, photograph, screen capture or sketch it by hand.)

i. Perform a forward pass through the graph given the input z = 1. Fill in the values propagated forward at each point in the shaded boxes above the arrows.  [5 marks]

ii.  Calculate the local gradients at each node with respect to its inputs.  Fill these values in the boxes underneath each node. (Note that there are two boxes under the multiplication node, corresponding to its two inputs.) [5 marks]


iii.  The downstream loss gradient at the output of this graph is given to be 1.  Do a backward pass through the network to calculate the loss gradients at each point, filling in the values in the boxes underneath the arrows.

Hint: when there are multiple outputs from a node, the overall downstream gradient for that node is the sum of the gradients returning via each path. [5 marks]

iv.  Expand and differentiate the algebraic version of the function given above, and confirm that your backpropagated gradient for the whole expression is correct.  [1 mark]

b. Manually minimise the above function by gradient descent.

i.  Starting from z = 1 and with a learning rate of 0.1, estimate the minimum value of this function for 5 iterations. How close is your estimate to the true value?    [5 marks]

ii. What happens if the learning rate is instead 1/12?  [1 mark]

iii. What happens if the learning rate is instead 1/6?  [1 mark]

iv. What happens if the learning rate is instead > 1/6?  [1 mark]
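
Finally, a generic gradient-descent loop of the kind part b requires is sketched below. The Figure 3 function is not reproduced here, so f and its derivative are placeholders to be replaced with the actual expression; the closing comment records the standard quadratic analysis behind the three learning-rate regimes.

```python
# Sketch: plain gradient descent. The quadratic below is only a placeholder
# (an assumption, NOT the Figure 3 function); substitute the real f and df.
def f(z):        # placeholder objective
    return 6.0 * z ** 2

def df(z):       # its derivative
    return 12.0 * z

z, lr = 1.0, 0.1
for step in range(5):
    z -= lr * df(z)                   # gradient descent update
    print(step + 1, "z =", round(z, 6), "f(z) =", round(f(z), 6))

# For a quadratic with constant second derivative c, the update scales the
# error by (1 - lr*c): lr < 2/c converges, lr = 2/c oscillates forever,
# and lr > 2/c diverges.
```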



[Total for Question 6: 24  marks]



