Warning: Homeworks will not be graded if submitted after the deadline. For all problems, show
detailed reasoning.
0. (Reading assignment) DL Book Chapter 7
1. (Back-propagation for a 100-layer network – 10 points) Implement the training
algorithm for Example 1 on page 42 of lecture notes #6. Assume m = 2 and x = y = (-0.5, 0.5)^T,
i.e., we want to train a neural network to output -0.5 when the input is -0.5 and to output 0.5
when the input is 0.5. Assume l = 100, i.e., 100 layers. We want to see if such a deep
neural network can be trained. Assume also a learning rate η = 0.1, MaxIter = 1000, and
σ^(k)(x) = tanh(x), k = 1, 2, ..., l. Note that tanh'(x) = sech^2(x) = 4/(e^x + e^{-x})^2, which can be
used in your code. Include your code and plots in your homework. You may use any programming
language.
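As a starting point, here is a minimal sketch of such a training loop in Python/NumPy. It assumes the Example 1 setup from the lecture notes: one scalar weight w_k per layer, h_k = tanh(h_{k-1} w_k), and cost J = (1/(2m)) ||h_l - y||^2. These architectural details are assumptions; adjust them to match lecture notes #6.

```python
import numpy as np

l, m = 100, 2
eta, max_iter = 0.1, 1000
x = y = np.array([-0.5, 0.5])

def sech2(u):
    return 4.0 / (np.exp(u) + np.exp(-u)) ** 2  # tanh'(u)

def train(w_init):
    w = np.full(l, float(w_init))
    costs = []
    for _ in range(max_iter):
        # forward propagation: h[k] is h_k; u[k] pairs with h[k+1] = tanh(u[k])
        h, u = [x], []
        for k in range(l):
            u.append(h[-1] * w[k])
            h.append(np.tanh(u[-1]))
        costs.append(0.5 / m * np.sum((h[-1] - y) ** 2))
        # backward propagation: g[k] = dJ/du for layer k (0-based)
        g = [None] * l
        g[l - 1] = (h[-1] - y) / m * sech2(u[-1])
        for k in range(l - 2, -1, -1):
            g[k] = g[k + 1] * w[k + 1] * sech2(u[k])
        # gradient-descent update on each scalar weight
        for k in range(l):
            w[k] -= eta * np.dot(h[k], g[k])
    return costs

costs = train(1.0)  # part (a): all weights initialized to 1.0
```

Under the same assumptions, `train(5.0)` and `train(0.9)` correspond to the failing cases in parts (b) and (d).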
(a) Run your code with all weights initialized to 1.0. Plot the cost J defined on page 40 of lecture
notes #6 as a function of the iteration number. The cost should converge to 0, i.e., training should be
successful. Print out h_{k,i} and g_{k,i}, i = 1, 2, obtained after just one iteration, as functions of
k = 1, 2, ..., l.
(b) Run your code with all weights initialized to 5. This time, the cost will be stuck at about
0.125, i.e., training will fail. Print out h_{k,i} and g_{k,i}, i = 1, 2, obtained after just one iteration,
as functions of k = 1, 2, ..., l. How are they different from the h_{k,i}'s and g_{k,i}'s you obtained in
(a)?
(c) Explain why training fails in (b). Try to see how the h_{k,i}'s behave as k increases (forward
propagation) and how the gradients g_{k,i} behave as k decreases (backward propagation).
You will be able to see the g_{k,i}'s vanish as k decreases, which is known as the vanishing gradient
problem. Why does the gradient vanish as k decreases? How does the actual gradient ∇_w J
behave?
(d) Run your code with all weights initialized to 0.9. This time, the cost will be stuck at about
0.125, i.e., training will fail. Print out h_{k,i} and g_{k,i}, i = 1, 2, obtained after just one iteration,
as functions of k = 1, 2, ..., l. How are they different from the h_{k,i}'s and g_{k,i}'s you obtained in
(a) and (c)?
(e) Explain why training fails in (d). Try to see how the h_{k,i}'s behave as k increases (forward
propagation) and how the gradients g_{k,i} behave as k decreases (backward propagation).
You will be able to see the g_{k,i}'s vanish as k decreases. Is this the only reason why training
fails? How does the actual gradient ∇_w J behave?
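For parts (c) and (e), a quick numerical look at the per-layer backward scaling factor can be helpful. The sketch below assumes the Example 1 recursion g_{k-1} = w_k sech^2(u_{k-1}) g_k (an assumption; check it against the lecture notes), so each backward step scales |g_k| by at most w * max_i sech^2(u_{k,i}).

```python
import numpy as np

def sech2(u):
    return 4.0 / (np.exp(u) + np.exp(-u)) ** 2  # tanh'(u)

def layer_factor_product(w, l=100):
    """Product over all layers of the (maximal) backward scaling factor."""
    h = np.array([-0.5, 0.5])
    factor = 1.0
    for _ in range(l):
        u = h * w
        factor *= w * sech2(u).max()  # upper bound on the per-layer factor
        h = np.tanh(u)
    return factor

# With w = 5, tanh saturates, sech^2(u) becomes tiny, and the product collapses.
# With w = 0.9, the factor is roughly 0.9 per layer, so it still decays
# geometrically over 100 layers -- two distinct routes to vanishing gradients.
print(layer_factor_product(5.0), layer_factor_product(0.9))
```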
2. (Back-propagation with bias terms – 10 points)
(a) Let's introduce a bias term in each layer in Example 1 on pages 40–42 of lecture notes #6,
i.e., h_k = σ^(k)(h_{k-1} w_k + 1 b_k), k = 1, 2, ..., l, where 1 is the all-one vector of length m and
b_k ∈ R is the bias term at the k-th layer. Note that you need 1 since there are m training
examples. Define u_k = h_{k-1} w_k + 1 b_k and g_k = ∇_{u_k} J, k = 1, 2, ..., l. Express g_l in terms
of other variables such as u_l, h_l, and y. Express g_{k-1} in terms of other variables including
g_k, i.e., backward propagation. Express ∂J/∂w_k and ∂J/∂b_k, k = 1, 2, ..., l, using h_k and g_k. Show
the whole training algorithm, including forward and backward propagation and gradient
descent, as pseudo-code.
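To see how the bias enters the forward pass, here is a short sketch assuming Example 1's scalar per-layer parameters (w_k, b_k ∈ R) and tanh activations; the all-one vector times b_k is realized implicitly by NumPy broadcasting b_k across the m examples.

```python
import numpy as np

l, m = 4, 2                      # illustrative sizes (the homework uses l = 100)
x = np.array([-0.5, 0.5])
w, b = np.ones(l), np.zeros(l)

h, us = x, []
for k in range(l):
    u = h * w[k] + b[k]          # u_k = h_{k-1} w_k + 1 b_k (b_k broadcast to all m examples)
    us.append(u)
    h = np.tanh(u)               # h_k = sigma^(k)(u_k)
```

Because b_k is added to every one of the m examples, every example contributes to its gradient, which is exactly why the all-one vector appears in the formulation.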
(b) For Example 2 on pages 43–46 of lecture notes #6, show that G^(2) = (1/m)(softmax(U^(2)) - Y^T), where Y_{j,i} = 1
if j = y_i and 0 otherwise, as defined on page 44 of lecture notes #6.
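The identity in (b) can be sanity-checked numerically. The sketch below assumes the notes define J_MLE = -(1/m) Σ_i log softmax(U^(2))[i, y_i], with U^(2) of shape (m, n_2) and Y of shape (n_2, m) the one-hot matrix Y[j, i] = 1 iff j = y_i; it then compares the closed-form gradient (1/m)(softmax(U^(2)) - Y^T) against central finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n2 = 4, 3                          # illustrative sizes
U2 = rng.normal(size=(m, n2))
y = rng.integers(0, n2, size=m)
Y = np.zeros((n2, m))
Y[y, np.arange(m)] = 1.0              # Y[j, i] = 1 iff j = y_i

def softmax(U):
    E = np.exp(U - U.max(axis=1, keepdims=True))  # numerically stabilized
    return E / E.sum(axis=1, keepdims=True)

def J(U):
    return -np.mean(np.log(softmax(U)[np.arange(m), y]))

G2 = (softmax(U2) - Y.T) / m          # claimed closed form

# central finite differences of J with respect to each entry of U2
eps, num = 1e-6, np.zeros_like(U2)
for i in range(m):
    for j in range(n2):
        Up, Um = U2.copy(), U2.copy()
        Up[i, j] += eps
        Um[i, j] -= eps
        num[i, j] = (J(Up) - J(Um)) / (2 * eps)

print(np.abs(G2 - num).max())         # maximal deviation between the two
```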
(c) Let's introduce a bias term in the first layer in Example 2 on pages 43–46 of lecture notes
#6, i.e., H = σ(XW^(1) + 1b^T), where 1 is the all-one vector of length m and b is the bias
vector of length n_1. Note that you need 1 since there are m training examples. Define
U^(1) = XW^(1) + 1b^T and G^(1) = ∇_{U^(1)} J_MLE. Assume the other variables, such as U^(2)
and G^(2), are defined the same way as in Example 2. Express G^(1) in terms of other
variables such as U^(1), G^(2), and W^(2). Express ∇_{W^(1)} J_MLE and ∇_b J_MLE using other
variables such as X and G^(1).
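A quick shape check of the first-layer expression H = σ(XW^(1) + 1b^T) can help keep the dimensions straight. Here σ = tanh and the sizes are illustrative assumptions, not values from the notes.

```python
import numpy as np

m, n0, n1 = 5, 4, 3                # m examples, input width n0, layer width n1
rng = np.random.default_rng(1)
X = rng.normal(size=(m, n0))
W1 = rng.normal(size=(n0, n1))
b = rng.normal(size=n1)

ones = np.ones((m, 1))
U1 = X @ W1 + ones @ b[None, :]    # 1 b^T: each of the m rows gets a copy of b
H = np.tanh(U1)

# NumPy broadcasting adds b to every row, so forming 1 b^T explicitly is optional:
assert np.allclose(U1, X @ W1 + b)
```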