Mathematical Question

1 Mathematical Question

1.1 Background

Stochastic Gradient Descent (SGD) is a commonly used optimization method for large-scale machine learning

models. Specifically, a typical machine learning model is defined as follows:

min

x f(x) = 1

i=1

fi(x) , (1)

where x 2 Rd represents the model parameter, n is the number of sample, fi(·) denotes the loss function for

the i-th sample. To optimize large-scale machine learning models, SGD randomly selects a subset of samples

at each iteration to compute the stochastic gradient

gt , 1

|S|

i2S

rfi(xt) , (2)

where xt is the model parameter at the t-th iteration, S ⇢ [n] is the randomly selected subset and |S| denotes

the size of the subset. Note that gt is the unbiased estimation of the full gradient, i.e. E[gt] = rf(xt). Then,

SGD uses this stochastic gradient to update the model parameter:

xt+1 = xt $ ⌘gt . (3)

To study the convergence rate of SGD, we have the following assumptions.

Assumption 1 The loss function f(x) has the Lipschitz continuous gradient with parameter L > 0, i.e.

kf(x) $ f(y)k  Lkx $ yk, 8x 2 Rd, 8y 2 Rd . (4)

Assumption 2 The gradient has bounded variance, i.e.

E[kgt $ rf(x)k2]  "2 . (5)

Based on these two assumptions, we can prove the convergence of SGD as follows.

Proof. According to Assumption 1, we have

E[f(xt+1)]  E[f(xt)] + E[hrf(xt), xt+1 $ xti] + L

E[kxt+1 $ xtk2]

= E[f(xt)] $ ⌘E[hrf(xt), gti] + ⌘2L

E[kgtk2]

= E[f(xt)] $ ⌘E[krf(xt)k2] + ⌘2L

E[kgt $ rf(x) + rf(x)k2]

 E[f(xt)] $ ⌘(1 $ ⌘L

2 )E[krf(xt)k2] + ⌘2L

E[kgt $ rf(x)k2]

 E[f(xt)] $ ⌘(1 $ ⌘L

2 )E[krf(xt)k2] + ⌘2"2L

(6)

By setting ⌘ < 1/L, we can get

E[f(xt+1)]  E[f(xt)] $ ⌘

E[krf(xt)k2] + ⌘2"2L

2 (7)

Summing up over t from 0 to T + 1, we have

E[f(xT +1)]  f(x0) $ ⌘

t=0

Ekrf(xt)k2] + ⌘2"2LT

2 (8)

Finally, we can get

t=0

E[krf(xt)k2]  2(f(x0) $ f(xT ))

⌘T + ⌘"2L (9)

By choosing ⌘ = O( p

T ), we have the convergence rate of SGD as follows:

t=0

E[krf(xt)k2]  O( 1

T ) (10)

1.2 Question

Although SGD is fast at each iteration, yet it has large variance when estimating the full gradient. Therefore,

its convergence speed is slow. Now, to accelerate the convergence speed, assume we have a new gradient

estimator vt for the full gradient rf(xt), and use it to update the model parameter:

xt+1 = xt $ ⌘vt . (11)

For this new estimator, it is not an unbiased estimator for rf(xt), i.e. E[vt] 6= rf(xt). But, we have the

following condition

t=0

E[kvt $ rf(x)k2]  ⌘2L2X

t=0

E[kvtk2]. (12)

Q: Given Eq.(4), Eq.(11), and Eq.(12), please prove that this method (using vt) can achieve the following

convergence rate

t=0

E[krf(xt)k2]  O(

T ) , (13)

and please give the value of the learning rate ⌘.

2 Coding Question

2.1 Background

Graph convolutional neural networks (GCN) have attracted a surge of attention in recent years, due to the

superior performance on the graph data. A typical GCN layer is defined as follows:

Z(l) = W(l)

H(l#1)A ,

H(l) = "(Z(l)

) , (14)

where H(l) = [h(l)

1 , h(l)

2 , ··· , h(l)

n ] 2 Rdl⇥n is the hidden feature of all nodes in the l-th layer, Z(l) =

(l)

1 , z(l)

2 , ··· , z(l)

n ] 2 Rdl⇥n is the hidden feature before conducting non-linear activation operation, "(·)

is the non-linear activation function, W(l) 2 Rdl⇥dl!1 is the model parameter in the l-th layer. Compared

with fully connected neural networks, GCN has an aggregation operation to leverage the relational information, which is denoted by H(l#1)A where A 2 Rn⇥n determines how to aggregate neighbours. Specifically, a

typical fully connected layer is defined as follows:

(l)

i = W(l)

h(l#1)

i . (15)

On the contrary, in the standard GCN [1], the GCN layer is defined as follows:

(l)

i = X

j2Ni

p|Ni||Nj |

W(l)

h(l#1)

j , (16)

where Ni denotes the neighbours of the i-th node and |Ni| represents the degree of the i-th node.

The aggregation operation brings new challenges for training GCN. In particular, di↵erent nodes are

dependent on each other in GCN, while samples are independent in regular deep neural networks. Therefore,

existing GCNs usually use the full gradient descent method to learn model parameters to avoid dealing with

the dependence between di↵erent nodes, such as this method https://github.com/tkipf/pygcn.

2.2 Question

Please use stochastic gradient descent method (SGD or its variants) to train GCN. In other words, please

re-implement the training method in https://github.com/tkipf/pygcn. At each iteration, you need to

sample a sub-graph rather than the entire graph to learn the model parameter (Here is an example using

SGD to train the regular CNN https://github.com/pytorch/examples/blob/master/mnist/main.py).

Note that the prediction performance will become worse when using SGD (or its variants) to train GCN.

Sampling a good subgraph will alleviate this issue.

References

1. T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint

arXiv:1609.02907, 2016.

联系我们

QQ：99515681
邮箱：99515681@qq.com
工作时间：8:00-21:00
微信：codinghelp

热点文章

辅导 ddes9012 critical appro... 2025-07-15
讲解 biochemical engineering... 2025-07-15
讲解 infs 5135 assignment –... 2025-07-15
辅导 visual analytics resit ... 2025-07-15
辅导 fite7407 securities tra... 2025-07-15
讲解 aero2363 – aerospace s... 2025-07-15
辅导 gfw0033 pembangunan &am... 2025-07-15
讲解 surfaces of section讲解... 2025-07-15
讲解 rlgn 3270 guru and disc... 2025-07-15
gsoe9340讲解、辅导 java/pyt... 2025-07-15
辅导 assignment讲解 java语言 2025-07-15
讲解 cs 6043 – fall 2024 ho... 2025-07-15
辅导 imel7000/ecen7003_001 f... 2025-07-15
讲解 program、辅导 java/c++编... 2025-07-14
辅导 program、讲解 python/c+... 2025-07-14
讲解 a34315 number theory讲解... 2025-07-14
辅导 a36779 statistical meth... 2025-07-14
讲解 a37847 algebra & co... 2025-07-14
辅导 auto msc consultancy pr... 2025-07-14
讲解 hse208 integrated human... 2025-07-14

热点标签

msinm014/msing014/msing014b

联系我们 - QQ: 99515681 微信：codinghelp

程序辅导网！