
1 Mathematical Question

1.1 Background

Stochastic Gradient Descent (SGD) is a commonly used optimization method for large-scale machine learning

models. Specifically, a typical machine learning model is defined as follows:

\[ \min_{x} f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x) , \tag{1} \]

where $x \in \mathbb{R}^d$ represents the model parameter, $n$ is the number of samples, and $f_i(\cdot)$ denotes the loss function for the $i$-th sample. To optimize large-scale machine learning models, SGD randomly selects a subset of samples at each iteration to compute the stochastic gradient

\[ g_t \triangleq \frac{1}{|S|} \sum_{i \in S} \nabla f_i(x_t) , \tag{2} \]

where $x_t$ is the model parameter at the $t$-th iteration, $S \subset [n]$ is the randomly selected subset, and $|S|$ denotes the size of the subset. Note that $g_t$ is an unbiased estimate of the full gradient, i.e. $\mathbb{E}[g_t] = \nabla f(x_t)$. Then, SGD uses this stochastic gradient to update the model parameter:

\[ x_{t+1} = x_t - \eta g_t . \tag{3} \]
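As a minimal illustration of Eqs. (1)–(3), the following NumPy sketch runs mini-batch SGD on a toy problem. All names, and the choice of a least-squares loss, are our own for illustration and not part of the assignment:

```python
import numpy as np

def sgd(grad_fn, x0, n, eta=0.01, batch_size=32, T=1000, seed=0):
    """Minimal mini-batch SGD for min_x (1/n) sum_i f_i(x), per Eqs. (1)-(3).

    grad_fn(x, idx) must return the averaged gradient over the sampled
    indices idx, i.e. the stochastic gradient g_t of Eq. (2).
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(T):
        idx = rng.choice(n, size=batch_size, replace=False)  # sample S subset of [n]
        g = grad_fn(x, idx)   # g_t, Eq. (2)
        x = x - eta * g       # update, Eq. (3)
    return x

# Toy example: least-squares loss f_i(x) = 0.5 * (a_i^T x - b_i)^2
rng = np.random.default_rng(1)
A = rng.normal(size=(1000, 5))
x_true = np.arange(5.0)
b = A @ x_true                        # noiseless targets

def grad_fn(x, idx):
    # Mini-batch gradient: (1/|S|) sum_{i in S} a_i (a_i^T x - b_i)
    r = A[idx] @ x - b[idx]
    return A[idx].T @ r / len(idx)

x_hat = sgd(grad_fn, np.zeros(5), n=1000, eta=0.05, batch_size=64, T=2000)
```

On this noiseless problem the iterates converge close to `x_true`; with noisy targets the bounded-variance term of Assumption 2 would leave a residual error, matching the $\eta \sigma^2 L$ term in the analysis below.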

To study the convergence rate of SGD, we make the following assumptions.

Assumption 1 The loss function $f(x)$ has a Lipschitz continuous gradient with parameter $L > 0$, i.e.
\[ \|\nabla f(x) - \nabla f(y)\| \le L \|x - y\| , \quad \forall x \in \mathbb{R}^d, \forall y \in \mathbb{R}^d . \tag{4} \]

Assumption 2 The stochastic gradient has bounded variance, i.e.
\[ \mathbb{E}[\|g_t - \nabla f(x_t)\|^2] \le \sigma^2 . \tag{5} \]

Based on these two assumptions, we can prove the convergence of SGD as follows.

Proof. According to Assumption 1, we have
\begin{align*}
\mathbb{E}[f(x_{t+1})] &\le \mathbb{E}[f(x_t)] + \mathbb{E}[\langle \nabla f(x_t), x_{t+1} - x_t \rangle] + \frac{L}{2} \mathbb{E}[\|x_{t+1} - x_t\|^2] \\
&= \mathbb{E}[f(x_t)] - \eta \, \mathbb{E}[\langle \nabla f(x_t), g_t \rangle] + \frac{\eta^2 L}{2} \mathbb{E}[\|g_t\|^2] \\
&= \mathbb{E}[f(x_t)] - \eta \, \mathbb{E}[\|\nabla f(x_t)\|^2] + \frac{\eta^2 L}{2} \mathbb{E}[\|g_t - \nabla f(x_t) + \nabla f(x_t)\|^2] \\
&\le \mathbb{E}[f(x_t)] - \eta \Big( 1 - \frac{\eta L}{2} \Big) \mathbb{E}[\|\nabla f(x_t)\|^2] + \frac{\eta^2 L}{2} \mathbb{E}[\|g_t - \nabla f(x_t)\|^2] \\
&\le \mathbb{E}[f(x_t)] - \eta \Big( 1 - \frac{\eta L}{2} \Big) \mathbb{E}[\|\nabla f(x_t)\|^2] + \frac{\eta^2 \sigma^2 L}{2} . \tag{6}
\end{align*}
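For completeness, the step in Eq. (6) that splits $\mathbb{E}[\|g_t - \nabla f(x_t) + \nabla f(x_t)\|^2]$ uses the bias–variance decomposition. Taking the expectation conditioned on $x_t$ first, the cross term vanishes because $\mathbb{E}[g_t \mid x_t] = \nabla f(x_t)$:

```latex
\mathbb{E}[\|g_t - \nabla f(x_t) + \nabla f(x_t)\|^2]
= \mathbb{E}[\|g_t - \nabla f(x_t)\|^2]
+ 2\,\mathbb{E}[\langle g_t - \nabla f(x_t), \nabla f(x_t) \rangle]
+ \mathbb{E}[\|\nabla f(x_t)\|^2]
= \mathbb{E}[\|g_t - \nabla f(x_t)\|^2] + \mathbb{E}[\|\nabla f(x_t)\|^2] .
```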

By setting $\eta \le 1/L$, we have $1 - \eta L / 2 \ge 1/2$, and hence
\[ \mathbb{E}[f(x_{t+1})] \le \mathbb{E}[f(x_t)] - \frac{\eta}{2} \mathbb{E}[\|\nabla f(x_t)\|^2] + \frac{\eta^2 \sigma^2 L}{2} . \tag{7} \]

Summing up over $t$ from $0$ to $T$, we have
\[ \mathbb{E}[f(x_{T+1})] \le f(x_0) - \frac{\eta}{2} \sum_{t=0}^{T} \mathbb{E}[\|\nabla f(x_t)\|^2] + \frac{\eta^2 \sigma^2 L (T+1)}{2} . \tag{8} \]

Finally, rearranging and dividing by $\eta (T+1) / 2$, we can get
\[ \frac{1}{T+1} \sum_{t=0}^{T} \mathbb{E}[\|\nabla f(x_t)\|^2] \le \frac{2 \big( f(x_0) - \mathbb{E}[f(x_{T+1})] \big)}{\eta (T+1)} + \eta \sigma^2 L . \tag{9} \]

By choosing $\eta = O(1/\sqrt{T})$, we have the convergence rate of SGD as follows:
\[ \frac{1}{T+1} \sum_{t=0}^{T} \mathbb{E}[\|\nabla f(x_t)\|^2] \le O\Big( \frac{1}{\sqrt{T}} \Big) . \tag{10} \]
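To make the final step concrete, here is the plug-in computation, under the additional (standard) assumptions that $f$ is lower bounded by $f^*$ and that $T \ge L^2$, so that $\eta = 1/\sqrt{T} \le 1/L$ is admissible:

```latex
\frac{1}{T+1} \sum_{t=0}^{T} \mathbb{E}[\|\nabla f(x_t)\|^2]
\le \frac{2\sqrt{T}\,\big( f(x_0) - f^* \big)}{T+1} + \frac{\sigma^2 L}{\sqrt{T}}
\le \frac{2\big( f(x_0) - f^* \big) + \sigma^2 L}{\sqrt{T}}
= O\Big( \frac{1}{\sqrt{T}} \Big) ,
```

where the second inequality uses $\sqrt{T} \cdot \sqrt{T} = T \le T + 1$.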

1.2 Question

Although each SGD iteration is cheap, the stochastic gradient has a large variance as an estimate of the full gradient, so SGD converges slowly. Now, to accelerate convergence, assume we have a new gradient estimator $v_t$ for the full gradient $\nabla f(x_t)$, and use it to update the model parameter:
\[ x_{t+1} = x_t - \eta v_t . \tag{11} \]

This new estimator is not an unbiased estimator of $\nabla f(x_t)$, i.e. $\mathbb{E}[v_t] \ne \nabla f(x_t)$. However, it satisfies the following condition:
\[ \sum_{t=0}^{T} \mathbb{E}[\|v_t - \nabla f(x_t)\|^2] \le \eta^2 L^2 \sum_{t=0}^{T} \mathbb{E}[\|v_t\|^2] . \tag{12} \]

Q: Given Eq. (4), Eq. (11), and Eq. (12), please prove that this method (using $v_t$) can achieve the following convergence rate
\[ \frac{1}{T+1} \sum_{t=0}^{T} \mathbb{E}[\|\nabla f(x_t)\|^2] \le O\Big( \frac{1}{T} \Big) , \tag{13} \]
and please give the value of the learning rate $\eta$.

2 Coding Question

2.1 Background

Graph convolutional neural networks (GCNs) have attracted a surge of attention in recent years due to their superior performance on graph data. A typical GCN layer is defined as follows:
\[ Z^{(l)} = W^{(l)} H^{(l-1)} A , \qquad H^{(l)} = \sigma(Z^{(l)}) , \tag{14} \]

where $H^{(l)} = [h_1^{(l)}, h_2^{(l)}, \cdots, h_n^{(l)}] \in \mathbb{R}^{d_l \times n}$ is the hidden feature of all nodes in the $l$-th layer, $Z^{(l)} = [z_1^{(l)}, z_2^{(l)}, \cdots, z_n^{(l)}] \in \mathbb{R}^{d_l \times n}$ is the hidden feature before the non-linear activation, $\sigma(\cdot)$ is the non-linear activation function, and $W^{(l)} \in \mathbb{R}^{d_l \times d_{l-1}}$ is the model parameter in the $l$-th layer. Compared with fully connected neural networks, GCN has an aggregation operation to leverage the relational information, denoted by $H^{(l-1)} A$, where $A \in \mathbb{R}^{n \times n}$ determines how to aggregate neighbours. Specifically, a typical fully connected layer is defined as follows:
\[ z_i^{(l)} = W^{(l)} h_i^{(l-1)} . \tag{15} \]


In contrast, in the standard GCN [1], the GCN layer is defined as follows:
\[ z_i^{(l)} = \sum_{j \in \mathcal{N}_i} \frac{1}{\sqrt{|\mathcal{N}_i| \, |\mathcal{N}_j|}} W^{(l)} h_j^{(l-1)} , \tag{16} \]
where $\mathcal{N}_i$ denotes the set of neighbours of the $i$-th node and $|\mathcal{N}_i|$ represents the degree of the $i$-th node.
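The aggregation of Eq. (16) is just Eq. (14) with a symmetrically normalized adjacency matrix. A minimal NumPy sketch of one layer, with ReLU as the activation $\sigma$ and our own function names (self-loops are assumed already added to the adjacency matrix, as in Kipf & Welling):

```python
import numpy as np

def normalized_adjacency(adj):
    """Symmetric normalization of Eq. (16): entry (i, j) is 1/sqrt(|N_i||N_j|)
    for j in N_i. `adj` is a binary adjacency matrix with self-loops added,
    so the row sum |N_i| counts the node itself."""
    deg = adj.sum(axis=1)                       # |N_i|
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return adj * np.outer(d_inv_sqrt, d_inv_sqrt)

def gcn_layer(H, A_hat, W):
    """One GCN layer per Eq. (14): H^(l) = sigma(W^(l) H^(l-1) A_hat),
    with ReLU as sigma. H is d_{l-1} x n, W is d_l x d_{l-1}, A_hat is n x n."""
    Z = W @ H @ A_hat
    return np.maximum(Z, 0.0)                   # ReLU activation

# Tiny illustrative graph: 3 nodes on a path, self-loops included
adj = np.array([[1, 1, 0],
                [1, 1, 1],
                [0, 1, 1]], dtype=float)
A_hat = normalized_adjacency(adj)
H0 = np.eye(3)                                  # one-hot input features (d_0 = n = 3)
W = np.ones((2, 3))                             # hypothetical 2-dim hidden layer
H1 = gcn_layer(H0, A_hat, W)                    # hidden features, shape (2, 3)
```

Note the column convention here ($H \in \mathbb{R}^{d \times n}$) follows Eq. (14); the pygcn code referenced below uses the row convention $H A^{\top} W^{\top}$ instead, which is equivalent up to transposition.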

The aggregation operation brings new challenges for training GCNs. In particular, different nodes depend on each other in a GCN, while samples are independent in regular deep neural networks. Therefore, existing GCNs usually use the full gradient descent method to learn model parameters, which avoids dealing with the dependence between different nodes; see, for example, https://github.com/tkipf/pygcn.

2.2 Question

Please use the stochastic gradient descent method (SGD or its variants) to train a GCN. In other words, please re-implement the training method in https://github.com/tkipf/pygcn. At each iteration, you need to sample a sub-graph rather than use the entire graph to learn the model parameters (here is an example using SGD to train a regular CNN: https://github.com/pytorch/examples/blob/master/mnist/main.py). Note that the prediction performance will become worse when using SGD (or its variants) to train a GCN; sampling a good sub-graph will alleviate this issue.
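One simple strategy for the sampling step is node sampling: draw a random node subset, take the induced sub-graph, renormalize its adjacency, and run the forward pass only on those nodes. The sketch below (NumPy, forward pass only; names are our own, and the backward pass/optimizer would come from PyTorch as in pygcn) illustrates the idea. More sophisticated samplers, such as neighbour or layer-wise sampling, generally give better sub-graphs:

```python
import numpy as np

def sample_subgraph(adj, batch_size, rng):
    """Node sampling: pick a random node subset and take the induced
    sub-graph, then renormalize its adjacency on the sub-graph so every
    sampled node keeps at least its self-loop."""
    nodes = rng.choice(adj.shape[0], size=batch_size, replace=False)
    sub_adj = adj[np.ix_(nodes, nodes)]
    deg = sub_adj.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1.0))
    return nodes, sub_adj * np.outer(d_inv_sqrt, d_inv_sqrt)

# Sketch of one stochastic training step on a random graph
rng = np.random.default_rng(0)
n, d = 100, 8
adj = (rng.random((n, n)) < 0.05).astype(float)
adj = np.maximum(adj, adj.T)                  # symmetrize
np.fill_diagonal(adj, 1.0)                    # add self-loops
X = rng.normal(size=(d, n))                   # node features, d x n as in Eq. (14)
W = rng.normal(size=(4, d)) * 0.1             # layer weights

nodes, A_hat = sample_subgraph(adj, batch_size=16, rng=rng)
H = np.maximum(W @ X[:, nodes] @ A_hat, 0.0)  # GCN layer on the sub-graph only
```

The loss would then be computed on the labeled nodes inside `nodes`, and the resulting stochastic gradient used to update `W`, exactly as in the SGD loop of Section 1. Because the sub-graph drops edges leaving the sampled set, this gradient is biased, which is one reason prediction performance degrades and why the quality of the sampler matters.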

References

[1] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
