
Neural Networks Without Biology
More on "Neural Networks"
Yujia Yan
Fall 2018
ECE 477 Computer Audition
Outline
Neural Networks Without Biology
Neural Networks Without Biology
"Linear Regression"
y = Wx + b (1)
"Polynomial Regression"
y = Wφ(x) + b (2)
where φ(x) gives the polynomial basis, e.g., [x1, x1^2, x2, x2^2, ...]^T
"Adaptive Basis Regression"
y = W2 f(W1 x + b1) + b2 (3)
This is exactly a neural network with a single hidden layer.
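Eq. (3) is small enough to write out directly. A minimal NumPy sketch (the dimensions and the random weights are illustrative, not from the slides):

```python
import numpy as np

def one_hidden_layer(x, W1, b1, W2, b2, f=np.tanh):
    """Eq. (3): y = W2 f(W1 x + b1) + b2."""
    return W2 @ f(W1 @ x + b1) + b2

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 3)), np.zeros(8)   # hidden width 8, input dim 3
W2, b2 = rng.normal(size=(2, 8)), np.zeros(2)   # output dim 2
x = rng.normal(size=3)
y = one_hidden_layer(x, W1, b1, W2, b2)
```

With f the identity, the model collapses back to linear regression, which makes the "adaptive basis" reading concrete.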
The Universal Approximator
Feedforward neural network with a single hidden layer:
y = W2 f(W1 x + b1) + b2 (4)
The Universal Approximation Theorem states that, provided the nonlinear activation function f(·) fulfills some mild conditions, this network can approximate any continuous function on a bounded domain if the hidden layer is wide enough.
Neural Networks: Going Deep
Feedforward neural network with a single hidden layer:
y = W2 f(W1 x + b1) + b2 (5)
Interpretation: we compute the similarity between the entries of W1 = [w1; w2; ...] and x by taking inner products, obtaining a basis for regression.
However, it is not efficient to memorize many patterns this way.
Solution: going deep:
y = WN f(... W3 f(W2 f(W1 x + b1) + b2) + b3 ...) (6)
The number of patterns that can be memorized grows exponentially with the number of layers! But sometimes 'wide' is also needed (Why?)
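The deep network of Eq. (6) is just Eq. (5) iterated layer by layer. A sketch with hypothetical layer sizes:

```python
import numpy as np

def deep_forward(x, layers, f=np.tanh):
    """Apply h -> f(W h + b) for each hidden layer; the final layer stays linear."""
    h = x
    for W, b in layers[:-1]:
        h = f(W @ h + b)
    W, b = layers[-1]
    return W @ h + b

rng = np.random.default_rng(1)
dims = [4, 16, 16, 2]                     # input -> hidden -> hidden -> output
layers = [(rng.normal(size=(m, n)), np.zeros(m))
          for n, m in zip(dims[:-1], dims[1:])]
y = deep_forward(rng.normal(size=4), layers)
```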
Neural Networks: A Composition of Operations
In fact, we can compose anything into a neural network, as long as we know how to train it.
Neural Networks: A Composition of Operations
From the computation perspective, a neural network can be viewed
as a set of operations composed within a computational graph.
Example:
For training:
L(W2 f(W1 x + b1) + b2, Y_GT)
where L is the loss function and Y_GT is the fitting target. The corresponding computational graph is:
[Figure: computational graph — W1 and x enter a matmul node, b1 is added, f is applied, the result enters a matmul with W2, b2 is added, and the output feeds L together with Y_GT]
Neural Networks: Computational Graph
[Figure: computational graph — W1 and x enter a matmul node, b1 is added, f is applied, the result enters a matmul with W2, b2 is added, and the output feeds L together with Y_GT]
1. Nodes without incoming edges are variables
2. Nodes with incoming edges are operations, producing
intermediate variables
3. Edge var → Op means var is an argument of Op
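These three rules suggest a direct data structure: store each node's operation and argument list, and order nodes so that every operation comes after its arguments. A minimal sketch (the node names are made up for illustration):

```python
# Each node: name -> (operation, argument names); a node with no
# arguments is a variable (rule 1); the rest are operations (rule 2).
graph = {
    "W1": (None, []), "x": (None, []), "b1": (None, []),
    "u":  ("matmul", ["W1", "x"]),
    "v":  ("add",    ["u", "b1"]),
    "h":  ("f",      ["v"]),
}

def topo_order(graph):
    """Order node names so every operation comes after its arguments."""
    order, seen = [], set()
    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for arg in graph[name][1]:   # edge arg -> name (rule 3)
            visit(arg)
        order.append(name)
    for name in graph:
        visit(name)
    return order
```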
A little bit of vector calculus
Jacobian Matrix
∂y/∂x =
[ ∂y1/∂x1  ···  ∂y1/∂xN ]
[    ···    ···     ···  ]
[ ∂yM/∂x1  ···  ∂yM/∂xN ]
Example:
For y = Ax, where A is a matrix and x is a vector:
∂y/∂x = A
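This example can be checked numerically with finite differences; the helper below is a quick sketch, not a library routine:

```python
import numpy as np

def num_jacobian(fn, x, eps=1e-6):
    """Forward-difference Jacobian: column j approximates d fn / d x_j."""
    y0 = fn(x)
    J = np.zeros((y0.size, x.size))
    for j in range(x.size):
        xp = x.copy()
        xp[j] += eps
        J[:, j] = (fn(xp) - y0) / eps
    return J

A = np.array([[1., 2., 0.],
              [0., 3., 1.]])
x = np.array([0.5, -1.0, 2.0])
J = num_jacobian(lambda v: A @ v, x)   # should match A up to eps
```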
Gradient
We use the column vector convention for the gradient:
∇_x L = [∂L/∂x1, ..., ∂L/∂xN]^T = (∂L/∂x)^T
Neural Networks: Training
Training is performed by minimizing the loss function with stochastic gradient descent:
θ ← θ − η ∇_θ L_batch
where θ denotes the parameters of the model and η is the step size.
It is called stochastic because the gradient w.r.t. θ is evaluated only over a small random subset of the data (a minibatch).
Many different methods exist for averaging and scaling the gradient when updating the parameters: Adam, RMSProp, SGD with momentum, etc.
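A bare-bones sketch of the update θ ← θ − η ∇_θ L_batch on a synthetic linear-regression problem (all sizes and constants below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=256)   # noisy targets

w, eta, batch = np.zeros(3), 0.1, 32
for step in range(500):
    idx = rng.choice(len(X), size=batch, replace=False)   # random minibatch
    Xb, yb = X[idx], y[idx]
    grad = 2.0 / batch * Xb.T @ (Xb @ w - yb)   # gradient of mean squared error
    w -= eta * grad                              # the SGD update
```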
Automatic Reverse-Mode Differentiation
How do we calculate gradients?
Explicitly storing the computational graph and the value of every node allows gradients to be computed automatically from the end to the beginning (in reverse topological order), which is known as Automatic Reverse-Mode Differentiation, or Backpropagation.
Modern deep learning frameworks:
1. TensorFlow, MXNet: build the graph first, then perform computation using the graph (static graph)
2. PyTorch, DyNet, TensorFlow eager: record the graph while doing the computation (dynamic graph)
Chain rule with computational graph
Now assume that all nodes (variables and intermediate variables), except for the last node (a scalar function), are vectors.
For one node in the computational graph:
[Figure: a node Var whose output feeds several operations op_1(..., Var, ...), ..., op_K(..., Var, ...)]
∇_Var L = Σ_i (∂op_i / ∂Var)^T ∇_{op_i} L
where ∂op_i / ∂Var is the Jacobian matrix of op_i with respect to Var.
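The sum over i matters when a variable feeds several operations: each consumer contributes one term. A quick numerical check on a made-up two-path loss:

```python
import numpy as np

# x feeds two operations: op1 = A x and op2 = x * x (elementwise);
# the loss sums both, so grad_x = A^T 1 + 2 x  -- one term per consumer.
A = np.array([[1., 2.], [3., 4.]])
x = np.array([0.5, -1.0])
loss = lambda v: (A @ v).sum() + (v * v).sum()

grad_analytic = A.T @ np.ones(2) + 2 * x

eps, grad_num = 1e-6, np.zeros(2)
for j in range(2):
    xp = x.copy()
    xp[j] += eps
    grad_num[j] = (loss(xp) - loss(x)) / eps
```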
Implementing reverse-mode AD requires storing the intermediate values of all nodes, which usually uses a lot of memory.
Also, propagating the gradient along an edge is multiplicative, which means it is easy to get overflow (gradient explosion) or underflow (gradient vanishing).
Elementwise Nonlinear Function
Most nonlinear functions used in neural networks are elementwise, or can be constructed from a combination of matrix multiplication and elementwise nonlinear functions.
The Jacobian of an elementwise function is diagonal, so the corresponding term in the chain rule can be computed element by element:
(∇_Var L)_i = (∂f_i / ∂Var_i) (∇_f L)_i
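The elementwise shortcut can be checked against the full diagonal-Jacobian chain rule; tanh is used here as an example activation:

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0])
grad_out = np.array([1.0, 2.0, 3.0])          # gradient arriving at f(x)

f = np.tanh
fprime = lambda v: 1.0 - np.tanh(v) ** 2      # derivative of tanh

# Full chain rule with the (diagonal) Jacobian ...
full = np.diag(fprime(x)).T @ grad_out
# ... equals the cheap elementwise product f'(x_i) * (grad_out)_i:
elementwise = fprime(x) * grad_out
```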
Vectorization
What if a variable is not a vector, e.g., a matrix or a tensor?
We vectorize it (assuming column vectors):
vec( [1 2; 3 4] ) = [1, 3, 2, 4]^T
Usually this is just the storage layout of the matrix/tensor, so there is no additional cost.
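In NumPy, column-major vectorization is a reshape with order='F' (Fortran order):

```python
import numpy as np

M = np.array([[1, 2],
              [3, 4]])
vec = M.reshape(-1, order='F')   # stack the columns: column-major vectorization
print(vec)                        # [1 3 2 4]
```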
Vectorization of Matrix-Matrix Multiplication
Matrix multiplication is important because it can represent the largest portion of the operations in a neural network (linear layers, convolution layers, etc.).
We use the identity
vec(ABC) = (C^T ⊗ A) vec(B)
where ⊗ is the Kronecker product:
A ⊗ B =
[ a11 B  ···  a1N B ]
[   ···   ···   ··· ]
[ aM1 B  ···  aMN B ]
Good news: typically there is no need to compute the Kronecker product explicitly.
Examples:
Assume A and X are M×N and N×P matrices, respectively. Then
vec(AX) = vec(I_M A X) = (X^T ⊗ I_M) vec(A)
so
∂vec(AX)/∂vec(A) = X^T ⊗ I_M
To propagate the gradient from AX to A, write vec(δ_AX) = ∇_{vec(AX)} L, where δ_AX has the same shape as AX:
vec(δ_A) = (X^T ⊗ I_M)^T vec(δ_AX) = (X ⊗ I_M) vec(δ_AX) = vec(δ_AX X^T)
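Both the identity and the resulting gradient rule can be verified numerically (vec below is column-major, matching the column-vector convention; δ_AX is a made-up incoming gradient):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, P = 2, 3, 4
A = rng.normal(size=(M, N))
X = rng.normal(size=(N, P))

vec = lambda Z: Z.reshape(-1, order='F')     # column-major vectorization

# Identity: vec(AX) = (X^T kron I_M) vec(A)
lhs = vec(A @ X)
rhs = np.kron(X.T, np.eye(M)) @ vec(A)

# Gradient rule: vec(dA) = (X kron I_M) vec(dAX)  <=>  dA = dAX X^T
dAX = rng.normal(size=(M, P))                # made-up incoming gradient
dA_kron = np.kron(X, np.eye(M)) @ vec(dAX)
dA_fast = vec(dAX @ X.T)
```

The last two lines are the point of the "good news": the cheap matrix product δ_AX X^T gives the same result as the explicit Kronecker form.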
Vectorization of Matrix-Matrix Multiplication: Another Side
Similarly, to propagate the gradient from AX to X:
vec(δ_X) = vec(A^T δ_AX)
If we view multiplication by A as an operator inside a neural network, then the gradient is propagated by applying its transposed operator.
This applies to all finite-dimensional linear operators used in machine learning.
For example, to calculate the gradient through a convolution, we apply the transposed convolution.
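The same kind of numerical check works for the X side; δ_AX below is again a made-up incoming gradient:

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, P = 2, 3, 4
A = rng.normal(size=(M, N))
dAX = rng.normal(size=(M, P))     # made-up gradient arriving at AX

vec = lambda Z: Z.reshape(-1, order='F')

# Jacobian of vec(AX) w.r.t. vec(X) is (I_P kron A); its transpose
# propagates the gradient: vec(dX) = (I_P kron A)^T vec(dAX) = vec(A^T dAX)
dX_kron = np.kron(np.eye(P), A).T @ vec(dAX)
dX_fast = vec(A.T @ dAX)
```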
Congratulations!
Now you know how to implement your own deep learning framework, and how these frameworks work.
[Figure: ingredients of deep learning — Electricity, Data, Model, GPU Workstation, Human Effort]