"Machine Learning Algorithms and Implementations - Python Programming and Application Examples" Neural Network Training - Back Propagation Algorithm
There is a problem in multi-layer neural networks: the parameters of the last (output) layer can be solved in this way because its target values are known, but hidden-layer nodes have no true output values in the training data, so a loss function cannot be constructed for them directly.
The back propagation algorithm solves this problem; it is essentially an application of the chain rule of differentiation.
Following the general approach of machine learning, we first determine the objective function of the neural network, and then use the stochastic gradient descent optimization algorithm to find the parameter values that minimize the objective function.
Take the sum of squared errors of all output layer nodes of the network as the objective function:
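E_d \equiv \frac{1}{2} \sum_{i \in outputs} (t_i - y_i)^2

(The factor of 1/2 is the usual convention; it simply cancels the 2 that appears when the squared term is differentiated.)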
Here, Ed represents the error on sample d, ti is the label (target) value of the sample for output node i, and yi is the corresponding output value of the neural network.
Then, the stochastic gradient descent algorithm is used to optimize the objective function:
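For each weight wji this means repeatedly applying the update

w_{ji} \leftarrow w_{ji} - \eta \frac{\partial E_d}{\partial w_{ji}}

where \eta is the learning rate.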
The stochastic gradient descent algorithm therefore needs the partial derivative (that is, the gradient) of the error Ed with respect to each weight wji. How do we compute it?
Observing the figure above, we can see that the weight wji can only affect the rest of the network through the weighted input of node j. Let netj be the weighted input of node j, that is:
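net_j = \sum_{i} w_{ji} x_{ji}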
Ed is a function of netj, and netj is a function of wji. According to the chain rule, we can get:
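\frac{\partial E_d}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j} \cdot \frac{\partial net_j}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j} \cdot x_{ji}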
In the above formula, xji is the input value passed from node i to node j, which is also the output value of node i.
For the derivation of ∂Ed/∂netj, it is necessary to distinguish between the output layer and the hidden layer.
1. Output layer weight training
For the output layer, netj can only affect the rest of the network through the output value yj of node j; that is, Ed is a function of yj, and yj is a function of netj, where yj = sigmoid(netj). So we can apply the chain rule again:
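\frac{\partial E_d}{\partial net_j} = \frac{\partial E_d}{\partial y_j} \cdot \frac{\partial y_j}{\partial net_j}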
where:
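\frac{\partial E_d}{\partial y_j} = \frac{\partial}{\partial y_j} \frac{1}{2} \sum_{i \in outputs} (t_i - y_i)^2 = -(t_j - y_j)

\frac{\partial y_j}{\partial net_j} = \frac{\partial\, \mathrm{sigmoid}(net_j)}{\partial net_j} = y_j (1 - y_j)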
Substituting the first and second terms, we get:
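\frac{\partial E_d}{\partial net_j} = -(t_j - y_j)\, y_j (1 - y_j)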
If we define δj = −∂Ed/∂netj, that is, the error term δ of a node is the negative of the partial derivative of the network error with respect to the node's weighted input, then substituting into the above formula we get:
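\delta_j = (t_j - y_j)\, y_j (1 - y_j)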
Substituting the above derivation into the stochastic gradient descent formula, we get:
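w_{ji} \leftarrow w_{ji} + \eta\, \delta_j\, x_{ji}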
2. Hidden layer weight training
Now we need to derive ∂Ed/∂netj for the hidden layer:
First, we need to define Downstream(j), the set of all nodes directly downstream of node j. For example, for node 4, its direct downstream nodes are node 8 and node 9. We can see that netj can only affect Ed by affecting Downstream(j). Let netk be the weighted input of a downstream node k of node j; then Ed is a function of the netk, and each netk is a function of netj. Because there are multiple netk, we apply the total derivative formula and can make the following deduction:
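\frac{\partial E_d}{\partial net_j}
= \sum_{k \in Downstream(j)} \frac{\partial E_d}{\partial net_k} \cdot \frac{\partial net_k}{\partial net_j}
= \sum_{k \in Downstream(j)} -\delta_k \cdot \frac{\partial net_k}{\partial a_j} \cdot \frac{\partial a_j}{\partial net_j}
= \sum_{k \in Downstream(j)} -\delta_k\, w_{kj}\, a_j (1 - a_j)
= -a_j (1 - a_j) \sum_{k \in Downstream(j)} \delta_k\, w_{kj}

where aj denotes the output value of node j.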
Since δj = −∂Ed/∂netj, substituting into the above formula we get:
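\delta_j = a_j (1 - a_j) \sum_{k \in Downstream(j)} \delta_k\, w_{kj}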
So far, we have derived the back propagation algorithm. Note that the training rules derived above are based on a sigmoid activation function, the squared-error loss, a fully connected network, and the stochastic gradient descent optimization algorithm. If the activation function, the error measure, the network connection structure, or the optimization algorithm is different, the specific training rules will also be different. But in every case, the training rules are derived in the same way: by applying the chain rule.
3. Concrete example
First, a forward pass computes the output value of every node; then the error term δi of each node is calculated as follows:
For output layer node i
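\delta_i = y_i (1 - y_i)(t_i - y_i)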
Here, δi is the error term of node i, yi is the output value of node i, and ti is the target value of the sample corresponding to node i. For example, according to the figure, the output value of output layer node 8 is y1 and the corresponding target value is t1, so the error term of node 8 is:
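\delta_8 = y_1 (1 - y_1)(t_1 - y_1)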
For hidden layer nodes
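\delta_i = a_i (1 - a_i) \sum_{k \in Downstream(i)} w_{ki}\, \delta_k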
Here, ai is the output value of node i, wki is the weight of the connection from node i to node k in the next layer, and δk is the error term of that node k. For example, for hidden layer node 4, the calculation is as follows:
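\delta_4 = a_4 (1 - a_4)(w_{84}\, \delta_8 + w_{94}\, \delta_9)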
Finally, update the weights on each connection:
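w_{ji} \leftarrow w_{ji} + \eta\, \delta_j\, x_{ji} \qquad \text{(Formula 5)}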
Here, wji is the weight from node i to node j, η is a constant called the learning rate, δj is the error term of node j, and xji is the input passed from node i to node j. For example, the weight w84 is updated as follows:
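w_{84} \leftarrow w_{84} + \eta\, \delta_8\, a_4

(the input x_{84} that node 4 passes to node 8 is simply a_4, the output of node 4).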
Similarly, the update method of weight w41 is as follows:
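w_{41} \leftarrow w_{41} + \eta\, \delta_4\, x_1

where x_1 denotes the value supplied by input node 1.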
The input value associated with a bias term is always 1. For example, the bias weight w4b of node 4 is updated as follows:
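w_{4b} \leftarrow w_{4b} + \eta\, \delta_4 \cdot 1 = w_{4b} + \eta\, \delta_4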
To calculate the error term of a node, we must first calculate the error term of every node in the next layer connected to it. This means the error terms have to be computed starting from the output layer and then propagated backwards through each hidden layer in turn, until the hidden layer connected to the input layer is reached. This is where the name back propagation comes from. Once the error terms of all nodes have been calculated, all weights can be updated according to Formula 5.
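As an illustration, the rules above can be sketched in Python roughly as follows (this is not code from the original article; it assumes NumPy, a fully connected sigmoid network with 3 input nodes (1-3), 4 hidden nodes (4-7) and 2 output nodes (8, 9) matching the figure, bias inputs fixed at 1, and a learning rate of 0.5):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, t, W_hidden, W_output, eta=0.5):
    """One back propagation / SGD step for a fully connected sigmoid network.
    W_hidden: (4, 4) hidden weights (3 inputs + bias),
    W_output: (2, 5) output weights (4 hidden outputs + bias)."""
    # Forward pass: compute the output value of every node.
    x_b = np.append(x, 1.0)            # inputs plus the constant bias input 1
    a = sigmoid(W_hidden @ x_b)        # hidden outputs a4..a7
    a_b = np.append(a, 1.0)            # hidden outputs plus bias input 1
    y = sigmoid(W_output @ a_b)        # network outputs y1, y2

    # Error terms, computed backwards starting from the output layer.
    delta_out = y * (1.0 - y) * (t - y)                        # y_i(1-y_i)(t_i-y_i)
    delta_hid = a * (1.0 - a) * (W_output[:, :-1].T @ delta_out)
    # a_i(1-a_i) * sum_k w_ki * delta_k  (bias column excluded from the sum)

    # Weight updates (Formula 5): w_ji <- w_ji + eta * delta_j * x_ji
    W_output += eta * np.outer(delta_out, a_b)
    W_hidden += eta * np.outer(delta_hid, x_b)
    return y

# Example usage with small random initial weights and a single training sample.
rng = np.random.default_rng(0)
W_hidden = rng.uniform(-0.5, 0.5, size=(4, 4))
W_output = rng.uniform(-0.5, 0.5, size=(2, 5))
x = np.array([0.1, 0.2, 0.3])
t = np.array([1.0, 0.0])
for _ in range(100):
    y = train_step(x, t, W_hidden, W_output)

Each call performs one forward pass, computes the error terms backwards from the output layer, and then applies Formula 5 to every weight, including the bias weights.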
The above is the derivation process of the back propagation algorithm. The whole write-up is adapted from other authors' work; I hope it is helpful to everyone.