A simple explanation of Tensorflow principles and Python code examples
This article uses concrete examples to explain the principles behind Tensorflow and shows how to implement its core functionality with the simplest possible Python code.

1. Principles of Neural Networks

The neural network model is a typical supervised learning problem of the kind mentioned in the previous chapter: we have a set of inputs and their corresponding target outputs, and we need to find the optimal model, so that when new inputs arrive, the model produces predictions close to the truth.

Let's first look at how to implement such a simple neural network: find the parameters w1_0, w2_0, b1_0, b2_0 such that, for the given input x, the mean squared error (MSE) between the network output and the target output is minimized.

First, think about how many parameters there are. The network has two layers, with the structure shown below (picture source http://stackoverflow.com/questio ... tion-training-stuck): the input layer has 3 units, the middle layer has 4, and the output layer has 2. It therefore contains (3x4 + 4) + (4x2 + 2) = 26 parameters to be trained. We can initialize the parameters as follows:

# python
import numpy as np

w1_0 = np.array([[ 0.1, 0.2, 0.3, 0.4],
                 [ 0.5, 0.6, 0.7, 0.8],
                 [ 0.9, 1.0, 1.1, 1.2]])
w2_0 = np.array([[ 1.3, 1.4],
                 [ 1.5, 1.6],
                 [ 1.7, 1.8],
                 [ 1.9, 2.0]])
b1_0 = np.array([-2.0, -6.0, -1.0, -7.0])
b2_0 = np.array([-2.5, -5.0])

Then run forward propagation:

# python
x = [1, 2, 3]
y = [-0.85, 0.72]

o1  = np.dot(x, w1_0) + b1_0
os1 = np.power(1 + np.exp(o1 * -1), -1)   # sigmoid activation
o2  = np.dot(os1, w2_0) + b2_0
os2 = np.tanh(o2)                         # tanh activation

And then backpropagation:

# python
alpha = 0.1
grad_os2 = (y - os2) * (1 - np.power(os2, 2))
grad_os1 = np.dot(w2_0, grad_os2.T).T * (1 - os1) * os1
grad_w2 = ...
grad_b2 = ...
...                                       # gradients of the remaining parameters
w2_0 = w2_0 + alpha * grad_w2
b2_0 = b2_0 + alpha * grad_b2
...                                       # updates of the remaining parameters

Repeat this process many times until the error converges. During backpropagation you have to write down the derivative with respect to every parameter and then update the parameters according to those derivatives. I have not written all of them out here, because deriving them layer by layer is tedious; more importantly, whenever we want to train a new network structure, everything has to be re-derived, which is time-consuming.

If you think about it carefully, though, this derivation process is not without rules: during the backward pass, the gradient computed at one layer is used as the input to the gradient computation of the layer before it, just as in the forward pass each layer's output is the next layer's input. So we asked ourselves whether this process could be abstracted into a framework that computes gradients automatically. A whole series of deep learning frameworks, with Tensorflow as the best-known representative, were born out of exactly this idea.

2. Deep Learning Frameworks

What has been the most popular deep learning framework in recent years? Without a doubt, Tensorflow is the winner. In fact, though, these deep learning frameworks all share some common characteristics.
Gokula Krishnan Santhanam argues that most deep learning frameworks contain the following five core components:

- tensor objects;
- operations on tensor objects;
- a computational graph model;
- automatic differentiation tools;
- extension packages such as BLAS, cuBLAS and cuDNN.

Here, a tensor can be understood as an array of arbitrary dimension: a one-dimensional array is called a vector, a two-dimensional array is called a matrix, and both are tensors. With tensors come the corresponding basic operations, such as taking a particular row or column, or multiplying a tensor by a constant. Using the extension packages essentially means handing the low-level number crunching to specialized libraries to accelerate the computation.

Today we will focus on the computational graph model and automatic differentiation. We will first take the Torch framework as an example to see how automatic differentiation can be implemented, and then implement these two parts ourselves in the simplest possible way.

2.1 How deep learning frameworks implement automatic differentiation

There is plenty of material online for getting started with a deep learning framework such as Tensorflow: with a few lines of code and a few minutes of effort you can implement a simple task like handwritten digit recognition. Understanding the principles behind Tensorflow in depth, however, is not so easy. Here we briefly walk through this part.

When we get our data and train a neural network, all the parameters in the network are variables, and training the model is the process of finding a set of optimal variables that makes the predictions as accurate as possible. Concretely, the input data is propagated forward to produce a prediction, the error between the prediction and the ground truth is propagated backward to update the variables, and this is repeated many times until the optimal parameters are obtained.

Here we run into a question: with so many layers in a neural network, how do we make sure that both forward and backward propagation run correctly? It is worth noticing that both propagations have a pipeline character. The forward pass can be computed layer by layer, with the output of one layer used as the input of the next; the backward pass can use the chain rule to distribute the error, from back to front, onto every parameter. After abstracting this, we find that each module in a deep learning framework needs two functions, one wired into the forward direction and one wired into the backward direction. These two directions are like the Ren and Du meridians in martial arts novels: during training, data flows forward to produce predictions and the error flows backward to update the parameters, just like true qi circulating through the body's Ren and Du meridians. As the training error gradually decreases and converges, the deep neural network also "opens up" its Ren and Du meridians.
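To make this "two functions per module" idea concrete before reading any framework source, here is a schematic sketch of my own (it is not Torch's or Tensorflow's actual API): two tiny modules, each exposing a forward step and a backward step, are composed into exp(x^2), and running the backward steps in reverse order applies the chain rule automatically.

# python
import numpy as np

class Square:
    def forward(self, x):
        self.x = x                 # remember the input for the backward pass
        return x ** 2
    def backward(self, grad_out):  # d(x^2)/dx = 2x
        return grad_out * 2 * self.x

class Exp:
    def forward(self, x):
        self.out = np.exp(x)       # remember the output for the backward pass
        return self.out
    def backward(self, grad_out):  # d(e^x)/dx = e^x
        return grad_out * self.out

modules = [Square(), Exp()]        # the composition computes exp(x^2)
x = 1.5
h = x
for m in modules:                  # forward pass, front to back
    h = m.forward(h)
grad = 1.0
for m in reversed(modules):        # backward pass, back to front
    grad = m.backward(grad)
print(grad, 2 * x * np.exp(x ** 2))  # both values agree: 2x * exp(x^2)

Swapping modules in and out changes the function being differentiated without re-deriving anything by hand, which is exactly the property the frameworks exploit.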
Next, let's first examine how the source code of the Torch framework implements these two parts; after that, we will write the simplest possible deep learning framework directly in Python. Torch's nn project is used as the example because Torch's code layout is relatively simple. Tensorflow follows similar rules, but its file structure is more complicated; if you are interested, you can read the relevant sources carefully.

Almost every .lua file in the Torch nn module's Github source directory defines these two functions: updateOutput and updateGradInput. This is essentially declaring the two methods without writing their concrete implementations; the concrete code is written in C under the ./lib/THNN/generic directory. Take the Sigmoid function as an example. We know that the Sigmoid function has the form sigmoid(x) = 1 / (1 + exp(-x)), and its forward pass is implemented like this:

// c
void THNN_(Sigmoid_updateOutput)(
          THNNState *state,
          THTensor *input,
          THTensor *output)
{
  THTensor_(resizeAs)(output, input);
  TH_TENSOR_APPLY2(real, output, real, input,
    *output_data = 1. / (1. + exp(- *input_data));
  );
}
Taking the derivative of the Sigmoid function gives sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)), which, written in terms of the output z = sigmoid(x), is z * (1 - z). So the backward pass is implemented as:
// c
void THNN_(Sigmoid_updateGradInput)(
          THNNState *state,
          THTensor *input,
          THTensor *gradOutput,
          THTensor *gradInput,
          THTensor *output)
{
  THNN_CHECK_NELEMENT(input, gradOutput);
  THTensor_(resizeAs)(gradInput, output);
  TH_TENSOR_APPLY3(real, gradInput, real, gradOutput, real, output,
    real z = *output_data;
    *gradInput_data = *gradOutput_data * (1. - z) * z;
  );
}
You should notice one thing: in the updateOutput function, output_data is on the left of the equals sign and input_data is on the right, while in the updateGradInput function, gradInput_data is on the left and gradOutput_data is on the right. Here, output = f(input) corresponds to forward propagation, and input = f(output), i.e. the gradient flowing in the opposite direction, corresponds to backward propagation.
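As a quick sanity check of the formula used in Sigmoid_updateGradInput, here is a small numerical experiment (a sketch of my own, independent of the Torch code): the analytic derivative written in terms of the output, (1 - z) * z, matches a central-difference estimate of the derivative of sigmoid(x).

# python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3.0, 3.0, 7)
z = sigmoid(x)
analytic = (1.0 - z) * z                                     # derivative expressed via the output z
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)  # central difference
print(np.max(np.abs(analytic - numeric)))                    # tiny: the two agree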
2.2 Writing the simplest deep learning framework directly in Python
This part is a "build your own wheel" exercise, and it borrows from MiniFlow, a small project from Udacity.
Data structure part

First, we implement a parent class Node, and then, based on this parent class, we implement the Input, Linear, Sigmoid and other modules in turn. Simple Python class inheritance is used here, and each module overrides the forward and backward methods. The code is as follows:

# python
class Node(object):
    """
    Base class for nodes in the network.

    Arguments:
        `inbound_nodes`: A list of nodes with edges into this node.
    """
    def __init__(self, inbound_nodes=[]):
        """
        Node's constructor (runs when the object is instantiated).
        Sets properties that all nodes need.
        """
        # A list of nodes with edges into this node.
        self.inbound_nodes = inbound_nodes
        # The eventual value of this node. Set by running
        # the forward() method.
        self.value = None
        # A list of nodes that this node outputs to.
        self.outbound_nodes = []
        # New property! Keys are the inputs to this node and
        # their values are the partials of this node with
        # respect to that input.
        self.gradients = {}
        # Sets this node as an outbound node for all of
        # this node's inputs.
        for node in inbound_nodes:
            node.outbound_nodes.append(self)

    def forward(self):
        """
        Every node that uses this class as a base class will
        need to define its own `forward` method.
        """
        raise NotImplementedError

    def backward(self):
        """
        Every node that uses this class as a base class will
        need to define its own `backward` method.
        """
        raise NotImplementedError


class Input(Node):
    """
    A generic input into the network.
    """
    def __init__(self):
        Node.__init__(self)

    def forward(self):
        pass

    def backward(self):
        self.gradients = {self: 0}
        for n in self.outbound_nodes:
            self.gradients[self] += n.gradients[self]


class Linear(Node):
    """
    Represents a node that performs a linear transform.
    """
    def __init__(self, X, W, b):
        Node.__init__(self, [X, W, b])

    def forward(self):
        """
        Performs the math behind a linear transform.
        """
        X = self.inbound_nodes[0].value
        W = self.inbound_nodes[1].value
        b = self.inbound_nodes[2].value
        self.value = np.dot(X, W) + b

    def backward(self):
        """
        Calculates the gradient based on the output values.
        """
        self.gradients = {n: np.zeros_like(n.value) for n in self.inbound_nodes}
        for n in self.outbound_nodes:
            grad_cost = n.gradients[self]
            self.gradients[self.inbound_nodes[0]] += np.dot(grad_cost, self.inbound_nodes[1].value.T)
            self.gradients[self.inbound_nodes[1]] += np.dot(self.inbound_nodes[0].value.T, grad_cost)
            self.gradients[self.inbound_nodes[2]] += np.sum(grad_cost, axis=0, keepdims=False)


class Sigmoid(Node):
    """
    Represents a node that performs the sigmoid activation function.
    """
    def __init__(self, node):
        Node.__init__(self, [node])

    def _sigmoid(self, x):
        """
        This method is separate from `forward` because it
        will be used with `backward` as well.

        `x`: A numpy array-like object.
        """
        return 1. / (1. + np.exp(-x))

    def forward(self):
        """
        Perform the sigmoid function and set the value.
        """
        input_value = self.inbound_nodes[0].value
        self.value = self._sigmoid(input_value)

    def backward(self):
        """
        Calculates the gradient using the derivative of
        the sigmoid function.
        """
        self.gradients = {n: np.zeros_like(n.value) for n in self.inbound_nodes}
        for n in self.outbound_nodes:
            grad_cost = n.gradients[self]
            sigmoid = self.value
            self.gradients[self.inbound_nodes[0]] += sigmoid * (1 - sigmoid) * grad_cost


class Tanh(Node):
    def __init__(self, node):
        """
        The tanh activation function.
        """
        Node.__init__(self, [node])

    def forward(self):
        """
        Calculates the tanh.
        """
        input_value = self.inbound_nodes[0].value
        self.value = np.tanh(input_value)

    def backward(self):
        """
        Calculates the gradient using the derivative of tanh.
        """
        self.gradients = {n: np.zeros_like(n.value) for n in self.inbound_nodes}
        for n in self.outbound_nodes:
            grad_cost = n.gradients[self]
            tanh = self.value
            self.gradients[self.inbound_nodes[0]] += (1 + tanh) * (1 - tanh) * grad_cost.T


class MSE(Node):
    def __init__(self, y, a):
        """
        The mean squared error cost function.
        Should be used as the last node for a network.
        """
        Node.__init__(self, [y, a])

    def forward(self):
        """
        Calculates the mean squared error.
        """
        y = self.inbound_nodes[0].value.reshape(-1, 1)
        a = self.inbound_nodes[1].value.reshape(-1, 1)
        self.m = self.inbound_nodes[0].value.shape[0]
        self.diff = y - a
        self.value = np.mean(self.diff**2)

    def backward(self):
        """
        Calculates the gradient of the cost.
        """
        self.gradients[self.inbound_nodes[0]] = (2 / self.m) * self.diff
        self.gradients[self.inbound_nodes[1]] = (-2 / self.m) * self.diff
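Before moving on to scheduling, here is a small test of the classes above (a sketch of my own, not part of the original MiniFlow code): wire up Input -> Linear -> MSE by hand, run forward in topological order and backward in reverse order, and compare the backpropagated gradient of the cost with respect to W against a finite-difference estimate.

# python
import numpy as np

X_in, W_in, b_in, y_in = Input(), Input(), Input(), Input()
lin = Linear(X_in, W_in, b_in)
cost = MSE(y_in, lin)

X_in.value = np.array([[1.0, 2.0, 3.0]])      # shape (1, 3)
W_in.value = np.array([[0.1], [0.2], [0.3]])  # shape (3, 1)
b_in.value = np.array([0.5])                  # shape (1,)
y_in.value = np.array([[2.0]])                # shape (1, 1)

order = [X_in, W_in, b_in, y_in, lin, cost]   # a valid topological order
for node in order:
    node.forward()
for node in order[::-1]:
    node.backward()
analytic = W_in.gradients[W_in]               # dCost/dW from backpropagation

eps = 1e-6                                    # finite-difference check on the first weight
W_in.value[0, 0] += eps
for node in order:
    node.forward()
cost_plus = cost.value
W_in.value[0, 0] -= 2 * eps
for node in order:
    node.forward()
cost_minus = cost.value
print(analytic[0, 0], (cost_plus - cost_minus) / (2 * eps))  # the two numbers should match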
Scheduling algorithm and optimization part
The optimization part will be explained in detail later in this series; here I will only briefly discuss how the graph computation is scheduled. Each module of Tensorflow becomes a node in a directed acyclic graph, as shown in the figure, and during computation the modules depend on one another: for example, to compute module 1, modules 3 and 4 must be finished first, and before module 3 can be finished, modules 5 and 2 must be completed. Kahn's algorithm can therefore be used as the scheduling algorithm (the topological_sort function below): starting from the computation graph, it derives an execution order such as 5 -> 2 -> 3 -> 4 -> 1. (A small standalone illustration of this ordering follows the code below.)

# python
def topological_sort(feed_dict):
    """
    Sort the nodes in topological order using Kahn's Algorithm.

    `feed_dict`: A dictionary where the key is a `Input` Node and the value is the respective value feed to that Node.

    Returns a list of sorted nodes.
    """
    input_nodes = [n for n in feed_dict.keys()]

    G = {}
    nodes = [n for n in input_nodes]
    while len(nodes) > 0:
        n = nodes.pop(0)
        if n not in G:
            G[n] = {'in': set(), 'out': set()}
        for m in n.outbound_nodes:
            if m not in G:
                G[m] = {'in': set(), 'out': set()}
            G[n]['out'].add(m)
            G[m]['in'].add(n)
            nodes.append(m)

    L = []
    S = set(input_nodes)
    while len(S) > 0:
        n = S.pop()
        if isinstance(n, Input):
            n.value = feed_dict[n]
        L.append(n)
        for m in n.outbound_nodes:
            G[n]['out'].remove(m)
            G[m]['in'].remove(n)
            # add m once it has no remaining incoming edges
            if len(G[m]['in']) == 0:
                S.add(m)
    return L


def forward_and_backward(graph):
    """
    Performs a forward pass and a backward pass through a list of sorted Nodes.

    Arguments:
        `graph`: The result of calling `topological_sort`.
    """
    # Forward pass
    for n in graph:
        n.forward()
    # Backward pass
    for n in graph[::-1]:
        n.backward()


def sgd_update(trainables, learning_rate=1e-2):
    """
    Updates the value of each trainable with SGD.

    Arguments:
        `trainables`: A list of `Input` Nodes representing weights/biases.
        `learning_rate`: The learning rate.
    """
    for t in trainables:
        t.value = t.value - learning_rate * t.gradients[t]
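To illustrate the ordering described above on its own (a minimal sketch of my own; the module numbers and dependencies follow the description of the figure: modules 3 and 4 must finish before module 1, and modules 5 and 2 before module 3), Kahn's algorithm repeatedly emits nodes whose remaining in-degree is zero:

# python
from collections import deque

edges = {5: [3], 2: [3], 3: [1], 4: [1], 1: []}   # "a: [b]" means a must run before b
in_degree = {n: 0 for n in edges}
for n in edges:
    for m in edges[n]:
        in_degree[m] += 1

ready = deque(sorted(n for n in edges if in_degree[n] == 0))  # modules with no prerequisites
order = []
while ready:
    n = ready.popleft()
    order.append(n)
    for m in edges[n]:
        in_degree[m] -= 1
        if in_degree[m] == 0:                     # all prerequisites of m are done
            ready.append(m)

print(order)   # one valid order, e.g. [2, 4, 5, 3, 1]; 5 -> 2 -> 3 -> 4 -> 1 is equally valid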
Using the model
# python
import numpy as np
from sklearn.utils import resample
np.random.seed(0)
w1_0 = np.array([[ 0.1, 0.2, 0.3, 0.4],
[ 0.5, 0.6, 0.7, 0.8],
[ 0.9, 1.0, 1.1, 1.2]])
w2_0 = np.array([[ 1.3, 1.4],
[ 1.5, 1.6],
[ 1.7, 1.8],
[ 1.9, 2.0]])
b1_0 = np.array( [-2.0, -6.0, -1.0, -7.0])
b2_0 = np.array( [-2.5, -5.0])
X_ = np.array([[1.0, 2.0, 3.0]])
y_ = np.array([[-0.85, 0.75]])
n_features = X_.shape[1]
W1_ = w1_0
b1_ = b1_0
W2_ = w2_0
b2_ = b2_0
X, y = Input(), Input()
W1, b1 = Input(), Input()
W2, b2 = Input(), Input()
l1 = Linear(X, W1, b1)
s1 = Sigmoid(l1)
l2 = Linear(s1, W2, b2)
t1 = Tanh(l2)
cost = MSE(y, t1)
feed_dict = {
X: X_, y: y_,
W1: W1_, b1: b1_,
W2: W2_, b2: b2_
}
epochs = 10
m = X_.shape[0]
batch_size = 1
steps_per_epoch = m // batch_size
graph = topological_sort(feed_dict)
trainables = [W1, b1, W2, b2]
l_Mat_W1 = [w1_0]
l_Mat_W2 = [w2_0]
l_Mat_out = []
l_val = []
for i in range(epochs):
loss = 0
for j in range(steps_per_epoch):
X_batch, y_batch = resample(X_, y_, n_samples=batch_size)
X.value = X_batch
y.value = y_batch
forward_and_backward(graph)
sgd_update(trainables, 0.1)
loss += graph[-1].value
mat_W1 = []
mat_W2 = []
for i in graph:
try:
if (i.value.shape[0] == 3) and (i.value.shape[1] == 4):
mat_W1 = i.value
if (i.value.shape[0] == 4) and (i.value.shape[1] == 2):
mat_W2 = i.value
except:
pass
l_Mat_W1.append(mat_W1)
l_Mat_W2.append(mat_W2)
l_Mat_out.append(graph[9].value)
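As a quick check of the training result (an extra snippet of my own; the original article goes straight to visualization), a forward-only pass over the sorted graph prints the final prediction next to the target:

# python
for n in graph:
    n.forward()
print("prediction:", t1.value, "target:", y_)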
Let's take a look at the result. Of course, there are also more advanced visualization methods: https://jizhi.im/blog/post/v_nn_learn
# python
import matplotlib.pyplot as plt
%matplotlib inline
fig = plt.figure( figsize=(14,10))
ax0 = fig.add_subplot(131)
# ax0 = fig.add_axes([0, 0, 0.3, 0.1])
c0 = ax0.imshow(np.array(l_Mat_out).reshape([-1, 2]).T, interpolation='nearest', aspect='auto', cmap="Reds", vmax=1, vmin=-1)
ax0.set_title("Output")
cbar = fig.colorbar(c0, ticks=[-1, 0, 1])
ax1 = fig.add_subplot(132)
c1 = ax1.imshow(np.array(l_Mat_W1).reshape(len(l_Mat_W1), 12).T, interpolation='nearest',aspect='auto', cmap="Reds")
ax1.set_title("w1")
cbar = fig.colorbar(c1, ticks=[np.min(np.array(l_Mat_W1)), np.max(np.array(l_Mat_W1))])
ax2 = fig.add_subplot(133)
c2 = ax2.imshow(np.array(l_Mat_W2).reshape(len(l_Mat_W2), 8).T, interpolation='nearest',aspect='auto', cmap="Reds")
ax2.set_title("w2")
cbar = fig.colorbar(c2, ticks=[np.min(np.array(l_Mat_W2)), np.max(np.array(l_Mat_W2))])
ax0.set_yticks([0,1])
ax0.set_yticklabels(["out0", "out1"])
ax1.set_xlabel("epochs")#for i in range(len(l_Mat_W1)):
We can see that, as the number of training epochs increases, the Output values move from the initial [0.72, -0.88] steadily closer to y = [-0.85, 0.72]. The reason behind this is that the model parameters keep changing and updating away from their initialized values, as shown by the two matrices w1 and w2 in the figure.
With that, the simplest possible wheel has been built: our wheel implements the Input, Linear, Sigmoid, Tanh and MSE modules.
Reprinted from Lei Ke (雷课).