Author/sharer: Li Jiaxuan, author of "TensorFlow Technical Analysis and Practice", speaker at InfoQ, 51CTO, O'Reilly Strata, and other conferences, active in major Chinese technology communities, and an answerer of programming questions on Zhihu. He focuses on the architecture and source code of deep learning frameworks and their application in different fields, and has hands-on deep learning experience in image processing, sentiment analysis of social text data, and data mining. He has participated in a Hackathon on a deep-learning-based two-dimensional perception system for autonomous driving, previously worked as an R&D engineer at Baidu, and is now studying NLP, chatbots, and the performance optimization and FPGA compilation of TensorFlow.
In the first part, I will first explain how to read a machine learning paper from scratch and how to deal with mathematical problems in the paper. Then, starting from a classic paper, I will explain how to quickly sort out and understand a deep learning framework and model.
A great many papers on artificial intelligence and machine learning have been published recently. So how should an engineer with an engineering background and little or only some academic experience read an AI-related paper?
At the beginning of my own academic exploration, I tended to read every article carefully from start to finish, especially the classic papers in deep learning, but I found this took far too much time and crowded out my real goals: engineering implementation and integration. Moreover, because I tried to absorb too much at once, I failed to grasp the core of each article, which made everything easy to forget; a paper I had read yesterday would slip away as easily as water I had drunk.
I will discuss this with you from two aspects.
1. Starting from scratch: the levels of reading a paper
Starting from scratch here means understanding, from nothing, what the article did, what methods it used, what results it obtained, and whether those methods and results offer anything you can use.
It does not mean that when you encounter a completely new field you should start with the papers. For unfamiliar fields I have never touched, my method is to read Chinese-language reviews first, then Chinese doctoral dissertations, and then English reviews. The Chinese reviews give you the basic terminology and common experimental methods of the field. If you start directly from a paper instead, the gap between the author's level and yours makes it easy to jump to conclusions or simply give up reading. So before reading an article, make sure you have a thorough grasp of the basic knowledge it involves.
With that groundwork in place, we return to understanding the article from scratch. Reading an article typically proceeds through three progressively deeper levels:
Level 1: Understand the summary of the article (5-10 minutes)
- Read the title, abstract, and introduction carefully.
- Read only the section and sub-section headings and skip the specific contents.
- Understand the conclusion and discussion (the author usually discusses the shortcomings and limitations of the study here, offers suggestions for future research, and points out directions).
- Browse the references and take note of the papers you have already read.
Therefore, after the first level, you should be able to answer the following 5 questions:
- Category: Is it an article about an implementation method? An analysis of an existing system? A description of a research theory?
- Context: Which related papers does it connect to? What theoretical bases does it build on?
- Correctness: Do the article's assumptions appear to be valid?
- Contributions: Does the article make significant progress in effectiveness (state of the art)? Is it innovative in method? Or does it improve the basic theory?
- Clarity: Is the article clearly written?
After completing the first level, you can decide whether to go deeper into the second. It is perfectly fine to simply bank the knowledge for the day you need it, rather than diving in immediately.
Level 2: Grasp the content of the article and ignore the details (1 hour)
The second level needs to be read carefully and grasp the key points:
- Understand the meaning of the graphs and tables and the conclusions they support.
- Note any unread literature in the references that you think is important and will give you a deeper understanding of the context of the article.
To complete the second level, you need to know what evidence the article uses and how it proves a certain conclusion.
Especially at this level, if you encounter something you cannot understand (there are many possible reasons: too many formulas, unfamiliar terminology, unfamiliar experimental methods, too many references), it means you and the author are not yet on the same wavelength. It is recommended to start with a few of the important references to fill in the background knowledge.
Level 3: In-depth understanding of the text (5-6 hours)
If you want to apply this article to your current project, you need level 3. The goal is to be able to re-implement the paper under the same assumptions.
At the same time, pay attention to the corresponding code of the paper on GitHub. Jumping into the program can speed up your understanding.
By comparing your reproduced results with the original paper, you can truly understand the innovation of an article and its implicit premise or hypothesis. And you can get some directions for your future work from the reproduction process.
The benefit of this three-level approach is that it lets you estimate in advance how long reading an article will take, and even adjust how deeply you engage with an article based on your available time and your work needs.
2. How to read a mathematics-heavy machine learning paper
Heavy mathematics is very common in AI papers, so generally speaking, at the first level you do not need to understand all the steps of a formula. Try to skip the formulas, read the textual description, read the experimental results, and read the conclusions.
As you accumulate mathematics knowledge in your daily work, at the second level, you may be able to truly understand the author's purpose and steps directly by looking at the formulas.
If you really want to go to the third level, you may need to follow the article to do some derivation. But in fact, if there is a ready-made code implementation, you can better understand the mathematical process from an engineering perspective.
Finally, I suggest that, according to your needs, you try reading the 128 papers below in the areas that interest you, adjusting the reading level for each. (I have just finished going through them; you are welcome to discuss them with me.)
128 papers: machine learning papers across 21 major fields
The following two points are explained in conjunction with the TensorFlow architecture and system design paper "TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems":
TensorFlow programming model and basic concepts
The static graph model can be explained in fewer than 20 lines of code.
TensorFlow operates in four steps:
- Load data and define hyperparameters;
- Build the network;
- Train the model;
- Evaluate the model and make predictions.
Let's take a neural network as an example to explain how TensorFlow works. In this example, we construct raw data satisfying the quadratic function y = ax² + b, then build the simplest possible neural network, containing only an input layer, a hidden layer, and an output layer. We use TensorFlow to learn the weights and biases of the hidden and output layers, and watch whether the loss value keeps decreasing as the number of training iterations increases.
Generating and loading data
First, let's generate the input data. We assume the final equation to be learned is y = x² − 0.5. We construct a series of x and y values that satisfy this equation and add some noise points that do not.
import tensorflow as tf
import numpy as np
# Fabricate data satisfying a quadratic equation in one variable
x_data = np.linspace(-1, 1, 300)[:, np.newaxis] # to make the points denser, build 300 evenly spaced points in [-1, 1] with np.linspace, reshaping the 1-D array of 300 points into a 300×1 2-D array
noise = np.random.normal(0, 0.05, x_data.shape) # add noise points with the same shape as x_data, drawn from a normal distribution with mean 0 and standard deviation 0.05
y_data = np.square(x_data) - 0.5 + noise # y = x^2 - 0.5 + noise
Next, define placeholders for x and y as the variables that will be fed into the neural network:
xs = tf.placeholder(tf.float32, [None, 1])
ys = tf.placeholder(tf.float32, [None, 1])
Building a network model
Here we need to build a hidden layer and an output layer. A layer in the neural network takes four parameters: the input data, the input dimension, the output dimension, and the activation function. Each layer applies a vectorized linear transformation (y = weights × x + biases), passes the result through the nonlinear activation function, and produces the output data.
Let's define the hidden layer and output layer. The sample code is as follows:
def add_layer(inputs, in_size, out_size, activation_function=None):
    # build the weights: an in_size × out_size matrix
    weights = tf.Variable(tf.random_normal([in_size, out_size]))
    # build the biases: a 1 × out_size matrix
    biases = tf.Variable(tf.zeros([1, out_size]) + 0.1)
    # matrix multiplication plus biases
    Wx_plus_b = tf.matmul(inputs, weights) + biases
    if activation_function is None:
        outputs = Wx_plus_b
    else:
        outputs = activation_function(Wx_plus_b)
    return outputs # return the output data

# build the hidden layer; here it has 20 neurons
h1 = add_layer(xs, 1, 20, activation_function=tf.nn.relu)
# build the output layer; like the input layer, it has 1 neuron
prediction = add_layer(h1, 20, 1, activation_function=None)
Next, we construct the loss function: compute the error between the predicted and true values of the output layer by summing the squares of their differences and taking the mean. We then use gradient descent with a learning rate of 0.1 to minimize the loss:
# compute the error between the predicted values and the true values
loss = tf.reduce_mean(tf.reduce_sum(tf.square(ys - prediction),
                                    reduction_indices=[1]))
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
Training the model
We let TensorFlow train for 1,000 iterations and print the training loss every 50 iterations:
init = tf.global_variables_initializer() # initialize all variables
sess = tf.Session()
sess.run(init)
for i in range(1000): # train for 1000 iterations
    sess.run(train_step, feed_dict={xs: x_data, ys: y_data})
    if i % 50 == 0: # print the loss value every 50 iterations
        print(sess.run(loss, feed_dict={xs: x_data, ys: y_data}))
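The fourth step listed earlier, evaluating the model and making predictions, is not shown in the original walkthrough. A minimal sketch, reusing the sess, prediction, xs, and loss defined above (x_test is a hypothetical held-out input, not part of the original code):

x_test = np.linspace(-1, 1, 50)[:, np.newaxis] # hypothetical held-out inputs
y_pred = sess.run(prediction, feed_dict={xs: x_test}) # predictions for new x values
print(sess.run(loss, feed_dict={xs: x_data, ys: y_data})) # final training loss, which should be far below the first value printed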
Basic implementation of TensorFlow
This covers devices, the distributed execution mechanism, cross-device communication, and gradient computation.
TensorFlow has two distributed modes, data parallelism and model parallelism. The most commonly used one is data parallelism. The principle of data parallelism is very simple, as shown in the figure. The CPU is mainly responsible for gradient averaging and parameter updates, while GPU1 and GPU2 are mainly responsible for training model replicas. They are called "model replicas" here because they are all trained based on a subset of training samples, and there is a certain degree of independence between the models.
The specific training steps are as follows.
1. Define the model network structure on GPU1 and GPU2 respectively.
2. On each GPU, read a different block of data from the data pipeline, run forward propagation to compute the loss, and then compute the gradients of the current variables.
3. Transfer the gradients produced by all GPUs to the CPU, average them, and then update the model variables.
4. Repeat steps 1 to 3 until the model variables converge.
The purpose of data parallelism is to improve the efficiency of SGD. For example, if the size of each SGD mini-batch is 1,000 samples, then if it is cut into 10 parts, each with 100 samples, and then the model is copied 10 times, calculations can be performed on 10 models at the same time.
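A minimal sketch of this CPU-averages-GPU-gradients pattern, under the assumption of two GPUs and a toy linear model (these names are illustrative placeholders, not the paper's code):

import tensorflow as tf

def tower_loss(x, y, w, b):
    # one model replica: a tiny linear model sharing the variables w and b
    return tf.reduce_mean(tf.square(tf.matmul(x, w) + b - y))

opt = tf.train.GradientDescentOptimizer(0.1)
w = tf.Variable(tf.zeros([1, 1]))
b = tf.Variable(tf.zeros([1]))
tower_grads = []
for i in range(2): # one tower per GPU
    with tf.device("/gpu:%d" % i):
        x = tf.placeholder(tf.float32, [None, 1]) # this tower's data shard
        y = tf.placeholder(tf.float32, [None, 1])
        tower_grads.append(opt.compute_gradients(tower_loss(x, y, w, b), [w, b]))
with tf.device("/cpu:0"):
    # average each variable's gradients across towers, then apply one update
    avg_grads = [(tf.reduce_mean(tf.stack([g for g, _ in gv]), 0), gv[0][1])
                 for gv in zip(*tower_grads)]
    train_step = opt.apply_gradients(avg_grads)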
However, the 10 model replicas may compute at different speeds, some fast and some slow. When the CPU updates the variables, should it wait for the entire mini-batch to finish and then update once with the summed average, or should it apply whichever gradients finish first and let the later ones overwrite them?
This brings up the question of synchronous and asynchronous updates.
Distributed stochastic gradient descent means that model parameters can be distributed and stored on different parameter servers, and working nodes can train data in parallel and communicate with parameter servers to obtain model parameters. There are two ways to update parameters: synchronous and asynchronous, namely asynchronous stochastic gradient descent (Async-SGD) and synchronous stochastic gradient descent (Sync-SGD). As shown in the figure:
Synchronous stochastic gradient descent (also called synchronous updating or synchronous training) means that during training, the tasks on each worker node read the shared parameters and compute gradients in parallel. Synchronization requires waiting until all worker nodes have computed their local gradients; the gradients are then merged and accumulated, and the model parameters are updated in a single step. In the next batch, every worker node fetches the updated model parameters and trains again.
The advantage of this scheme is that each training batch takes every worker node's computation into account, so the loss decreases relatively steadily; the disadvantage is that the performance bottleneck is the slowest worker node. On heterogeneous devices, where worker performance often differs, this disadvantage is very pronounced.
Asynchronous stochastic gradient descent (also known as asynchronous update or asynchronous training) means that the tasks on each working node independently calculate the local gradient and asynchronously update it to the model parameters without the need for coordination or waiting.
The advantage of this scheme is that there is no such performance bottleneck; the disadvantage is that when the gradients computed by individual worker nodes are sent back to the parameter server, parameter updates conflict with one another, which affects the algorithm's convergence speed to some extent and produces noticeable jitter as the loss decreases.
How to choose synchronous update or asynchronous update? Is there any optimization method?
The difference between synchronous and asynchronous updating lies in the strategy for updating the parameters on the parameter server. When the data volume is small and the nodes' computing power is fairly balanced, the synchronous mode is recommended; when the data volume is large and the machines' computing performance is uneven, the asynchronous mode is recommended. Which works better can also be judged from experimental results: in general, asynchronous updating performs better when the data volume is large enough.
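For reference, TensorFlow ships a built-in wrapper, tf.train.SyncReplicasOptimizer, for the synchronous mode. A hedged, self-contained sketch with a toy linear model (the model and num_workers here are placeholders, not code from the paper):

import tensorflow as tf

num_workers = 2 # placeholder: the number of worker replicas
x = tf.placeholder(tf.float32, [None, 1])
y = tf.placeholder(tf.float32, [None, 1])
w = tf.Variable(tf.zeros([1, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))
global_step = tf.Variable(0, trainable=False, name="global_step")

opt = tf.train.GradientDescentOptimizer(0.1)
# wait for num_workers gradients, aggregate them, then apply one update
sync_opt = tf.train.SyncReplicasOptimizer(
    opt, replicas_to_aggregate=num_workers, total_num_replicas=num_workers)
train_step = sync_opt.minimize(loss, global_step=global_step)
# Without the wrapper, each worker calling opt.minimize(loss) independently
# on the shared parameters gives the asynchronous mode.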
The following shows how to create a TensorFlow server cluster and how to distribute the computation of a static graph across the cluster.
All nodes in a TensorFlow distributed cluster execute the same code. Distributed task code has a fixed structure:
# Step 1: parse the command-line arguments to get the cluster information,
# ps_hosts and worker_hosts, and the current node's role, job_name and
# task_index. For example:
tf.app.flags.DEFINE_string("ps_hosts", "", "Comma-separated list of hostname:port pairs")
tf.app.flags.DEFINE_string("worker_hosts", "", "Comma-separated list of hostname:port pairs")
tf.app.flags.DEFINE_string("job_name", "", "One of 'ps', 'worker'")
tf.app.flags.DEFINE_integer("task_index", 0, "Index of task within the job")
FLAGS = tf.app.flags.FLAGS
ps_hosts = FLAGS.ps_hosts.split(",")
worker_hosts = FLAGS.worker_hosts.split(",")
# Step 2: create the server for the current task node
cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)
# Step 3: if the current node is a parameter server, call server.join() to
# wait indefinitely; if it is a worker node, continue with step 4
if FLAGS.job_name == "ps":
    server.join()
# Step 4: build the model to be trained, i.e. the computational graph
elif FLAGS.job_name == "worker":
    # build tensorflow graph model
    # Step 5: create a tf.train.Supervisor to manage the training process
    sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0), logdir="/tmp/train_logs")
    # the supervisor handles session initialization and restoring from checkpoints
    sess = sv.prepare_or_wait_for_session(server.target)
    # loop until the supervisor stops
    while not sv.should_stop():
        pass # train the model here
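Assuming this skeleton is saved as trainer.py, a hypothetical cluster with one parameter server and two workers could be launched as follows (the hostnames and ports are illustrative; the flag names match those defined in step 1):

python trainer.py --ps_hosts=host0:2222 --worker_hosts=host1:2222,host2:2222 --job_name=ps --task_index=0
python trainer.py --ps_hosts=host0:2222 --worker_hosts=host1:2222,host2:2222 --job_name=worker --task_index=0
python trainer.py --ps_hosts=host0:2222 --worker_hosts=host1:2222,host2:2222 --job_name=worker --task_index=1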
For the above code framework applied to distributed training on the MNIST dataset, see:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/dist_test/python/mnist_replica.py#L1
The second part covers how to translate your own requirements into the kind of problem a paper describes, and then implement it.
Take the recommendation system as an example:
Reference article: https://www.textkernel.com/building-large-knowledge-graph-recruitment-domain/
We take the construction of the knowledge base in a recruitment recommendation system as an example to explain where and how to introduce NLP and knowledge graphs.
(1) Why build a knowledge base? The following is a search based on a knowledge base:
That is, we hope to structure the corresponding job description into a knowledge graph:
We know that knowledge graphs include entities and entity relationships. In the context of recruitment, the entity database should include: position database, career database, resume database, and entity vocabulary database. Entity relationships may include attribution relationships, hierarchical relationships, and association relationships.
Let's do a structured extraction of the job description and design a labeling system for entity relationships, as follows:
How to extract it specifically?
- Find the anchor words and punctuation marks and split the description into short sentences. For example, a job posting splits into fields such as:
  Content: (salesman/apprentice);
  Responsibilities: bar counter; follow the master to mix drinks and prepare fruit plates and snacks;
  Compensation: regular employees receive a basic salary of 3,000-3,500 yuan/month + bonus + five insurances and one housing fund, and the company provides food and accommodation;
  Work location: the company arranges the nearest work location based on the employee's residence.
- Locate the core content of each short sentence based on characteristic words/phrases:
  (salesman/apprentice)
  bar counter; follow the master to prepare drinks and cut and arrange fruit plates and snacks
  regular-employee basic salary of 3,000-3,500 yuan/month + bonus + five insurances and one housing fund
  the company arranges the nearest work location based on the employee's residence
- Extract the core words:
  salesman, apprentice
  bar counter, drink preparation, fruit plates and snacks
  basic salary 3,000-3,500 yuan/month, bonus, five insurances and one housing fund
  nearest work location
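A toy sketch of the first step, under the assumption that anchor words such as 负责 and 待遇 mark the start of each field (the anchor list and input string are hypothetical examples, not a production extractor):

import re

ANCHORS = ["内容", "负责", "待遇", "工作地点"] # hypothetical anchor words

def split_fields(text):
    # split the posting into fields at anchor words, then split each field
    # into short sentences at punctuation marks
    parts = re.split("(" + "|".join(ANCHORS) + ")[::]", text)
    return {label: [s for s in re.split("[,,;;。]", value) if s]
            for label, value in zip(parts[1::2], parts[2::2])}

print(split_fields("负责:吧台,跟师傅调制饮品;待遇:底薪3000-3500元/月+奖金"))
# {'负责': ['吧台', '跟师傅调制饮品'], '待遇': ['底薪3000-3500元/月+奖金']}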
So how do you find the anchor words in this process? Generally there are three approaches:
(1) Bootstrapping: anchor word -> seed word -> anchor word. For example:
(2) Based on part-of-speech tagging. For example: segment the text, run part-of-speech tagging over it, and select the verbs, numerals, and measure words as anchor words.
(3) Based on grammar. For example:
- nouns and abbreviations that follow a verb;
- phrase combinations that co-occur frequently, such as verb + adjective or verb + adverb.
For details about part-of-speech tagging, see the Chinese part-of-speech tagging set:
https://gist.github.com/luw2007/6016931
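A hedged sketch of approach (2), assuming the jieba library for Chinese segmentation and part-of-speech tagging (the sample sentence echoes the posting above):

import jieba.posseg as pseg

def candidate_anchors(sentence):
    # keep verbs (v), numerals (m), and measure words (q) as anchor candidates
    return [w.word for w in pseg.cut(sentence)
            if w.flag.startswith(("v", "m", "q"))]

print(candidate_anchors("正式员工底薪3000-3500元/月+奖金"))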
Finally, a recruitment knowledge base was established:
Finally, I hope everyone will read more papers, summarize and review what they have read, and get more hands-on practice with TensorFlow alongside the open-source implementations on GitHub. Accumulating enough papers in a field reveals many of its open problems and opportunities.