What is the ultimate form of a deep learning engine?
First, let us imagine what an ideal deep learning engine should look like, that is, what the ultimate form of a deep learning engine is, and see what inspiration this brings to the development of deep learning frameworks and AI-specific chips.
Note: The speaker believes the essence of this talk lies in Parts 1 and 5. Parts 2 and 3 have already been shared at GIAC (Global Internet Architecture Conference) and AICon (Global Artificial Intelligence Technology Conference), so this article only annotates the new content; readers interested in Parts 2 and 3 can find the earlier posts on this public account.
Note: Taking the well-known convolutional neural network (CNN) as an example, we can get a feel for how much computing power training a deep learning model currently requires. The table lists the memory footprint and the number of floating-point operations that common CNN models need to process one image; for example, the VGG-16 network requires about 16 GFLOPs per image. It is worth noting that training a CNN on the ImageNet dataset, which contains roughly 1.2 million images, requires the training algorithm to scan the dataset about 100 times (100 epochs), which amounts to on the order of 10^18 floating-point operations, i.e., roughly an exaFLOP of total work. A simple calculation shows that training such a model on a single CPU core with a 2.0 GHz clock would take several years.
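As a sanity check on these numbers, here is a minimal back-of-envelope sketch. The per-image cost, dataset size, and epoch count are taken from the text above; the sustained CPU throughput is an illustrative assumption, not a figure from the slide.

```python
# Back-of-envelope estimate of the training cost quoted above.
# The per-image cost, dataset size, and epoch count come from the text;
# the CPU throughput figure is an illustrative assumption.

flops_per_image = 16e9        # VGG-16: ~16 GFLOPs per image (from the table)
images = 1.2e6                # ImageNet: ~1.2 million images
epochs = 100                  # the text assumes ~100 passes over the dataset

total_flops = flops_per_image * images * epochs
print(f"total work: {total_flops:.1e} FLOPs")           # ~1.9e18, i.e. ~2 exaFLOPs

cpu_flops_per_sec = 4e9       # assumed sustained throughput of one 2 GHz core (a few GFLOPS)
seconds = total_flops / cpu_flops_per_sec
print(f"single CPU core: {seconds / 3.15e7:.1f} years")  # on the order of years at this assumed rate
```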
Note: The figure above lists the most commonly used computing devices, including the CPU, GPU, and TPU. As everyone knows, the GPU is now the most widely used computing device in deep learning. The TPU is said to be more powerful than the GPU, but currently only Google can use it. We can discuss why CPU < GPU < TPU, and whether there are hardware devices even more powerful than the TPU. A single CPU core with a 2 GHz clock can only execute instructions serially, performing at most a few billion operations per second. As Moore's Law comes to an end, people have raised computing power by integrating more cores on one CPU; for example, putting 20 cores on a chip (so-called multi-core) can increase a CPU's computing power by dozens of times. GPUs go a step further with many-core designs, integrating thousands of computing cores on a single chip. Although each core runs at a lower clock than a CPU core (usually below 1 GHz), the parallelism is hundreds of times higher and the memory bandwidth is more than ten times that of a CPU, so the throughput on dense computation can reach ten or even a hundred times that of a CPU. GPUs are criticized for their high power consumption; to address this, dedicated AI chips such as the TPU have emerged. A dedicated chip can pack more of the computing units that deep learning needs into the same area, and can even implement certain specific operations with dedicated circuits, so the same computation takes less time (see https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu for the secrets of the TPU). Are there dedicated chips faster than the TPU? Of course. In the extreme case, a chip built for one specific neural network, regardless of cost, must be far more efficient than a chip like the TPU that has to support the most common classes of neural networks.
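To put rough numbers on the CPU < GPU comparison, here is an illustrative sketch. The core counts, clock rates, and operations-per-cycle figures are assumptions chosen for the back-of-envelope, not measurements from the slide.

```python
# Illustrative (assumed) peak-throughput estimates; the specific core counts,
# clock rates, and ops-per-cycle figures are rough assumptions, not measurements.

def peak_gflops(cores, ghz, flops_per_cycle_per_core):
    return cores * ghz * flops_per_cycle_per_core

single_core_cpu = peak_gflops(cores=1,    ghz=2.0, flops_per_cycle_per_core=4)   # scalar with FMA
multi_core_cpu  = peak_gflops(cores=20,   ghz=2.0, flops_per_cycle_per_core=16)  # 20 cores with SIMD
gpu             = peak_gflops(cores=3584, ghz=1.0, flops_per_cycle_per_core=2)   # thousands of simple cores

print(f"single-core CPU: {single_core_cpu:8.0f} GFLOPS")
print(f"20-core CPU:     {multi_core_cpu:8.0f} GFLOPS")
print(f"GPU:             {gpu:8.0f} GFLOPS  (~{gpu / multi_core_cpu:.0f}x the multi-core CPU)")
```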
Note: There are many reasons why dedicated hardware is faster than general-purpose hardware (such as CPUs and GPUs), mainly: (1) a general-purpose chip completes each operation by going through "fetch, decode, execute" (and even "fetch data"), whereas dedicated hardware greatly reduces the fetch-and-decode overhead and starts executing as soon as the data arrives; (2) the control circuitry of dedicated hardware is simpler, so more arithmetic units can be packed into the same area, and work that takes general-purpose hardware thousands or tens of thousands of clock cycles can finish in one cycle; (3) both dedicated and general-purpose hardware support pipeline parallelism, keeping hardware utilization high; (4) dedicated hardware has high on-chip bandwidth, and most data is moved on chip. Clearly, if physical reality is ignored, then for any neural network at any problem scale, building a set of dedicated hardware for it is the most efficient approach. The question is: does this work?
Note: If dedicated hardware is built for a particular neural network, execution efficiency is maximal, but development efficiency is not. Once the requirements change (network topology, number of layers, number of neurons), the circuit must be redesigned, and hardware development cycles are notoriously long. This is reminiscent of electronic computers before von Neumann introduced the "stored program" idea (the picture above shows ENIAC, the first electronic computer): the machine's function was realized with hard-wired circuits, and changing that function meant physically rewiring the connections between devices, a way of "programming" that was slow and hard to debug.
Note: The infinitely large dedicated hardware we just imagined obviously faces several practical problems: (1) a chip cannot be infinitely large; the limits of the manufacturing process (heat dissipation, the propagation range of the clock signal, and so on) must be respected; (2) hard-wired circuits are inflexible, and changing their function requires rewiring; (3) after rewiring, the pipeline scheduling mechanism may need to be adjusted accordingly to keep hardware utilization at its maximum. So the "infinitely large", cost-no-object dedicated hardware we imagined faces severe challenges. How can they be overcome?
Note: In reality, both general-purpose hardware (such as GPUs) and dedicated hardware (such as TPUs) can be connected by high-speed interconnects, with software coordinating multiple devices to complete large-scale computations. With the most advanced interconnect technology, the bandwidth between devices can reach 100 Gbps or more, which is still one to two orders of magnitude lower than a device's internal bandwidth. Fortunately, if the software "deploys" the work properly, the hardware may still be kept saturated under these bandwidth conditions. Of course, the technical challenge of "proper deployment" is enormous, and in fact, the faster a single device is, the harder it becomes to deploy multiple devices properly.
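To make the "one to two orders of magnitude" gap concrete, here is a small sketch with illustrative numbers; the on-device memory bandwidth figure is an assumed example value, not a number from the slide.

```python
# Illustrative comparison of inter-device vs. intra-device bandwidth.
# The on-device memory bandwidth below is an assumed example value for a modern GPU.

interconnect_gbps = 100                      # link bandwidth quoted in the text, in gigabits/s
interconnect_gbytes = interconnect_gbps / 8  # = 12.5 GB/s

device_memory_gbytes = 900                   # assumed on-device memory bandwidth, GB/s

print(f"interconnect:  {interconnect_gbytes:.1f} GB/s")
print(f"device memory: {device_memory_gbytes:.0f} GB/s "
      f"(~{device_memory_gbytes / interconnect_gbytes:.0f}x higher)")
```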
Note: Currently, the stochastic gradient descent algorithm (SGD) is widely used in deep learning, and a GPU generally needs only about 100 milliseconds to process one small mini-batch of data. The key question, then, is whether the "deployment" mechanism can have the next mini-batch ready for the GPU within those 100 milliseconds. If it can, the GPU stays busy computing; if not, the GPU stalls intermittently and device utilization drops. In principle this is possible. There is a concept called arithmetic intensity, i.e., FLOPs per byte, the amount of computation performed on each byte of data. As long as this ratio is large enough, each transferred byte carries enough computation with it, so even though the inter-device bandwidth is lower than the device's internal bandwidth, the device can still be kept fully loaded. Furthermore, with a device faster than a GPU, processing one mini-batch may take less than 100 milliseconds, say 10 milliseconds; under the same bandwidth conditions, can the "deployment" mechanism still prepare the next computation within 10 milliseconds? In fact, even with GPUs, which are not that fast compared with dedicated chips such as the TPU, today's mainstream deep learning frameworks are already unable to cope with certain scenarios (such as model parallelism).
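The following minimal feasibility check works through this arithmetic with assumed example numbers (the 10 TFLOPS device throughput is an assumption; the 100 Gbps link and 100 ms batch time come from the text).

```python
# A minimal feasibility check, using assumed example numbers: can the next
# mini-batch be delivered over the interconnect before the device finishes
# the current one?

device_flops = 10e12        # assumed device throughput: 10 TFLOPS
link_bytes_per_s = 12.5e9   # 100 Gbps interconnect = 12.5 GB/s
compute_time = 0.1          # the text's example: ~100 ms per mini-batch

# Arithmetic intensity needed to keep the device busy: FLOPs consumed per byte received.
required_intensity = device_flops / link_bytes_per_s
print(f"required arithmetic intensity: {required_intensity:.0f} FLOPs/byte")  # 800

# Equivalently: the next batch must fit through the link within compute_time.
max_batch_bytes = link_bytes_per_s * compute_time
print(f"next batch must be <= {max_batch_bytes / 1e9:.2f} GB to arrive in {compute_time*1000:.0f} ms")

# If a faster device cuts compute_time to 10 ms, the budget shrinks tenfold.
print(f"at 10 ms per batch: <= {link_bytes_per_s * 0.01 / 1e9:.3f} GB")
```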
Note: A general deep learning framework must be able to "deploy" the hardware most efficiently for any given neural network and any set of available resources. This requires solving three core problems: (1) resource allocation, covering three kinds of resources (computing cores, memory, and transmission bandwidth) while jointly considering locality and load balancing; (2) generating correct data routing (the equivalent of the wiring problem between the pieces of dedicated hardware imagined above); (3) an efficient runtime mechanism that coordinates data movement and computation perfectly so that hardware utilization is maximized (a toy illustration of this overlap appears below). These three problems are all very challenging, and this article will not discuss their solutions. But suppose we could solve them: what would we gain?
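As a toy illustration of point (3), here is a minimal sketch, not how OneFlow or any existing framework actually implements it, showing how a prefetch thread can overlap data movement with computation so the "device" rarely waits. All timings are made-up constants.

```python
import threading
import queue
import time

# A minimal sketch: a prefetch thread keeps transferring the next mini-batch
# while the compute loop consumes the current one, so the "device" never waits
# as long as transfer_time <= compute_time.

TRANSFER_TIME = 0.05   # hypothetical seconds to move one mini-batch over the interconnect
COMPUTE_TIME = 0.10    # hypothetical seconds for the device to process one mini-batch
NUM_BATCHES = 5

prefetched = queue.Queue(maxsize=2)  # bounded buffer: prefetch stays at most two batches ahead

def transfer_worker():
    for i in range(NUM_BATCHES):
        time.sleep(TRANSFER_TIME)    # simulate moving batch i to the device
        prefetched.put(i)
    prefetched.put(None)             # signal end of stream

threading.Thread(target=transfer_worker, daemon=True).start()

start = time.time()
while True:
    batch = prefetched.get()         # blocks only if the transfer falls behind
    if batch is None:
        break
    time.sleep(COMPUTE_TIME)         # simulate computing on the current batch
print(f"overlapped: {time.time() - start:.2f}s vs serial "
      f"{NUM_BATCHES * (TRANSFER_TIME + COMPUTE_TIME):.2f}s")
```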
Note: Assuming we can solve the three software problems above, we get the best of both worlds: the flexibility of software and the efficiency of dedicated hardware. Given a deep learning task, users can enjoy the performance of "infinitely large dedicated hardware" without rewiring anything. What is even more exciting is that once such software exists, dedicated hardware can be made simpler and more efficient than any of today's AI chips. Readers are invited to imagine how this beautiful prospect might be achieved; in the near future we will also share some of OneFlow's design ideas on this public account, to see whether we have arrived at the same idea.
Note: Let us reiterate a few points: (1) software really is critical; (2) we are more interested in optimization at the macro level (across devices); (3) there is an ideal form of the deep learning framework, like the perfect circle in Plato's mind, and existing frameworks are of course still far from it; (4) companies in every industry, as long as their business is data-driven, will eventually need their own "brain", and that "brain" should not be the exclusive property of a few giant companies.
The content of this article is quite imaginative; everyone is welcome to leave a comment and discuss!