Breaking the memory wall and the power consumption wall: the present and future of domestic AI-NPU chips

Publisher: EE小广播 | Updated: 2022-07-07 | Source: EEWORLD | Author: 刘建伟 (Liu Jianwei), co-founder of 爱芯元智 (Aixin Yuanzhi) | Keywords: Memory

With the rollout of 5G, the cost-effectiveness of the Internet of Things has become apparent, and the trends toward industrial digitization and urban intelligence are increasingly clear. More and more companies and cities are bringing disruptive concepts such as the digital twin into IoT innovation to improve productivity and efficiency, reduce costs, and accelerate the construction of new smart cities. Notably, digital twin technology has been written into China's 14th Five-Year Plan, providing national strategic guidance for the construction of digital twin cities.




A good example of the digital twin is the unmanned retail concept stores launched by Amazon and JD.com a few years ago, which effectively turn an offline retail store into an online one. Before shopping, customers simply open the app and complete face-recognition enrollment in the settings; once enrolled, scanning their face at the door both opens it and links their account. After shopping, there is no queue or manual checkout: customers just scan their face on the way out. The store appears unstaffed, but behind the scenes artificial intelligence tracks everything, and every move a consumer makes is captured by cameras. If, for example, you pick up a product and examine it repeatedly, it suggests you were very interested but held back for some reason and finally bought something else. Such data is captured and analyzed in depth to build a behavioral database, from which recommendations can then be pushed periodically based on your shopping history and consumption habits.


This example shows the convenience that comes from digitizing the physical world. Vision is one of the most important means by which humans perceive the world. Digitization is the foundation of an intelligent society, and perception is the prerequisite for digitizing the physical world; the type, quantity, and quality of front-end visual perception determine how intelligent our society can become. In short, the foundation of an intelligent future is "perception + computing." AI vision will play a critical role in this process and has very broad application prospects. Some industry analysts believe digital twin technology is about to move beyond manufacturing into the convergence of the Internet of Things, artificial intelligence, and data analytics. This is why we chose this entrepreneurial direction.


As the most important entrance from the physical world to the digital twin world, vision chips are receiving widespread attention, especially AI visual perception chips that can reconstruct 80%–90% of the physical world.


So what is an AI visual perception chip? From the perspective of demand, it needs two major capabilities: to see clearly, and to understand. The AI-ISP is responsible for seeing clearly; the AI-NPU is responsible for understanding.


Figure | Technical features of AI vision chips


In fact, in a broad sense, any chip that accelerates AI workloads can be called an AI chip, and the module dedicated to speeding up AI algorithm execution is commonly called an NPU (neural network processor). AI vision chips accelerated by NPUs are already widely used in big data, intelligent driving, and image processing.


According to the latest data released by IDC, the accelerated server market reached US$5.39 billion in 2021, up 68.6% year-on-year. GPU servers dominate with roughly 90% of the market, while non-GPU accelerated servers such as ASIC- and FPGA-based systems reached US$630 million, an 11.6% share growing at 43.8%. This means that the application of NPUs has moved past the early pilot stage and is becoming a key requirement in artificial intelligence business. So today we will talk about the AI-NPU, which helps both with "seeing clearly" and with "understanding."


Why is "seeing clearly" also related to the AI-NPU? Intuitively, "seeing clearly" is easy to understand. At night, for example, we want to see things more clearly, but pictures taken by traditional cameras are often overexposed, with color detail lost, and noise appears around moving people and distant buildings. How, then, can we achieve "seeing clearly" in such conditions? In fact, the vision chip cannot "see clearly" without the high computing power of the AI-NPU.


Figure | Night video effect comparison


Take smart cities as an example, where 5-megapixel cameras are commonly used for intelligent analysis. Traditional video-quality improvement relies on conventional ISP technology, which produces heavy noise in low-light scenes. An AI-ISP solves this problem and still delivers clear pictures in low light. However, AI-ISP technology must apply AI algorithms to the video at full resolution and full frame rate; shortcuts such as reducing resolution or skipping frames are not acceptable, because the human eye is very sensitive to flicker in image quality. For a 5-megapixel video stream, full-resolution, full-frame-rate processing places very high demands on NPU computing power.
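A back-of-envelope calculation illustrates why full-resolution, full-frame-rate AI-ISP processing is so demanding. All figures here are illustrative assumptions (sensor geometry, per-pixel network cost), not specifications from the article:

```python
# Rough estimate of the compute an AI-ISP needs at full resolution and
# full frame rate. All figures are illustrative assumptions.

width, height = 2592, 1944        # a typical 5-megapixel sensor
fps = 30                          # full frame rate
ops_per_pixel = 1000              # assumed cost of a small denoising CNN

pixels_per_second = width * height * fps
total_ops = pixels_per_second * ops_per_pixel

print(f"{pixels_per_second / 1e6:.0f} Mpixel/s to process")
print(f"{total_ops / 1e12:.2f} TOPS of sustained compute required")
```

Even at a modest assumed 1,000 operations per pixel, the stream demands sustained throughput in the hundreds of GOPS, with no frames that can be skipped.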


In intelligent analysis scenarios such as vehicle detection and license plate recognition, a common approach is to record 5-megapixel video at 30 fps, run detection only once every 3 to 5 frames, and downscale the input to 720p before detecting. This approach cannot recognize license plates far away in the frame and may miss vehicles traveling at high speed. The solution is to detect at full resolution and a higher frame rate, which again places very high demands on NPU computing power.
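The gap between the shortcut pipeline and full-resolution detection can be quantified with a simple comparison of per-second pixel workloads (the resolutions and skip interval below are the illustrative figures from the scenario above):

```python
# Compare the per-second detection workload of the common "shortcut"
# pipeline (downscale to 720p, detect every few frames) against
# full-resolution, full-frame-rate detection.

full_w, full_h, fps = 2592, 1944, 30      # 5 MP stream at 30 fps
det_w, det_h = 1280, 720                  # downscaled detection input
skip = 5                                  # detect once every 5 frames

shortcut_load = det_w * det_h * (fps / skip)   # pixels/s actually analyzed
full_load = full_w * full_h * fps              # pixels/s at full quality

print(f"shortcut: {shortcut_load / 1e6:.1f} Mpixel/s")
print(f"full-res: {full_load / 1e6:.1f} Mpixel/s")
print(f"ratio: {full_load / shortcut_load:.0f}x more work")
```

The roughly 27x jump in workload is exactly the computing-power headroom the text says the NPU must supply.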


In addition, as mentioned earlier, besides seeing clearly we also need to understand. Understanding means performing intelligent analysis, and intelligent analysis likewise requires the high computing power of the AI-NPU. We can look at this from two perspectives.


First, AI is fundamentally a tool for improving efficiency, and it must ultimately land in concrete scenarios: this is the shift from the early "AI+" concept to the more recent "+AI". So when AI lands in an industry, what can it do? Quite a lot. For example, it can replace an industry's expert systems with neural networks, which is equivalent to installing such an "expert" inside our AI chip. That expert system must be smart enough, which corresponds to a smarter, larger network. A larger network is like a larger brain: it can hold and apply more weight values, which in turn places high demands on NPU computing power.


Second, from the perspective of deployment, model training today mostly runs on servers with abundant computing power, while deployment happens on devices where computing power is limited. Only by reducing a model's computational cost to what the edge device can run can it be deployed effectively, so a model-compression step is required, and model compression demands considerable skill from engineers. If the edge device has more computing power to spare, this step can be shortened. The situation resembles the history of embedded software development: in the early days, limited by compute bottlenecks, we squeezed every drop of performance out of the hardware by writing programs in assembly; with more computing power available, we could develop in C instead. In other words, trading part of the computing power for higher development efficiency and faster AI deployment is feasible, but this approach in turn raises the requirements for NPU computing power.
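As a minimal sketch of the model-compression step the text describes, the snippet below applies simple symmetric post-training quantization, mapping float32 weights to int8 to cut storage and memory bandwidth by roughly 4x at the cost of some precision. This is an illustrative toy scheme, not any particular vendor's toolchain:

```python
# Minimal post-training weight quantization: float32 -> int8.
# Illustrative only; production schemes use per-channel scales,
# calibration data, quantization-aware training, etc.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.1, size=1000).astype(np.float32)

scale = float(np.abs(weights).max()) / 127.0        # symmetric linear scale
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequant = q.astype(np.float32) * scale              # reconstructed weights

print(f"storage: {weights.nbytes} B -> {q.nbytes} B")
print(f"max abs reconstruction error: {np.abs(weights - dequant).max():.5f}")
```

The 4x size reduction is what makes the model fit the edge device; the reconstruction error is the accuracy cost an engineer must then tune away, which is exactly the effort a higher-compute NPU lets you skip.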


Above, we analyzed the driving forces behind AI visual perception chip companies' pursuit of high-performance, high-computing-power NPUs. However, actually developing such high-computing-power chips is very difficult.


As we all know, computing power is an important indicator of NPU performance. However, the computing power figures of many early AI chips were only nominal values that could not be reached in real use: a chip might claim 1 TOPS, yet in actual operation only 200 to 400 GOPS is usable. As a result, people now prefer the more practical metrics FPS/W and FPS/$ to measure how efficiently advanced algorithms actually run on a computing platform.
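The difference between nominal TOPS and the practical FPS/W metric can be shown with two hypothetical chips; every number below is made up for illustration:

```python
# Two hypothetical chips: chip_A has higher nominal compute but lower
# utilization; chip_B has less nominal compute but uses it better.
# All figures are invented for illustration.

chips = {
    "chip_A": {"nominal_tops": 1.0, "utilization": 0.25, "watts": 5.0},
    "chip_B": {"nominal_tops": 0.6, "utilization": 0.80, "watts": 3.0},
}
ops_per_frame = 5e9   # assumed cost of one inference, in operations

for name, c in chips.items():
    effective_ops = c["nominal_tops"] * 1e12 * c["utilization"]
    fps = effective_ops / ops_per_frame
    print(f"{name}: {fps:.0f} FPS, {fps / c['watts']:.1f} FPS/W")
```

On paper chip_A claims nearly twice the compute, but on the FPS/W metric the better-utilized chip_B wins by a wide margin, which is the article's point about nominal versus usable computing power.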


Figure | Design difficulties and driving forces of AI-NPU


In the field of autonomous driving, when Tesla unveiled its FSD chip in 2019, Musk compared the FSD with the NVIDIA Drive PX2 previously used by Tesla: "From the perspective of computing power, FSD is three times that of Drive PX2, but when performing autonomous driving tasks, its FPS is 21 times that of the latter."


In the field of AI vision chips, Aixin Yuanzhi released the AX630A, its first high-performance, low-power artificial intelligence vision processor chip. In comparisons of inference speed on public datasets across different neural networks, it processed 3,116 and 1,356 frames per second respectively, far exceeding comparable chips, with power consumption of only about 3 W.


Figure | AX630A product block diagram


What exactly has widened the gap in NPU utilization? Behind it lie the memory wall and the power consumption wall. The memory wall means that when we raise the nominal computing power by stacking MAC units, data bandwidth must keep pace; otherwise the MAC units sit idle waiting for data and effective performance drops. The power consumption wall comes from two sources: the MAC units and the DDR. Stacking more MAC units raises the total power consumption of the MAC array itself, and it also demands higher bandwidth. On the server side, the more expensive HBM can be used, but the power consumed on memory is bound to increase; on the edge side, cost constraints mean there is no particularly good DDR solution.
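The memory wall can be made concrete with a simple roofline-style calculation: once the MAC array's peak compute outruns what the DRAM can feed it, attainable throughput is capped by bandwidth, not by peak TOPS. All figures below are illustrative assumptions:

```python
# Roofline-style sketch of the memory wall: attainable throughput is the
# minimum of peak compute and what the memory system can feed.
# All figures are illustrative assumptions.

peak_tops = 4.0            # peak MAC throughput, TOPS (nominal)
dram_gbps = 12.8           # assumed DDR bandwidth, GB/s
intensity = 50             # arithmetic intensity: ops per byte moved

bandwidth_bound = dram_gbps * 1e9 * intensity   # ops/s the DDR can sustain
attainable = min(peak_tops * 1e12, bandwidth_bound)

print(f"attainable: {attainable / 1e12:.2f} TOPS "
      f"({attainable / (peak_tops * 1e12):.0%} of nominal peak)")
```

With these assumed numbers the chip can only ever reach a small fraction of its nominal peak, no matter how many MAC units are stacked, which is exactly why raising bandwidth (and its power cost) is inseparable from raising computing power.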
