Exclusive in-depth analysis | Why is Arm's AI processor so late?
Arm has launched a series of new IP, including an NPU, a GPU and a DPU, and the NPU is particularly worth watching.
Text | Bao Yonggang
Last week, Arm launched a series of new IP, including an NPU, a GPU and a DPU. The NPU is particularly noteworthy, not only because the line gained two new products, the N57 and N37, but also because Arm officially announced Ethos as the name of its ML (Machine Learning) processor series. The launch of the new AI products makes Arm's AI strategy clearer.
However, competition over AI processors in the mobile phone market began back in 2017, with Huawei, Apple, Samsung, MediaTek and Qualcomm all launching handset processors with integrated NPUs. Why did Arm wait until 2019 to launch an NPU? And can Arm's NPU succeed?
In fact, Arm originally planned to release its first ML processor in the first quarter of 2019. Explaining the timing in an exclusive interview with Leifeng.com last November, Dennis Laudick, vice president of commercial and marketing in Arm's ML business group, said: "Recently, we have seen machine learning technology stabilizing and maturing, and market demand is also increasing. We think now is the best time to enter the market."
However, Arm's first ML processor was not released until May this year, slightly later than planned. In another interview with Leifeng.com last week, Dennis said there were many reasons for the delay, with product development the main one. Developing an ML processor poses several challenges: one big challenge is that if data movement is not handled well, the processor wastes a great deal of power; another is how to balance efficiency against flexibility.
Although product development pushed back the release date, Arm still did not announce a series name or model number when the first ML product launched. Only with the release of the N57, which balances performance, cost and power for the mainstream market, and the N37, aimed at extremely cost-sensitive designs, did Arm officially announce the name of its ML processor series: Ethos (roughly translatable as "spirit"). Only then did the outside world learn that the first ML processor released in May was the Ethos-N77, positioned at the high end for markets with demanding performance requirements.
Why? Dennis explained that there were several reasons, one being that Arm was rethinking its naming scheme: "If we had announced a new series name immediately after launching the first machine learning product, people might have tied their impression of the brand to that one product. We didn't want that. We want everyone to see that the product line under the Ethos name is broad and rich. That is the main reason we waited until products for three different markets, at three different grades, had been released before officially announcing the series name."
But more importantly, there are already many competitors in the market. Is Ethos still competitive launching now? Dennis said Arm's success has always come from its ecosystem. For NPUs, the challenge is not whether a company can build its own NPU, but whether that NPU is genuinely easy to use in the market. "We have heard from partners that they want standardization across hardware. They don't want to support 15 different pieces of hardware, so they expect a standardized software platform to target."
So if Arm wants market recognition, then beyond solving the hardware and software challenges and making the product attractive enough, the ecosystem is also critical to Ethos's success.
Three Customized Technologies for Data Management
First, to be clear: the Arm Ethos-N77, N57 and N37 share the same core architecture, delivering 1 to 4 TOPS of compute, and all three NPUs are process-node independent, so they can be implemented on different process technologies.
What is unique about the Arm NPU's core architecture? Dennis said the first advantage is in handling data. To tackle data movement, Ethos relies on three major features. First, customized compression: different data types call for different compression methods, so Arm built compression technology specifically for machine learning data. Second, minimizing data loads, much as a memory cache does: the machine learning work and the data processing are deliberately reordered so that once a piece of data is loaded, as much of the work that touches it as possible is completed before the data has to move again. Third, other techniques such as a specially developed pruning technology: ordinary pruning suffers accuracy problems, while Arm's version improves efficiency, and various sparsity techniques are used as well.
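Arm has not published the details of its pruning scheme, but magnitude-based weight pruning is the common baseline form of the sparsity techniques mentioned above. A minimal NumPy sketch, with all names invented for illustration:

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights until `sparsity`
    fraction of entries are zero. This is a generic baseline, not
    Arm's proprietary accuracy-preserving scheme."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value becomes the pruning threshold.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

w = np.array([[0.9, -0.05, 0.3],
              [-0.02, 0.7, -0.1]])
pw = magnitude_prune(w, 0.5)
# Half the entries are now zero; hardware can skip zero weights,
# saving both multiplications and data movement.
```

The hardware payoff is exactly the data-movement point Dennis makes: a zero weight never needs to be fetched or multiplied.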
On top of this there is the trade-off between efficiency and flexibility, struck so that the hardware itself remains relevant over a life cycle of two to three years.
Looking at the specific products, the design of the Ethos-N57 and Ethos-N37 follows a few basic principles: optimized support for the Int8 and Int16 data types; advanced data-management techniques to reduce data movement and the power it consumes; and an implementation of the Winograd algorithm, which Arm says improves performance by more than 200% over other NPUs.
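Winograd's minimal filtering algorithm trades multiplications for cheap additions, which is where the performance claim comes from. The standard 1-D F(2,3) case below (textbook math, not Arm-specific code) computes two outputs of a 3-tap convolution with 4 multiplications instead of 6:

```python
def winograd_f23(d, g):
    """Winograd F(2,3): two outputs of a 3-tap convolution over the
    input tile d = [d0, d1, d2, d3], using only 4 multiplications."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct_conv(d, g):
    """Reference: direct 3-tap convolution (6 multiplications)."""
    return [d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
            d[1]*g[0] + d[2]*g[1] + d[3]*g[2]]

assert winograd_f23([1, 2, 3, 4], [1, 2, 3]) == direct_conv([1, 2, 3, 4], [1, 2, 3])
```

In a 2-D convolution the saving compounds (F(2x2, 3x3) needs 16 multiplies instead of 36), and the filter transforms can be precomputed once per layer, which is why the technique suits fixed 3x3 convolutions so well.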
The data types an AI processor supports are critical. As algorithms and models mature, many edge AI chips support only Int8, yet Arm also chose to support Int16. Dennis said that Int8 is basically sufficient for machine learning; the reason for also supporting Int16 is to better handle tasks involving image processing. Pixels are usually 10 to 12 bits per color channel, and Int16 support eliminates a great deal of data-conversion work, making it very well suited to image processing.
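A small illustrative example of the pixel-width point (my own, not Arm code): a 12-bit sensor value fits an Int16 directly, while squeezing it into Int8 forces a lossy rescale and a conversion step in each direction.

```python
# A 12-bit sensor pixel spans 0..4095.
pixel12 = 2749

# Int16 (range -32768..32767) holds the value exactly:
# no conversion step, no loss.
as_int16 = pixel12

# Int8-only hardware must rescale to 0..255 and back.
as_int8 = pixel12 >> 4      # keep the top 8 bits: 2749 -> 171
restored = as_int8 << 4     # 171 -> 2736: 13 LSBs of detail gone

assert as_int16 == pixel12
assert restored != pixel12  # precision was lost in the round trip
```

This is the conversion work (and accuracy cost) that native Int16 support avoids.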
So why no support for the higher-precision FP16? "Because FP16 has very high bandwidth requirements, the processor's overall processing load and power consumption would rise. Yet despite consuming all that power and bandwidth, the accuracy gain over Int8 is small," Dennis explained.
Although machine learning brings new challenges to processor design, Dennis believes it does not change the most fundamental design principles; rather, the emphasis of an ML processor differs from that of a general-purpose one. Arm has always stressed that data management is its focus, along with parallel computing and matrix multiplication.
The trade-off between software and hardware
Beyond the hardware itself, software is equally important, and in the AI era hardware-software co-design matters more than ever. To combine the two well, Dennis sees two major challenges. One is the trade-off: how much work should be done in hardware and how much in software. The other is supporting ML frameworks, because the field is still young and new frameworks keep emerging.
Dennis said that when Arm develops ML hardware, it first considers what the software needs and then designs the hardware; in effect, software requirements drive the hardware design. Arm has put a great deal of effort into the underlying software: more than half of its machine learning engineering team works on software. "We have been doing this for three years, and there is still much room for improvement."
Beyond raising AI performance through hardware-software co-design, improving processor performance through heterogeneous systems has also drawn attention as advanced semiconductor processes grow ever more expensive. But heterogeneity makes the software harder. Should a unified software API allocate the hardware resources, for ease of use, or should each processor be programmed separately, for efficiency?
Arm layers a specially optimized Compute Library on top of the hardware. It fully exploits the underlying hardware and drivers, driving the hardware according to each operator's needs, which can improve efficiency by several times to dozens of times. Above that sits Arm NN, which converts networks from frameworks such as TensorFlow and Caffe into tasks the Compute Library can execute, so developers need not worry about the underlying hardware and can develop against a standard framework.
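The layering described here (framework graph in at the top, per-operator hardware dispatch underneath) can be caricatured as follows. This is a toy sketch in the spirit of Arm NN, not its real API: Arm NN is a C++ library, and every name below (SUPPORT, dispatch, the backend labels) is invented for illustration.

```python
# Toy per-operator backend dispatcher. Which (hypothetical)
# backends can run each operator type:
SUPPORT = {
    "npu": {"conv2d", "fully_connected"},
    "gpu": {"conv2d", "fully_connected", "softmax"},
    "cpu": {"conv2d", "fully_connected", "softmax", "argmax"},
}

def dispatch(ops, preference=("npu", "gpu", "cpu")):
    """Assign each operator to the first preferred backend that
    supports it, falling back toward the general-purpose CPU."""
    placement = {}
    for op in ops:
        for backend in preference:
            if op in SUPPORT[backend]:
                placement[op] = backend
                break
    return placement

print(dispatch(["conv2d", "softmax", "argmax"]))
# {'conv2d': 'npu', 'softmax': 'gpu', 'argmax': 'cpu'}
```

The design point this illustrates is the one Dennis raises next: the split between dedicated and general-purpose processors is a policy decision in the software stack, so it can be tuned per application rather than baked into the hardware.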
So, Dennis said, Arm's approach is a relatively low-level one: the software talks directly to the CPU, GPU or NPU to find the best match. The biggest challenge is balance. The software stack must split work between dedicated and general-purpose processors according to the specific application, and that split must be continuously adjustable, which is the hardest part.
On framework support, Arm hopes its hardware will free developers from having to commit to any particular framework.
However unique the NPU's hardware and software, they succeed only if the market adopts them, and the first requirement is meeting the compute demands of different scenarios. The three existing Ethos ML processors span 1 to 4 TOPS, but even at the edge there will be higher performance requirements, to say nothing of high-performance computing.
Dennis said that for high-performance scenarios, Arm's ML products are built as units that can be tiled together to scale the processor up: combine several units that each deliver 4 TOPS, and higher performance targets can be met.
But as system complexity grows, adding compute units does not always yield linear performance gains. How does Arm respond? Dennis acknowledged that this kind of scaling has its limits, and the gains can vanish past a certain point, but Arm's long experience with multi-processor GPU and CPU architectures helps it sustain near-linear scaling for as long as possible.
"This is why we emphasize that the Ethos series will be a very long and very broad product line. We will extend this product line and find different ways to do machine learning," Dennis further stated.
As noted earlier, the key to Arm's success is its ecosystem, and offering both dedicated and general-purpose chips is an Arm advantage in the AI and IoT era. Launched alongside the NPUs were the Mali-G57, built on the latest Valhall architecture, and the Mali-D37, Arm's most area-efficient display processor.
Dennis still believes the future market needs both general-purpose chips and more broadly applicable specialized ones. He said: "Many machine learning applications do not require particularly high performance, and CPUs can fully handle them. Arm's CPU performance also keeps improving; cumulatively it has grown 400-fold. At the same time, specialized lines like Ethos can meet more diverse needs."
Beyond that, Arm has open-sourced Arm NN, which can connect to third-party configurable IP, adapting it to still more application scenarios.
Facing fiercer competition in the AIoT era, particularly from RISC-V, Arm also announced a new feature earlier this month: Arm Custom Instructions, which lets customers add custom instructions to specific CPU cores to accelerate particular use cases in embedded and IoT applications.
Dennis said, "We will take RISC-V's progress and actions in the market seriously, just as we take other architectures seriously. Arm's advantage is that we can provide the most comprehensive, flexible and universal solutions and products. At the same time, we also have a strong and rich ecosystem that can better meet market needs."
For Arm, neither the AI nor the IoT market can be missed. In the handset AI market, it would clearly have been unwise for a general-purpose IP provider to launch an NPU two years ago, while AI algorithms were still iterating rapidly: there would be no guarantee that NPU would remain efficient for today's algorithms. Seen this way, Arm's choice of 2019 is understandable, and the NPU it is launching now may also, to some extent, relieve developers of the burden of adapting to many different NPU hardware designs.
Technically, the high energy cost of data access and the balance between flexibility and efficiency are problems every AI processor designer must face. Arm's advantages lie in its deep experience in architecture design and its long-standing software and ecosystem strengths, which let it attack these challenges in its own way.
Of course, Arm's NPU launch targets two important markets: mobile phones and IoT. In the IoT market, though, Arm should take RISC-V seriously as a competitor.