A look at ARM’s AI processor
Recently, ARM disclosed more information about its ML Processor. The EETimes article "Arm Gives Glimpse of AI Core" [1] and the AnandTech article "ARM Details 'Project Trillium' Machine Learning Processor Architecture" [2] cover the disclosure from different angles, and both are worth a careful read.
ARM announced its ML Processor on the eve of this year's Spring Festival. Not much was disclosed at the time, and I did a brief analysis then (see "AI chip starts the year"). This time ARM has released considerably more detail, so let's take a look. First, the key features and one important scheduling point: the IP is planned for release in mid-2018.
Top-level architecture
Compared with the basic block diagram released initially, this time we get a more detailed module block diagram and the connections between the modules, as shown in the figure below.
The top layer of MLP looks like a typical hardware accelerator. It has local SRAM and exchanges data and main control information (instructions) with the outside world through an ACE-Lite interface. There should also be some control signals, which are probably omitted here (you can refer to Nvidia's NVDLA).
In the figure above, the green arrows should represent data flow and the red arrows control flow. The CEs in the MLP share a single DMA Engine, Control Unit and Sync Unit. The basic processing flow is as follows (a rough sketch in code follows the list):
1. Configure the Control Unit and DMA Engine;
2. The DMA Engine reads data from external memory (such as DDR) and stores it in the local SRAM;
3. The Input Feature Map Read module and Weight Read module fetch the feature map and weights to be processed, pre-process them (e.g. weight decompression), and feed them to the MAC Convolution Engine (hereinafter MCE);
4. The MCE performs the convolution and related operations and passes the results to the Programmable Layer Engine (hereinafter PLE);
5. The PLE performs the remaining processing and writes the results back to the local SRAM;
6. The DMA Engine transfers the results to external storage (such as DDR).
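To make the flow above easier to follow, here is a minimal Python sketch of one layer's processing. Every object and method name (dma_engine, mce, ple, cfg, etc.) is a hypothetical stand-in for a hardware block in the diagram, not a real ARM interface.

```python
# Illustrative sketch of the per-layer flow described above; all names are
# hypothetical stand-ins for hardware blocks, not a real ARM API.

def decompress(weights):
    """Placeholder for the lossless weight decompression in the Weight Read path."""
    return weights

def run_layer(control_unit, dma_engine, sram, mce, ple, cfg, ddr):
    # 1. Configure the Control Unit and the DMA Engine for this layer
    control_unit.configure(cfg)
    dma_engine.configure(cfg.dma_descriptors)

    # 2. DMA: bring the input feature map and (compressed) weights from DDR into SRAM
    sram.ifm = dma_engine.read(ddr, cfg.ifm_addr, cfg.ifm_size)
    sram.weights = dma_engine.read(ddr, cfg.weight_addr, cfg.weight_size)

    # 3. Feature Map Read / Weight Read: fetch and pre-process (decompress) operands
    ifm, weights = sram.ifm, decompress(sram.weights)

    # 4. MCE: convolution, the bulk of the MAC work; result handed to the PLE
    partial = mce.convolve(ifm, weights)

    # 5. PLE: programmable post-processing (activation, pooling, ...), back to SRAM
    sram.ofm = ple.process(partial, cfg.ple_program)

    # 6. DMA: write the output feature map back out to DDR
    dma_engine.write(ddr, cfg.ofm_addr, sram.ofm)
```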
The Broadcast interface shown at the top level broadcasts feature-map data to the multiple Compute Engines (hereinafter CEs). The basic convolution mode is therefore: the same feature map is broadcast to all CEs, and each CE applies its own, different set of weights to that feature map (a small sketch of this partitioning follows).
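To make the broadcast scheme concrete, here is a small NumPy sketch. It assumes (ARM has not confirmed this detail) that output channels are partitioned across the CEs while the input feature map is shared:

```python
import numpy as np

NUM_CE = 16  # per the disclosed configuration

def broadcast_conv1x1(ifm, weights):
    """Toy 1x1 convolution: the same ifm is 'broadcast' to every CE,
    each CE handling a different slice of output channels.
    ifm:     (C_in, H, W)
    weights: (C_out, C_in)
    """
    c_out = weights.shape[0]
    per_ce = c_out // NUM_CE  # assume C_out divides evenly, for simplicity
    outputs = []
    for ce in range(NUM_CE):
        w_ce = weights[ce * per_ce:(ce + 1) * per_ce]   # CE-private weight slice
        # Every CE sees the full ifm (the broadcast); einsum does the MACs
        outputs.append(np.einsum('oc,chw->ohw', w_ce, ifm))
    return np.concatenate(outputs, axis=0)              # (C_out, H, W)

# Example: 64 input channels, 32 output channels, 8x8 feature map
ofm = broadcast_conv1x1(np.random.rand(64, 8, 8), np.random.rand(32, 64))
assert ofm.shape == (32, 8, 8)
```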
In the disclosed configuration, the MLP contains 16 Compute Engines with 128 MACs each, i.e. 16 × 128 = 2048 MACs in total, or 4096 operations per cycle (counting a MAC as a multiply plus an add). Reaching the 4.6 TOPS total throughput quoted by ARM therefore requires a clock of roughly 1.12 GHz. Since this figure targets a 7nm process, that should not be a problem.
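The arithmetic, for reference (counting 2 ops per MAC is my assumption, but it is the usual convention behind such TOPS figures):

```python
macs = 16 * 128            # 16 CEs x 128 MACs each
ops_per_cycle = macs * 2   # multiply + accumulate counted as 2 ops
target_tops = 4.6e12
clock_hz = target_tops / ops_per_cycle
print(f"{ops_per_cycle} ops/cycle -> {clock_hz / 1e9:.2f} GHz for 4.6 TOPS")
# 4096 ops/cycle -> 1.12 GHz for 4.6 TOPS
```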
MCE achieves efficient convolution
In the MLP architecture, the MCE and the PLE are the most important functional modules. The MCE provides the main computing power (it handles about 90% of the operations) and should account for the largest share of area and power in the MLP. One of the main goals of the MCE design is therefore efficient convolution. Specifically, the MLP design mainly relies on the following methods, most of which we have discussed before.
An interesting point is the "varied internal precision" mentioned above. Its exact meaning is not clear yet; in any case, what applications see should be a fixed 8-bit data type.
As for support for low-precision inference, the information given in [1] is: "The team is tracking research on data types down to 1-bit precision, including a novel 8-bit proposal from Microsoft. So far, the alternatives lack support in tools to make them commercially viable, said Laudick."
Therefore, in the first version of the MLP we should not expect to see low-precision or bit-serial MACs (see the introduction to the bit-serial processing shown at ISSCC 2018 in "AI chip opening year").
In addition, data compression and process optimization are important levers for improving overall efficiency. Process optimization in particular, combined with ARM's own process libraries, should pay off nicely, and this is also where ARM has an advantage.
PLE enables efficient programmability
As shown in the figure below, the PLE is essentially an ARM MCU extended with vector-processing and NN-processing instructions. The rationale for this programmability is that NN algorithms and network architectures are still evolving.
We have already walked through the basic workflow of the whole MLP: after finishing its computation, the MCE hands the result to the PLE. From the figure it appears that the MCE writes the result into the Vector Register File (VRF) and then raises an interrupt to notify the PLE's CPU, which in turn launches the Vector Engine to process the data. The details are shown in the figure below.
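A minimal sketch of what that interrupt-driven hand-off might look like from the PLE firmware's point of view (entirely hypothetical; the real microcode and ISA have not been disclosed):

```python
import queue

# Hypothetical model of the MCE -> PLE hand-off described above.
# None of these names correspond to real ARM interfaces.

vrf_mailbox = queue.Queue()   # stands in for "result lands in the VRF + interrupt fires"
local_sram = []               # stands in for the CE's SRAM

def mce_done(result_tile):
    """MCE side: deposit a finished tile and (implicitly) raise the interrupt."""
    vrf_mailbox.put(result_tile)

def relu(tile):
    """Toy stand-in for the Vector Engine's per-element work (e.g. an activation)."""
    return [max(0, x) for x in tile]

def ple_firmware(num_tiles):
    """PLE scalar CPU loop: wait for each interrupt, run the Vector Engine, write back."""
    for _ in range(num_tiles):
        tile = vrf_mailbox.get()        # blocks, like waiting on the MCE interrupt
        local_sram.append(relu(tile))   # write back via the Load/Store unit / uDMA

# Example: the MCE produces two tiles, then the PLE drains and processes them
mce_done([-1.0, 2.0]); mce_done([3.0, -4.0])
ple_firmware(num_tiles=2)
assert local_sram == [[0, 2.0], [3.0, 0]]
```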
For those who have worked on specialized processors, this scalar CPU + vector engine architecture will look familiar. Load/Store units and a uDMA move data between the PLE's local SRAM, the VRF and the Main SRAM Unit (the SRAM in the CE) outside the PLE, so the data flow is fairly flexible.
In general, each CE in the MLP pairs one PLE with one MCE, i.e. every MCE (128 MACs) has its own programmable engine. The programmability and flexibility of the ARM MLP are therefore much higher than those of Google's TPU1 or Nvidia's NVDLA. Of course, flexibility also brings extra overhead; as [1] puts it, "The programmable layer engine (PLE) on each slice of the core offers 'just enough programmability to perform [neural-net] manipulations'". Highly efficient programmability is one of the MLP's main selling points, and whether ARM's "just enough" really is the most suitable choice remains to be seen.
Other Information
In this release, ARM also emphasized its work on data compression, including hardware support for lossless compression. I have discussed this in more detail in a previous article, so I will not repeat it here; I will just post a few interesting figures.
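ARM has not disclosed the actual compression scheme. Purely as an illustration of why lossless compression pays off on sparse NN data, here is a toy zero run-length encoder (my example, not ARM's format):

```python
def zero_rle_encode(values):
    """Toy lossless scheme: runs of zeros become (0, run_length) pairs.
    Sparse weights / feature maps compress well; dense data passes through."""
    out, i = [], 0
    while i < len(values):
        if values[i] == 0:
            run = 1
            while i + run < len(values) and values[i + run] == 0:
                run += 1
            out.append((0, run))
            i += run
        else:
            out.append(values[i])
            i += 1
    return out

# A sparse row of weights shrinks noticeably, and the encoding is exactly invertible
row = [0, 0, 0, 5, 0, 0, 7, 0, 0, 0, 0, 3]
print(zero_rle_encode(row))   # [(0, 3), 5, (0, 2), 7, (0, 4), 3]
```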
As an IP core, configurability is an important feature. It is not yet known exactly which hardware parameters of the MLP will be configurable; the larger ones, such as the number of Compute Engines, the number of MACs, and the size of the SRAM, presumably will be, while the finer details will have to wait for the final release. Note also that these parameters are tightly coupled to the software tools: the more configurable parameters there are, the more configurations the tools must support, which is harder. [2] puts it this way: "In terms of scalability the MLP is meant to come with configurable compute engine setups from 1 CE up to 16 CEs and a scalable SRAM buffer up to 1MB. The current active designs however are the 16CE and 1MB configurations and smaller scaled down variants will happen later on in the product lifecycle."
Competition
Beyond the fairly standard performance figures, ARM has not yet announced the area, power consumption and other concrete parameters of the MLP, nor the exact release date (the current statement is that "production release of the RTL is on track for mid-year").
In this already crowded market, ARM is clearly a late mover. [1] notes up front that "Analysts generally praised the architecture as a flexible but late response to a market that is already crowded with dozens of rivals" and lists some of those competitors.
In fact, given ARM's key position in the processor IP market and the broader ecosystem, arriving late may not matter much. As [1] reports, ARM is working closely with some smartphone manufacturers: "In a sign of Arm's hunger to unseat its rivals in AI, the company has 'gone further than we normally would, letting [potential smartphone customers] look under the hood'."
Another important advantage is that ARM laid some software-tool groundwork before launching the MLP, including armnn and its open-source compute libraries, as shown in the figure below.
Widespread use of these tools helps ARM accumulate experience and optimize both the hardware and the software tools. As ARM is quoted in [1]: "Winning the hearts and minds of software developers is increasingly key in getting design wins for hardware sockets... This is kind of the start of software 2.0. For a processor company, that is cool. But it will be a slow shift, there's a lot of things to be worked out, and the software and hardware will move in steps."
We can also see that a large number of embedded AI applications currently run on various existing ARM hardware; many companies have invested a lot of effort in optimizing the relevant algorithms and implementations, with good results.
Of course, this raises another interesting question: once the MLP arrives, where should ML tasks actually run, and how should processors with such different characteristics cooperate? The article touches on this as well: "Arm will release more data on the core's performance when it is launched, probably in mid-June. But don't expect detailed guidance on when to run what AI jobs on its CPU, GPU, or new machine-learning cores, a complex issue that the company, so far, is leaving to its SoC and OEM customers." It seems that, at least in the short term, this hard problem will be left to users.
Another detail worth noting: [1] mentions, "Theoretically, the design scales from 20 GOPS to 150 TOPS, but the demand for inference in the Internet of Things will pull it first to the low end. Arm is still debating whether it wants to design a core for the very different workloads of the data center that includes training. 'We are looking at [a data center core], but it's a jump from here,' and it's still early days for thoughts on a design specific for self-driving cars, said Laudick." From this we can see that the MLP scales over a wide range of processing power and should be able to cover most inference applications from edge to cloud. At the top end of 150 TOPS, the number of MACs would be on the same order as Google's first-generation, inference-only TPU. Compared with Google's systolic-array architecture, however, the MLP has a more complex control path and is far more flexible; it will be interesting to see whether this helps ARM break into the data-center inference market.
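A rough back-of-the-envelope check of that comparison (my assumptions: 2 ops per MAC and the ~1.12 GHz clock estimated earlier; the real high-end configuration has not been disclosed):

```python
target_tops = 150e12
clock_hz = 1.12e9
macs_needed = target_tops / (2 * clock_hz)   # 2 ops per MAC per cycle
print(f"~{macs_needed:,.0f} MACs")           # ~66,964 MACs
print(f"TPU1 has {256 * 256:,} MACs")        # 65,536 MACs in its 256x256 systolic array
```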
References:
1. "Arm Gives Glimpse of AI Core", EETimes, https://www.eetimes.com/document.asp?doc_id=1333307
2. "ARM Details 'Project Trillium' Machine Learning Processor Architecture", AnandTech, https://www.anandtech.com/show/12791/arm-details-project-trillium-mlp-architecture