How should we look at NPUs?
Source: compiled from Quadric.
There are dozens of NPU options on the market today, each with competing and conflicting claims about efficiency, programmability, and flexibility. One of the most glaring differences among them comes down to a seemingly simple question: where is the "best" place to put memory relative to compute in the NPU system hierarchy?
Some NPU architectural styles rely heavily, or even exclusively, on direct access to system DRAM, counting on the cost-per-bit advantage of high-volume commodity DRAM over other memory options, but they are constrained by the need to partition the workload across multiple chips. Other NPU options rely heavily or entirely on on-chip SRAM for speed and simplicity, but at the expense of high silicon area cost and limited flexibility. Still others employ novel memory types (such as MRAM) or novel analog circuit structures, neither of which has a proven, high-volume manufacturing record. Despite the wide variety of NPU options, they generally align with one of three memory-locality styles, and these three styles bear an uncanny resemblance (pun intended) to the children's story of the Three Bears!
In the fairy tale Goldilocks and the Three Bears, Goldi must choose among three beds, three chairs, and three bowls of porridge: one is "too hot," another "too cold," and the last "just right." If Goldi were making architectural choices for AI processing in a modern edge/device SoC, she would likewise face three options for placing compute relative to the local memory that stores activations and weights.
In, at, or near memory?
The terms compute-in-memory (CIM) and compute-near-memory (CNM) originated in architectural discussions of data-center system design, and there is a large literature weighing the merits of the various approaches. All of that analysis boils down to minimizing the power and latency spent moving working data sets between processing elements and storage elements in the data center.
In the world of AI inference systems-on-chip (SoCs) optimized for edge devices, the same principles apply, but there are three levels of proximity to consider: in-memory computing, at-memory computing, and near-memory computing. Let's quickly examine each.
In-memory computing: a mirage
In-memory computing refers to the various attempts over the past decade to pack computation into the memory bit cells or memory macros used in SoC designs. Almost all of these attempts employ some form of analog computing within the bit cells of the DRAM or SRAM (or more exotic memories such as MRAM) in question. In theory, these approaches speed up computation and reduce power by performing the arithmetic (especially multiplication) in the analog domain and in a massively parallel fashion. Compelling as the idea sounds, it has so far failed to deliver.
The reasons for the failure are multifaceted. First, both the ubiquitous on-chip SRAM bit cell and off-chip DRAM have been refined and optimized for nearly 40 years; grafting analog compute onto these highly optimized structures introduces area and power inefficiencies compared to leaving the memory alone, and injecting such a non-standard approach into the tried-and-tested standard-cell design flows used by SoC companies has proven impractical. The other major disadvantage is that these analog approaches perform only a very limited subset of the computation required for AI inference, namely the matrix multiplications at the core of convolution operations. No in-memory compute block can be built with enough flexibility to cover all possible convolution variants (kernel size, stride, dilation) and all possible MatMul configurations, and analog in-memory compute certainly cannot implement the other roughly 2,300 operators found in the world of PyTorch models. An in-memory computing solution therefore still needs a full-featured digital NPU alongside its analog enhancements, and when that memory is used in the conventional way for all of the computation running on the accompanying digital NPU, the analog "enhancements" become dead weight in both area and power.
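To make the flexibility problem concrete, here is a minimal PyTorch sketch; the tensor shapes and layer parameters are illustrative assumptions rather than values from any particular model. It shows how widely convolution configurations vary (kernel size, stride, dilation, grouping) and how many common operators are not matrix multiplications at all, which is exactly the work a fixed analog MAC array would have to hand back to a digital NPU.

```python
# Minimal sketch of the variety of operations an NPU must cover.
# Shapes and parameters below are illustrative assumptions only.
import torch
import torch.nn.functional as F

x = torch.randn(1, 64, 56, 56)        # N, C, H, W activation tensor
w_std = torch.randn(128, 64, 3, 3)    # ordinary 3x3 convolution weights
w_dw = torch.randn(64, 1, 5, 5)       # depthwise 5x5 weights (groups=64)

# Convolution variants differ in kernel size, stride, dilation, and grouping.
y1 = F.conv2d(x, w_std, stride=1, padding=1)              # 3x3, stride 1
y2 = F.conv2d(x, w_std, stride=2, padding=1)              # 3x3, stride 2
y3 = F.conv2d(x, w_std, stride=1, padding=2, dilation=2)  # dilated 3x3
y4 = F.conv2d(x, w_dw, stride=1, padding=2, groups=64)    # depthwise 5x5

# Many other operators are not MatMuls at all and do not map onto an
# analog multiply array: pooling, normalization, resizing, sorting, ...
y5 = F.max_pool2d(y1, kernel_size=2)
y6 = F.softmax(y1.flatten(1), dim=-1)
y7 = F.interpolate(y1, scale_factor=2, mode="nearest")
vals, idx = torch.topk(y6, k=5, dim=-1)

print([tuple(t.shape) for t in (y1, y2, y3, y4, y5, y6, y7)])
```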
In the final analysis, in-memory computing solutions for edge-device SoCs are "too limited" to be of any use to our intrepid chip architect Goldi.
Near-memory computing: still too far away
At the other end of the spectrum of SoC inference design approaches is the idea of minimizing on-chip SRAM and maximizing the use of low-cost, high-capacity, mass-produced memory (primarily DDR DRAM). This concept leans on the cost advantage of high-volume DRAM production and assumes that, with minimal on-chip SRAM and sufficient bandwidth to the low-cost DRAM, the AI inference subsystem can reduce SoC cost; maintaining performance, however, depends on a fast connection to external memory (usually a dedicated DDR interface managed solely by the AI engine).
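As a rough illustration of why a dedicated DDR interface becomes necessary, the back-of-the-envelope sketch below estimates the external bandwidth a weight-streaming design would need; the model size, activation traffic, and frame rate are hypothetical placeholders, not measurements.

```python
# Back-of-the-envelope DDR bandwidth estimate for a near-memory design
# that streams weights and spilled activations from external DRAM on
# every inference. All numbers are hypothetical placeholders.

weight_bytes = 50e6       # assumed 50 MB of (quantized) model weights
activation_bytes = 30e6   # assumed intermediate activations spilled to DDR
inferences_per_s = 30     # assumed target frame rate

bytes_per_inference = weight_bytes + activation_bytes
required_bw_gbps = bytes_per_inference * inferences_per_s / 1e9

print(f"Required DDR bandwidth: {required_bw_gbps:.1f} GB/s")
# ~2.4 GB/s for this one workload, before the CPU, GPU, ISP, and display
# controller contend for the same external memory interface.
```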
While at first glance the near-memory approach may successfully shrink the die area devoted to AI, thereby slightly reducing system cost, it has two major drawbacks. First, the power consumption of such a system is excessive. Consider the following table, which shows the relative energy cost of moving a 32-bit data word into or out of the multiply-accumulate logic of an NPU core:
Each data transfer from the SoC to DDR consumes 225 to 600 times the energy of a transfer from the memory immediately adjacent to the MAC unit. Even on-chip SRAM that sits relatively "far" from the MAC unit is 3 to 8 times more energy efficient than going off-chip. Since most of these SoCs go into power-constrained consumer devices, relying primarily on external memory makes the near-memory design point impractical on power grounds alone. Furthermore, always depending on external memory adds latency, and as newer, more complex models evolve, with more irregular data-access patterns than the old ResNet, near-memory solutions will suffer severe performance degradation from that latency.
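A quick calculation using only the relative ratios quoted above (treated as rough, architecture-dependent figures, normalized to one energy unit per local-SRAM access) shows how quickly off-chip traffic dominates the energy budget:

```python
# Relative energy of keeping a tensor local vs. pushing it to DDR, using
# the 225x-600x off-chip ratio quoted in the text. The per-access energy
# of adjacent SRAM is normalized to 1 unit; the tensor size is assumed.

words = 1 * 128 * 56 * 56            # hypothetical activation tensor, 32-bit words

local_sram_energy = words * 1.0      # adjacent SRAM: 1 unit per word
ddr_energy_low = words * 225.0       # optimistic off-chip ratio
ddr_energy_high = words * 600.0      # pessimistic off-chip ratio

print(f"local SRAM : {local_sram_energy:,.0f} units")
print(f"DDR (best) : {ddr_energy_low:,.0f} units ({ddr_energy_low / local_sram_energy:.0f}x)")
print(f"DDR (worst): {ddr_energy_high:,.0f} units ({ddr_energy_high / local_sram_energy:.0f}x)")
```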
The double whammy of too much power and too little performance meant that the near-memory approach was "too hot" for our chip architect Goldi.
At-Memory: Just Right
Just as the Goldilocks fable always offers a "just right" alternative, at-memory compute architectures are the just-right solution for edge and device SoCs. Referring again to the data-transfer energy costs in the table above, the best location for memory is clearly the on-chip SRAM immediately adjacent to the compute. Writing an intermediate activation value to local SRAM consumes roughly 200x less power than pushing that value off-chip. But that does not mean you want to use only on-chip SRAM: doing so places a hard upper limit on the model size (weight footprint) that a given implementation can support.
The best option for the SoC designer is to pair small local SRAMs (ideally distributed in large numbers across an array of compute elements) with intelligent scheduling of data movement between those SRAMs and off-chip DDR storage, minimizing both system power consumption and data-access latency, as the sketch below illustrates.
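As a loose illustration of that scheduling idea (the tile sizes, SRAM capacity, and the 225x energy ratio are assumptions carried over from the figures above, not parameters of any particular NPU), this sketch models streaming weight tiles from DDR with a double buffer while activations stay resident in local SRAM, then tallies the traffic and relative energy seen by each memory level.

```python
# Sketch of an at-memory schedule: activations stay in distributed local
# SRAM; weights stream from DDR tile by tile. The assert enforces that two
# weight tiles (a double buffer) fit alongside the activations. Assumed
# sizes and the 225x DDR energy ratio are illustrative only.

LOCAL_SRAM_BYTES = 1 << 20   # assumed 1 MiB of local SRAM per compute tile
DDR_ENERGY_RATIO = 225       # energy per byte from DDR vs. local SRAM

def schedule(weight_bytes: int, activation_bytes: int, tile_bytes: int) -> None:
    """Stream weights tile-by-tile; keep activations resident in local SRAM."""
    assert activation_bytes + 2 * tile_bytes <= LOCAL_SRAM_BYTES, \
        "activations plus two weight tiles (double buffer) must fit in local SRAM"

    ddr_traffic = weight_bytes                           # each weight byte crosses DDR once
    sram_traffic = weight_bytes + 2 * activation_bytes   # weights written once; activations read + written
    n_tiles = -(-weight_bytes // tile_bytes)             # ceiling division

    relative_energy = ddr_traffic * DDR_ENERGY_RATIO + sram_traffic * 1
    print(f"{n_tiles} weight tiles, DDR traffic {ddr_traffic / 1e6:.1f} MB, "
          f"local SRAM traffic {sram_traffic / 1e6:.1f} MB, "
          f"relative energy {relative_energy:.2e} units (local SRAM access = 1)")

# Example: a 12 MB layer's weights streamed in 256 KiB tiles while a
# 384 KiB activation working set never leaves local SRAM.
schedule(weight_bytes=12_000_000, activation_bytes=384 * 1024, tile_bytes=256 * 1024)
```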
END