
Tsinghua Team Releases 3D DRAM Storage-Computing Integrated Architecture

Latest update: 2024-08-10



ISCA (the International Symposium on Computer Architecture) is the top conference in the field of computer architecture. The 2024 edition was held in Buenos Aires, Argentina, from June 29 to July 3, 2024. The conference received 423 submissions and accepted 54 papers, an acceptance rate of about 12.8%. At the conference, Tsinghua University's School of Integrated Circuits presented the world's first 3D DRAM storage-computing integrated architecture for large vision AI models, drawing attention from both academia and industry.


The paper is titled "Exploiting Similarity Opportunities of Emerging Vision AI Models on Hybrid Bonding Architecture." Professor Yin Shouyi and Associate Professor Hu Yang are the corresponding authors, and Yue Zhiheng is the first author; collaborators include Assistant Professor Tu Fengbin of the Hong Kong University of Science and Technology and Professor Li Chao of Shanghai Jiao Tong University. The three-dimensional integrated DRAM storage-computing architecture proposed by the team (Figure 1) breaks decisively through the memory-wall bottleneck and, building on the characteristics of 3D integration, realizes similarity-aware computing that further improves the computing efficiency of large AI models.


Figure 1 3D DRAM storage and computing architecture


Background


The rise of large artificial intelligence models has overturned perceptions of traditional AI; such models are now deployed in many fields with excellent results. However, ever-growing model sizes also bring enormous memory overhead. Frequent data transfers between off-chip storage and on-chip computing units are constrained by memory bandwidth and exact a steep cost in latency and energy. This is known as the "memory wall" bottleneck.


Near-Memory Computing and the “Beachfront Problem”


In the traditional architecture that separates storage from computing, data travels a long distance between the compute die and off-chip DRAM, incurring high access latency and power consumption. To answer the urgent need for bandwidth, HBM (High Bandwidth Memory) has been widely adopted in recent years [1]. HBM stacks 8 to 12 DRAM dies vertically and uses 1024 through-silicon vias (TSVs) as data channels, effectively raising memory bandwidth. The HBM stack is then packaged with the compute die on a silicon interposer using advanced packaging, so data travels a shorter distance between computing and storage units, improving processing performance through "near-memory computing."
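As a rough sense of scale, the back-of-the-envelope calculation below estimates per-stack HBM bandwidth from the 1024-pin interface; the per-pin data rate is an assumed HBM3-class figure, not a number from the paper.

```python
# Back-of-the-envelope HBM bandwidth estimate.
# The per-pin rate is an assumed HBM3-class value (not from the paper).
TSV_DATA_PINS = 1024      # width of the HBM data interface
PIN_RATE_GBPS = 6.4       # assumed data rate per pin, Gb/s

bandwidth_gbytes = TSV_DATA_PINS * PIN_RATE_GBPS / 8
print(f"Per-stack bandwidth: {bandwidth_gbytes:.0f} GB/s")  # ~819 GB/s
```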


However, even this high-bandwidth near-memory solution is constrained by the "beachfront problem" and cannot push the memory bottleneck further back. The metaphor treats the compute die as an island: data I/O channels can only be placed along the island's beach, and the beach's length caps the total I/O that fits. Because factors such as signal crosstalk limit how densely adjacent I/Os can be packed, the I/O count cannot grow further under a 2.5D near-memory integration scheme, making it difficult to raise bandwidth.
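A minimal sketch of the beachfront geometry, assuming an illustrative 10 mm die and illustrative I/O pitches (none of these values come from the paper): perimeter I/O grows linearly with the die edge, while area I/O grows with its square.

```python
# Perimeter-limited (2.5D "shoreline") vs. area-limited (3D) I/O counts.
# Die size and pitches are illustrative assumptions, not paper figures.
die_edge_um = 10_000          # assumed 10 mm square compute die
shoreline_pitch_um = 55       # assumed 2.5D micro-bump pitch
bond_pad_pitch_um = 10        # assumed hybrid-bonding pad pitch

perimeter_ios = 4 * die_edge_um // shoreline_pitch_um
area_ios = (die_edge_um // bond_pad_pitch_um) ** 2

print(f"2.5D shoreline I/Os: {perimeter_ios:,}")   # ~727
print(f"3D area I/Os:        {area_ios:,}")        # 1,000,000
```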


Two-Dimensional In-Memory Computing and the "Process Bottleneck Problem"


To further increase the bandwidth available to computing units, DRAM-based in-memory computing goes a step further and integrates computing units inside the storage array: compute logic is placed beside each DRAM bank, and data read out of a bank is processed immediately by the adjacent units. This bank-level in-memory computing effectively sidesteps the beachfront problem of the two-dimensional near-memory solution.


However, when computing units are integrated inside the DRAM, the compute circuits must be built in the DRAM process. Compared with advanced logic processes, compute circuits fabricated in a DRAM process deliver poor performance at high area cost. The added computing units also occupy DRAM array area, reducing the DRAM's own storage capacity; for example, introducing in-memory computing units cut the storage capacity of Samsung's HBM-PIM by 50% [2].


Three-Dimensional Storage-Computing Fusion Architecture: From Memory "Wall" to Memory "Bridge"


To address the bandwidth bottleneck of near-memory architectures and the process bottleneck of two-dimensional in-memory computing, the research team explored a three-dimensional storage-computing integrated architecture for the first time. The scheme solves the beachfront problem by stacking computing units and DRAM storage vertically and interconnecting them with metal copper pillars as data channels: data I/O can be placed anywhere across the die area, greatly increasing data-path density. Because the DRAM array and the compute logic are manufactured independently, the logic circuits are not limited by the DRAM process and do not cost any storage capacity. In this architecture, the DRAM array consists of basic DRAM banks. Each DRAM bank is stacked on its corresponding compute bank through a hybrid bonding process, and the two exchange data through high-density copper pillars. The pillars are short and have small parasitic capacitance, so the data path behaves like a directly connected wire. Each DRAM bank together with its compute bank forms a bank-level storage-computing integrated unit (as shown in Figure 1).
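To see why bank-level links help, the sketch below compares one shared HBM-style interface against many independent per-bank copper-pillar links; the bank count, link width, and data rates are all assumed for illustration.

```python
# Aggregate bandwidth: one shared interface vs. many bank-level links.
# All parameters are illustrative assumptions, not figures from the paper.
hbm_pins, hbm_gbps_per_pin = 1024, 6.4       # shared 2.5D interface
banks, link_bits, link_gbps = 64, 256, 1.0   # per-bank hybrid-bonding links

shared_gb_s = hbm_pins * hbm_gbps_per_pin / 8
banked_gb_s = banks * link_bits * link_gbps / 8

print(f"Shared interface:    {shared_gb_s:7.0f} GB/s")  # ~819 GB/s
print(f"Bank-level 3D links: {banked_gb_s:7.0f} GB/s")  # 2048 GB/s
```

Even at a modest per-pillar rate, the bank-level links win in aggregate because their count is not bounded by a shared shoreline interface.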


The team also explored the design space of this bank-level storage-computing integrated architecture, including the compute capability matched to each DRAM bank, the on-chip cache size of the compute bank, and the area overhead introduced by 3D integration, and analyzed in depth the reliability and heat-dissipation issues of a three-dimensional architecture. The result is a complete storage-computing integrated design that decisively breaks through the memory-wall bottleneck and provides strong support for running large AI models.


Similarity-Aware Three-Dimensional Storage-Computing Architecture


To further improve system performance, the design team proposed a similarity-aware three-dimensional storage-computing architecture. Experiments show that when activation data is stored contiguously in the storage array, the data within a local region tends to be similar, which the paper attributes to a cluster-similarity effect in the stored data. Exploiting this property, the design lets each compute bank independently and in parallel mine the similarity of the data in its corresponding DRAM bank and use the similar data to accelerate computation and improve system performance.


This storage-computing integrated design overcomes three key technical difficulties:

1. How to find similar data: a DRAM bank's address space is large, so traversing it to search for similar data would introduce huge power and time overheads.

2. How to use similar data: previous storage-computing integrated units were not designed around data similarity and could not fully exploit the performance gains it offers.

3. How to balance work across data of differing similarity: because the compute banks in the three-dimensional architecture run independently and in parallel, system performance is bounded by the most heavily loaded compute bank.

To resolve these difficulties, the architecture introduces three key techniques:


1. DRAM Bank Similar-Data Search Based on a Hotspot Mechanism


The research team proposed a hotspot mechanism to search for similar data quickly. Hotspot data is data that represents the information of a region, i.e., it is highly similar to most data in that region. The design uses a content-addressable unit to collect the hotspot data of different regions. When new data is read from the DRAM bank, the unit is first searched for a matching regional hotspot, which then serves as the reference value for differential (XOR) operations against subsequently read data (as shown in Figure 2). Because the data are similar, the differential results are typically highly sparse and can be used to accelerate computation.



Figure 2 Similarity-aware hardware acceleration unit
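A minimal Python sketch of the hotspot idea, assuming 8-bit activations and a two-entry hotspot table; the table size and the minimum-popcount match policy are illustrative stand-ins for the paper's content-addressable unit, not its actual design.

```python
import numpy as np

def xor_differential(data, hotspots):
    """XOR each value against its best-matching hotspot.

    `hotspots` stands in for the content-addressable unit: a small
    table of representative values per region. Matching by minimum
    popcount of the XOR is an illustrative policy, not the paper's.
    """
    diffs = np.empty_like(data)
    for i, value in enumerate(data):
        xors = np.bitwise_xor(hotspots, value)
        popcounts = [bin(int(x)).count("1") for x in xors]
        diffs[i] = xors[int(np.argmin(popcounts))]
    return diffs

# Clustered activations differ from their hotspot only in the low bits,
# so the differential values are small and their high bits are all zero.
rng = np.random.default_rng(0)
region = (200 + rng.integers(-3, 4, size=16)).astype(np.uint8)
hotspots = np.array([200, 64], dtype=np.uint8)
print(xor_differential(region, hotspots))
```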


2. Progressive Sparse Computing Units for Similar-Data Characteristics


After DRAM bank data is read out and differenced by the pre-processing unit, the XOR result tends to contain long runs of 0s in the high bit positions, because the hotspot data is similar to the regional data in the DRAM bank. Targeting this sparsity, the storage-computing integrated architecture implements a progressive sparse-detection mechanism. The full word is first divided into blocks by bit significance, and each block is checked for being all zero. All-zero blocks are skipped outright, while a scoreboard hardware unit quickly locates the non-zero, valid blocks. After sparse detection, the scoreboard unit sends only the non-redundant data blocks to the PE array for computation, skipping the sparse bits and improving computing efficiency (as shown in Figure 3).


Figure 3 Progressive sparse computing unit
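A small sketch of progressive sparse detection, assuming 16-bit differential words split into 4-bit blocks (both widths are illustrative assumptions); only non-zero blocks would be forwarded to the PE array.

```python
def nonzero_bit_blocks(diff, block_bits=4, word_bits=16):
    """Split a differential value into bit blocks, most significant
    first, and return only the non-zero blocks with their positions.

    This plays the role of the scoreboard: all-zero blocks are skipped
    so the PE array never sees them. Block/word widths are assumed.
    """
    blocks = []
    for shift in range(word_bits - block_bits, -1, -block_bits):
        chunk = (diff >> shift) & ((1 << block_bits) - 1)
        if chunk:                      # skip all-zero blocks entirely
            blocks.append((shift, chunk))
    return blocks

# A small differential (high bits zero) needs only one 4-bit block.
print(nonzero_bit_blocks(0x000D))  # [(0, 13)]
print(nonzero_bit_blocks(0x1B0D))  # [(12, 1), (8, 11), (0, 13)]
```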


3. Load-Balancing Mechanism for Data-Similarity Differences


This storage-computing integrated architecture relies on bank-level parallelism, yet the data similarity in the DRAM banks served by different compute units can vary widely (as shown in Figure 4), because similarity is detected dynamically by hardware at runtime and cannot be determined in advance during task mapping. To address the resulting load imbalance across compute banks, the scheme uses the data similarity between DRAM banks to compress task loads and redistribute tasks among the compute banks, reducing the bandwidth consumed on the inter-die routing network while achieving bank-level load balancing and higher performance.


Figure 4 Load imbalance due to data similarity differences
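A toy sketch of bank-level rebalancing under assumed integer loads; the greedy one-unit-at-a-time policy is illustrative only, since the paper redistributes compressed task blocks over the inter-die routing network.

```python
def rebalance(loads, max_rounds=100):
    """Greedy bank-level load-balancing sketch.

    `loads` holds each compute bank's remaining work after sparsity is
    exploited: banks whose DRAM data is highly similar finish with less
    to do. Each round moves one unit of work from the heaviest bank to
    the lightest one; the policy is illustrative, not the paper's.
    """
    loads = list(loads)
    for _ in range(max_rounds):
        hi = max(range(len(loads)), key=loads.__getitem__)
        lo = min(range(len(loads)), key=loads.__getitem__)
        if loads[hi] - loads[lo] <= 1:
            break  # balanced to within one unit of work
        loads[hi] -= 1
        loads[lo] += 1
    return loads

# Lightly loaded banks (highly similar data) absorb work from heavy ones.
print(rebalance([12, 3, 9, 4]))  # -> [7, 7, 7, 7]
```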


This work covers the complete storage-computing integrated architecture design, the circuit implementation of its units, and analysis of performance, power, and area. Experimental results show that on system-level AI workloads, the architecture improves effective computing throughput by 3.34 to 9.61 times (as shown in Figure 5) and computing energy efficiency by 5.69 to 28.13 times (as shown in Figure 6) over publicly reported high-performance AI chips [3-5]. The three-dimensional solution also improves area efficiency by 3.82 to 10.98 times.


Figure 5 Effective throughput improvement


Figure 6 Effective energy efficiency improvement


Summary


Traditional 2D/2.5D integration suffers from the memory wall because data moves slowly between storage and computing units. By physically bonding computing units to storage units, the 3D storage-computing integrated architecture greatly reduces data read/write latency, improves data-movement efficiency, and substantially boosts AI model processing performance. This helps address the severe memory bottleneck that large language models currently face on traditional hardware.


References

[1] J. Lee et al., "13.4 A 48GB 16-High 1280GB/s HBM3E DRAM with All-Around Power TSV and a 6-Phase RDQS Scheme for TSV Area Optimization," 2024 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 2024, pp. 238-240, doi: 10.1109/ISSCC49657.2024.10454440.


[2] Jin Hyun Kim et al., "Aquabolt-XL: Samsung HBM2-PIM with In-Memory Processing for ML Accelerators and Beyond," Hot Chips 33, 2021.


[3] Drago Ignjatović, Daniel W. Bailey, and Ljubisa Bajić, "The Wormhole AI Training Processor," 2022 IEEE International Solid-State Circuits Conference (ISSCC), vol. 65, pp. 356-358, 2022.


[4] Thomas Norrie, Nishant Patil, Doe Hyun Yoon, George Kurian, and David Patterson, "The Design Process for Google's Training Chips: TPUv2 and TPUv3," IEEE Micro, 2021.


[5] Norman P. Jouppi, Doe Hyun Yoon, Matthew Ashcraft, Mark Gottscho, Thomas B. Jablin, George Kurian, James Laudon, Sheng Li, Peter Ma, Xiaoyu Ma, Thomas Norrie, Nishant Patil, Sushma Prasad, Cliff Young, Zongwei Zhou, and David Patterson, "Ten Lessons from Three Generations Shaped Google's TPUv4i: Industrial Product," 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), pp. 1-14, 2021.

