Tsinghua Software Defined Chip Team Proposes Inter-DIMM Broadcast Technology to Overcome Communication Bottlenecks in DIMM Near-Memory Computing Systems | ISCA 2021
By Yunzhong, reporting from Aofei Temple
Edited by QuantumBit | Public Account QbitAI
From June 14 to June 17, 2021, the 48th International Symposium on Computer Architecture (ISCA) was held online. The team of Professors Wei Shaojun and Liu Leibo from Tsinghua University presented a paper entitled "ABC-DIMM: Alleviating the Bottleneck of Communication in DIMM-based Near Memory Processing with Inter-DIMM Broadcast".
This report proposes a communication optimization method based on inter-DIMM broadcast to address the communication bottleneck of DIMM (dual in-line memory module) near-memory computing architectures. The method exploits both the scalability of broadcast on the memory bus and the wide applicability of the broadcast mechanism, giving DIMM near-memory computing a powerful new tool for communication optimization.
The speaker Sun Weiyi is the first author of the paper (as shown in Figure 1) and is currently pursuing a doctorate degree at the School of Integrated Circuits, Tsinghua University. The corresponding author of the paper is Professor Liu Leibo, and the main collaborators include Li Zhaoshi, Yin Shouyi, etc.
△ Figure 1 The main work of Sun Weiyi's paper report
Currently, with the widespread deployment of data-intensive applications, traditional main memory systems have been unable to cope with the growing capacity and bandwidth requirements. To meet this challenge, many near-memory computing architectures have been proposed, among which the DIMM-based near-memory computing architecture is recognized as one of the most promising architectures (as shown in Figure 2).
This architecture integrates computing logic into the DIMM buffer chip and lets multiple DIMMs in a memory channel access memory and compute in parallel. This raises the total memory access bandwidth, offering greater potential for performance gains at lower design and production cost. However, performance gains in DIMM near-memory computing systems depend on increasing the number of DIMMs, and the existing point-to-point communication mechanism between DIMMs over the memory bus may seriously limit how well system performance scales with the number of DIMMs.
Specifically, when the number of DIMMs in a memory channel increases, the average point-to-point communication bandwidth allocated to each DIMM decreases rapidly. For many important data-intensive applications, the communication between each DIMM and the CPU dominates the program's running time, greatly limiting the overall performance of the system.
△ Figure 2 DIMM-based near-memory computing architecture
To address this problem, the team of Wei Shaojun and Liu Leibo proposed the inter-DIMM broadcast technology.
From a hardware perspective, the bus system naturally supports broadcast at the physical level, and the effective broadcast bandwidth of the main memory bus naturally expands as the number of DIMMs increases. From a software perspective, a large number of data-intensive applications can be implemented in a "broadcast-dominated" manner.
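The scaling argument can be illustrated with a back-of-the-envelope model. The sketch below is an idealized assumption, not the paper's evaluation model: it ignores protocol overhead and bus turnaround, treats every DIMM as a receiver, and uses hypothetical function names. The 19.2 GB/s figure is only an illustrative peak rate for a DDR4-2400 channel.

```python
def p2p_share(bus_bw: float, n_dimms: int) -> float:
    """Average point-to-point bandwidth per DIMM when n_dimms share one bus."""
    return bus_bw / n_dimms


def broadcast_effective_bw(bus_bw: float, n_dimms: int) -> float:
    """Bytes delivered per second, summed over all receivers of a broadcast."""
    return bus_bw * n_dimms


def time_to_deliver(size: float, bus_bw: float, n_dimms: int, broadcast: bool) -> float:
    """Time to get the same `size` bytes to every DIMM on the bus."""
    # Point-to-point repeats the transfer once per DIMM; broadcast sends it once.
    transfers = 1 if broadcast else n_dimms
    return transfers * size / bus_bw


if __name__ == "__main__":
    BUS_BW = 19.2e9  # bytes/s, illustrative assumption
    SIZE = 1e9       # bytes of shared input data, illustrative assumption
    for n in (2, 4, 8):
        t_p2p = time_to_deliver(SIZE, BUS_BW, n, broadcast=False)
        t_bc = time_to_deliver(SIZE, BUS_BW, n, broadcast=True)
        print(f"{n} DIMMs: p2p {t_p2p:.3f} s, broadcast {t_bc:.3f} s, "
              f"per-DIMM p2p share {p2p_share(BUS_BW, n) / 1e9:.1f} GB/s")
```

In this simplified model the point-to-point delivery time grows linearly with the number of DIMMs, while the broadcast time stays constant, which is the sense in which the effective broadcast bandwidth grows with the DIMM count.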
Based on the above ideas, the team designed the ABC-DIMM system, which eliminates the communication bottleneck in the DIMM near-memory computing architecture by implementing and utilizing "inter-DIMM broadcast" in the main memory. The system consists of three parts.
First, the team designed a "broadcast-compute" programming framework that guides programmers to implement applications in a broadcast-dominated manner, so that software can make full use of inter-DIMM broadcast to optimize communication. As shown in Figure 3 (a), the framework divides work into tasks by partitioning the outputs, while communication between tasks is dominated by broadcasting the input data.
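As a rough illustration of the idea of splitting outputs and broadcasting inputs, the sketch below uses a matrix-vector product. It is not the ABC-DIMM API: the function names are hypothetical, the DIMMs are simulated by a plain loop, and the assumption that each DIMM already holds its own matrix rows locally is ours.

```python
def split_rows(n_rows, n_dimms):
    """Assign each simulated DIMM a contiguous slice of output rows."""
    base, extra = divmod(n_rows, n_dimms)
    slices, start = [], 0
    for d in range(n_dimms):
        stop = start + base + (1 if d < extra else 0)
        slices.append(range(start, stop))
        start = stop
    return slices


def broadcast(value, n_dimms):
    """Stand-in for inter-DIMM broadcast: every DIMM receives the same input."""
    return [value] * n_dimms


def broadcast_compute_matvec(matrix, vector, n_dimms):
    """Compute matrix @ vector with outputs partitioned across simulated DIMMs."""
    row_slices = split_rows(len(matrix), n_dimms)
    inputs = broadcast(vector, n_dimms)  # the only communication step
    result = [0] * len(matrix)
    for dimm, rows in enumerate(row_slices):
        x = inputs[dimm]
        # Each DIMM writes only its own output rows, so tasks never exchange
        # partial results with one another.
        for r in rows:
            result[r] = sum(a * b for a, b in zip(matrix[r], x))
    return result


if __name__ == "__main__":
    m = [[1, 2], [3, 4], [5, 6]]
    print(broadcast_compute_matvec(m, [1, 1], n_dimms=2))
```

Because every output row is owned by exactly one DIMM, the shared input vector is the only data that has to travel, and it travels as a single broadcast.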
Second, the team provides a complete inter-DIMM broadcast mechanism covering both broadcast within a memory channel and broadcast between memory channels, as shown in Figure 3 (b) and (c). With these mechanisms, the communication of the "broadcast-compute" framework can be implemented efficiently across multiple memory channels, as shown in Figure 3 (d).
Finally, the team provided a full-stack hardware and API design for the inter-DIMM broadcast mechanism. To keep the implementation simple and inexpensive, the team confined the design changes to the DIMM buffer chip and the CPU's memory controller. Specifically, by adding a command translation module to the buffer chip, inter-DIMM broadcast can be integrated into the main memory system in the form of new DDR commands without changing the DRAM chips. In addition, through limited modifications to the memory controller and a corresponding API design, software can use inter-DIMM broadcast effectively without any change to the ISA.
Simulation evaluation shows that the average performance of ABC-DIMM is 2.50 times and 2.93 times that of two mainstream baseline near-memory systems, respectively.
Over the past 10 years, the team of Professors Wei Shaojun and Liu Leibo has made a number of important technological breakthroughs in the field of software-defined chips. Its key technologies have been widely applied in many major national projects. The team has won the second prize of the National Technological Invention Award, the first prize of the Technological Invention Award of the Ministry of Education, the first prize of the Technological Invention Award of the Institute of Electronics, and the China Invention Patent Gold Award, and its work was named among the 15 World Leading Internet Scientific and Technological Achievements at the World Internet Conference.
△ Figure 3 (a) "Broadcast-compute" programming framework (b) Broadcast mechanism within the memory channel (c) Broadcast mechanism between memory channels (d) Multi-core implementation of the communication part of the "broadcast-compute" framework under multiple memory channels
About ISCA
ISCA (International Symposium on Computer Architecture) is a leading international conference for presenting new ideas, methods, and results in computer architecture. It is regarded as one of the most authoritative conferences in the field and, along with MICRO and HPCA, is one of the top three architecture conferences. Superscalar architecture, multi-level caches, simultaneous multithreading, and cache coherence were all first proposed at ISCA. Since 1973, ISCA has been held 48 times.