"Computing power, storage power, and transportation power" are the three swords of today's data centers. As the number of parameters in generative AI and multimodal large models grows exponentially, their demand for computing capacity and high-bandwidth memory is becoming more urgent.
HBM (High Bandwidth Memory) is highly favored for its enormous memory bandwidth and has gradually become the GPU's "strongest ally." In other words, to bring out the full capability of a GPU, the HBM subsystem must be done well.
Today, JEDEC is moving toward finalizing the HBM4 memory specification, and the industry is accelerating HBM4 commercialization. In September this year, Rambus announced the industry's first HBM4 controller IP. As a vendor that has moved quickly on HBM4, how does Rambus understand HBM4, and how did it complete product iteration so fast? Recently, Rambus shared more details with EEWorld at a communication meeting.
Why HBM Wins in the AI Era
According to Rambus researcher and distinguished inventor Dr. Steven Woo, AI work can broadly be divided into two phases: AI training and AI inference. AI training is among the most challenging and difficult tasks in computing today, and the volume of data managed and processed at this stage is enormous. Completing training faster means the AI model can be put into use earlier, maximizing return on investment. AI inference, in turn, has high performance requirements, especially for inference speed and accuracy.
The two phases place different demands on memory: during training, memory needs to be fast enough, large enough in capacity, and compact enough; during inference, memory also needs lower latency and higher bandwidth, because inference results must be delivered quickly, almost in real time.
This trend is reflected in market data. According to research, demands on memory speed, capacity, and size have been growing more than tenfold per year, a trend that has shown no sign of slowing since 2012.
For example, GPT-3, released in 2020, uses 175 billion parameters, while the latest models, such as GPT-4o released in May this year, are reported to exceed 1.5 trillion parameters, nearly a 9-fold increase. In the same period, per-device memory capacity has only roughly doubled. This means that to run these AI models, additional GPUs and AI accelerators must be deployed to meet the demand for memory capacity and bandwidth.
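A quick, illustrative check of the gap: dividing the quoted parameter counts by the memory-capacity growth gives a rough multiplier for how many more accelerators are needed (the 2x memory-growth figure is from the text; the calculation is a sketch, not a sizing method).

```python
# Illustrative model-size vs. memory-capacity gap, using figures quoted above.
gpt3_params = 175e9           # GPT-3 parameter count
gpt4_params = 1.5e12          # GPT-4-class parameter count (reported)
param_growth = gpt4_params / gpt3_params       # ~8.6x more parameters
memory_growth = 2.0                            # per-device capacity growth (from text)
devices_needed = param_growth / memory_growth  # ~4.3x more accelerators
print(f"{param_growth:.1f}x parameters, {devices_needed:.1f}x more devices")
# prints "8.6x parameters, 4.3x more devices"
```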
At present, mainstream DRAM comes in several forms: DDR, LPDDR, GDDR, and HBM. DDR is the memory of the traditional PC world. LPDDR is low-power DDR designed for mobile devices, and it also performs well in AI edge-inference systems. GDDR is DDR designed for graphics processing and is now widely used in AI inference tasks; because it strikes a good balance between bandwidth, cost, and reliability, it also sees use in automotive and networking applications. HBM offers extremely high bandwidth density, far beyond ordinary DRAM on the market, making it well suited to AI training, high-performance computing, and networking.
Many AI systems will choose the corresponding memory type based on the specific needs of the actual application scenario. But from a performance perspective, HBM is far ahead.
Why does HBM have advantages in bandwidth, capacity and speed?
Steven Woo explained the structure: in an HBM package, the DRAM stack and the processor sit on a silicon interposer, the interposer sits on the package substrate, and the substrate is soldered to the PCB. The DRAM stack uses a multi-layer architecture to raise memory bandwidth, capacity, and energy efficiency, and its direct proximity to the processor enables extremely fast transfers. More importantly, taking HBM3 as an example, 1024 wires connect the DRAM stack to the SoC, and thousands of signal paths far exceed what a standard PCB can support. Because the silicon interposer is fabricated much like an integrated circuit, it can carry signal traces at very fine pitch, providing the number of signal lines the HBM interface requires.
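To see why the 1024-wire interface matters, per-stack bandwidth follows directly from bus width and per-pin rate. A minimal check, assuming HBM3's 6.4 Gb/s maximum per-pin data rate:

```python
# Per-stack HBM3 bandwidth from the 1024-bit interface described above.
bus_width_bits = 1024    # wires between the DRAM stack and the SoC
pin_rate_gbps = 6.4      # HBM3 per-pin data rate (assumed maximum)
bandwidth_gb_s = bus_width_bits * pin_rate_gbps / 8  # bits -> bytes
print(f"{bandwidth_gb_s:.1f} GB/s per stack")
# prints "819.2 GB/s per stack"
```

No PCB can route 1024 traces to a single device at this pitch, which is why the interposer is essential.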
It is precisely this sophisticated structural design and the stacking approach of HBM DRAM that lets HBM deliver extremely high memory bandwidth, excellent energy efficiency, and very low latency in a minimal footprint.
The rapid development of the HBM specification is driven by the growing demand for AI applications as they evolve from machine learning to more general and widely deployed AI. These applications pose critical performance and efficiency challenges to the underlying computing infrastructure.
What is different about HBM4?
If you follow memory news, you will have noticed the giants' recent focus on HBM4: SK Hynix's next-generation HBM4 taped out in October and will be used in NVIDIA's Rubin R100 AI GPU, with NVIDIA, TSMC, and SK Hynix forming a "triangle alliance" to develop the next-generation AI GPU + HBM4; Samsung will manufacture the logic die for its next-generation HBM4 memory on its 4nm node, entering mass production at the end of 2025, and will also develop "custom HBM4" solutions for Meta and Microsoft; Micron is focusing on hybrid bonding technology for HBM4.
Regarding HBM's development, Steven Woo explained that from the first-generation HBM through HBM2, HBM2E, HBM3, and HBM3E to the latest HBM4, the most obvious generational change is the sharp increase in single-stack bandwidth; a single HBM3E device now exceeds 1.2 TB/s. Judging from current market trends, HBM3 devices have become very popular, and major DRAM manufacturers SK Hynix, Micron, and Samsung have all announced HBM3E devices with data rates of up to 9.6 Gb/s.
Although JEDEC has not yet finalized the HBM4 specification, it is already certain that per-stack bandwidth will exceed HBM3E's. A single HBM4 stack will reach 1.6 TB/s, and the final figure may be higher. A GPU with 8 HBM4 devices would therefore achieve an aggregate memory bandwidth of more than 13 TB/s.
Compared with HBM3, the number of channels per stack doubles and the physical footprint grows. HBM4 will support 6.4 Gb/s speeds initially, with higher data rates under discussion. Although the per-pin rate is lower than HBM3E's, widening the interface from 1024 bits to 2048 bits significantly increases overall bandwidth.
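The quoted figures check out directly: doubling the interface to 2048 bits at 6.4 Gb/s per pin yields both the per-stack number and, for the 8-stack GPU example above, the aggregate:

```python
# HBM4 bandwidth from the quoted 2048-bit interface at 6.4 Gb/s per pin.
interface_bits = 2048
pin_rate_gbps = 6.4
per_stack_tb_s = interface_bits * pin_rate_gbps / 8 / 1000  # bits -> bytes -> TB/s
aggregate_tb_s = 8 * per_stack_tb_s                         # 8-stack GPU example
print(f"per stack: {per_stack_tb_s:.2f} TB/s, 8 stacks: {aggregate_tb_s:.1f} TB/s")
# prints "per stack: 1.64 TB/s, 8 stacks: 13.1 TB/s"
```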
Details behind the HBM4 Controller IP
As the industry introduces faster and faster HBM memory devices, Rambus plays an important role in this process as a memory controller IP provider. Its industry-first HBM4 controller IP is designed to accelerate next-generation AI workloads.
Steven Woo explained that Rambus' HBM4 controller IP provides 32 independent channel interfaces with a total data width of 2048 bits. At this data width and a 6.4 Gb/s data rate, HBM4's total memory throughput reaches 1.64 TB/s, double that of HBM3.
The Rambus HBM4 controller core accepts commands over a simple local interface and converts them into the command sequences the HBM4 device requires, while also handling all initialization, refresh, and power-down functions. The core queues multiple commands in a command queue, which allows both efficient handling of short bursts to highly random addresses and optimal bandwidth utilization for long bursts across contiguous address space. The command queue also performs look-ahead activate, precharge, and auto-precharge at the right time, further improving overall throughput. The reorder function is fully integrated into the controller command queue, improving throughput while minimizing gate count.
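To illustrate the reordering idea, a conceptual sketch (not the Rambus implementation): a controller can serve requests that hit the currently open DRAM row before row misses, avoiding unnecessary precharge/activate cycles. The request format here is hypothetical.

```python
# Conceptual sketch of command-queue reordering (not the actual Rambus design):
# requests that hit the currently open row are served first, since a row miss
# forces a precharge + activate before the column access can proceed.
def reorder_for_row_hits(queue, open_row):
    """Return the queue with open-row hits first; original order is preserved
    within each group (a simple form of command reordering)."""
    hits = [req for req in queue if req["row"] == open_row]
    misses = [req for req in queue if req["row"] != open_row]
    return hits + misses

pending = [
    {"row": 7, "col": 1},
    {"row": 3, "col": 4},   # row miss: would force precharge/activate
    {"row": 7, "col": 2},
]
scheduled = reorder_for_row_hits(pending, open_row=7)
print([r["row"] for r in scheduled])   # prints [7, 7, 3]
```

A real controller must also bound how long a miss can be deferred (to avoid starvation) and respect DRAM timing parameters; those details are omitted here.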
To summarize briefly, the Rambus HBM4 controller IP features:

- Support for all standard HBM4 channel densities (up to 32 Gb)
- Data rates of up to 10 Gb/s per pin
- Refresh management (RFM) support
- Look-ahead command processing to maximize memory bandwidth and minimize latency
- Integrated reorder function
- High clock rates with minimal routing restrictions
- Self-refresh and power-down low-power modes
- HBM4 RAS support
- Built-in hardware-level performance activity monitor
- DFI compatibility
- End-to-end data parity
- AXI or native interface to user logic
More importantly, the controller can be delivered standalone or integrated with the customer's choice of third-party PHY. Delivery includes the core (source code), testbench (source code), complete documentation, expert technical support, and maintenance updates.
Steven Woo emphasized that, like the previous-generation HBM3E controller, the HBM4 controller IP is a modular, highly configurable solution. "Based on customers' unique needs in their application scenarios, we provide customized services covering size, performance, and functionality. Key optional features include ECC, RMW (read-modify-write), and error scrubbing. In addition, to ensure customers can choose from a variety of third-party PHYs and apply them to their systems as needed, we have partnered with leading PHY suppliers so that customers can achieve first-time-right tape-outs during development."
Challenges in implementing HBM4
Latest update time: 2024-11-17 03:30