Kill HBM?

2024-07-30


The view that AI cannot do without HBM is gaining popularity.


Training large AI models involves massive amounts of parallel data processing, which demands both high compute and high bandwidth: compute determines how much data can be processed each second, while bandwidth determines how much data can be fetched each second. The GPU supplies the compute; the memory supplies the bandwidth.
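This compute/bandwidth split can be made concrete with a roofline-style back-of-envelope calculation. The sketch below is purely illustrative; the accelerator figures and the 70B-parameter model are assumptions for the example, not numbers from this article.

```python
# Minimal roofline-style sketch of the compute-vs-bandwidth trade-off described above.
# All hardware numbers here are illustrative assumptions, not any vendor's specification.

peak_compute_tflops = 1000.0   # assumed accelerator peak compute, in TFLOPS
mem_bandwidth_tbs   = 3.0      # assumed memory bandwidth, in TB/s

# Arithmetic intensity (FLOPs per byte) below which the chip is limited by bandwidth, not compute.
ridge_point = (peak_compute_tflops * 1e12) / (mem_bandwidth_tbs * 1e12)
print(f"bandwidth-bound below ~{ridge_point:.0f} FLOPs per byte")

# Example: generating one token with an assumed 70B-parameter model in FP16 reads
# roughly 140 GB of weights and performs roughly 140 GFLOPs, i.e. about 1 FLOP per byte,
# so token generation is bandwidth-bound and memory bandwidth sets the speed limit.
params           = 70e9
bytes_per_token  = params * 2                 # FP16 weights read once per token
flops_per_token  = 2 * params
max_tokens_per_s = (mem_bandwidth_tbs * 1e12) / bytes_per_token
print(f"upper bound: ~{max_tokens_per_s:.0f} tokens/s per accelerator")
```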


The current situation is that people may be able to do without Nvidia's GPUs, but they absolutely cannot do without HBM from SK Hynix, Samsung or Micron. Nvidia has the CUDA moat, yet it cannot completely stop users from migrating to other vendors. HBM is different: whether it is AMD, Intel, or various custom chips, every one of them is packed with HBM, without exception.


But HBM is not ordinary DRAM, and its price has reached staggering levels: at the same density, HBM costs roughly five times as much as DDR5. HBM is reportedly the third-largest cost item in an AI server, accounting for about 9% of the bill of materials, with the HBM in a single machine carrying an average selling price of around US$18,000.
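Taking these figures at face value, a quick back-of-envelope check shows what they imply for the overall server bill of materials and for the DDR5-equivalent cost; the arithmetic below simply reuses the 9%, US$18,000 and 5x numbers quoted above.

```python
# Back-of-envelope check on the figures quoted above (article numbers, not independent data).
hbm_cost_per_server = 18_000   # USD of HBM content per AI server, per the article
hbm_share_of_bom    = 0.09     # HBM ~9% of AI server cost, per the article
hbm_vs_ddr5_ratio   = 5        # HBM ~5x DDR5 price at the same density, per the article

implied_server_bom = hbm_cost_per_server / hbm_share_of_bom    # ~$200,000 total BOM
ddr5_equivalent    = hbm_cost_per_server / hbm_vs_ddr5_ratio   # ~$3,600 for the same density in DDR5
print(f"implied server BOM:        ~${implied_server_bom:,.0f}")
print(f"same density in DDR5:      ~${ddr5_equivalent:,.0f}")
```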


(Image from Micron)


Even at these prices, HBM remains in short supply and prices keep rising. TrendForce said in May this year that negotiations over 2025 HBM pricing began in 2Q24, and that with overall DRAM capacity limited, suppliers have preliminarily raised prices by 5–10% to manage the constraint, affecting HBM2e, HBM3 and HBM3e alike.


TrendForce also noted that demand from major AI solution providers is clearly shifting toward HBM3e, with 12-Hi stacked products expected to take a growing share, which will push up the HBM capacity of a single chip. It estimates that HBM demand will grow nearly 200% in 2024 and double again in 2025.


Giants that can afford it will keep paying more for higher-capacity HBM, but for small and medium-sized players, expensive HBM has become the biggest obstacle on the road to large models.


Who will do something about expensive AI memory?




Silicon Sage, we need to bring down the price of memory




"Silicon Sage" Jim Keller has criticized the high prices of current AI chips more than once.


Who is Jim Keller? His career spans DEC, AMD, SiByte, Broadcom, PA Semi, Apple, Tesla, and Intel. From AMD's K8 architecture, to Apple's A4 and A5 processors, to AMD's Zen architecture, to Tesla's FSD autonomous driving chip, this legendary engineer had a hand in them all.


In 2021, he left Intel and joined Tenstorrent, an AI chip startup in Toronto, Canada, as the company's CTO, responsible for developing the next generation of AI chips.


Keller has been working to bring down the cost of AI hardware, seeing it as an opening for startups like Tenstorrent to challenge giants like Nvidia. He has even suggested that Nvidia could have saved $1 billion on its Blackwell GPUs by using Ethernet interconnect technology.


"There are a lot of markets that are not well served by Nvidia," Keller said in an interview with Nikkei Asia. As the application of AI in smartphones, electric vehicles and cloud services continues to expand, more and more companies are looking for cheaper solutions. He mentioned, "There are a lot of small companies that are not willing to pay $20,000 for Nvidia's high-end GPUs, which are considered the best choice in the market."


Tenstorrent is preparing to sell its second-generation all-purpose AI chip by the end of this year. The company says its energy efficiency and processing efficiency are better than Nvidia's AI GPUs in some areas. According to Tenstorrent, its Galaxy system is three times more efficient than Nvidia's DGX AI server and costs 33% less.


Keller said one reason for the achievement is that the company doesn’t use high-bandwidth memory (HBM), an advanced memory chip that can quickly transfer large amounts of data. HBM is a key component of generative AI chips and has played a major role in the success of Nvidia’s products.


However, HBM is also one of the main culprits for the high energy consumption and high price of AI chips. "Even people who use HBM are struggling with its cost and design time," said Keller, so he made a technical decision not to use this technology.


In a typical AI chipset, the GPU shuttles data to and from memory at every stage of processing, which is precisely what demands HBM's high-speed transfers. Tenstorrent, however, has designed its chips to drastically reduce such transfers. Keller said that with this approach, the company's chips can replace both GPUs and HBM in certain areas of AI development.
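The data-movement argument can be sketched in a few lines. The following is a conceptual model only, with made-up layer sizes and a simple layer-by-layer pipeline; it is not Tenstorrent's actual dataflow, but it shows why keeping intermediate activations on-chip shrinks external-memory traffic.

```python
# Conceptual sketch (not Tenstorrent's actual implementation) of why keeping intermediate
# activations on-chip cuts external-memory traffic, assuming a simple layer pipeline.

def offchip_traffic_bytes(layer_activation_bytes, layer_weight_bytes, keep_activations_on_chip):
    """Bytes that must cross the external memory interface for one forward pass."""
    traffic = sum(layer_weight_bytes)        # weights are read from external memory either way
    if not keep_activations_on_chip:
        # GPU-style flow: each layer writes its output to memory and the next layer reads it back.
        traffic += 2 * sum(layer_activation_bytes)
    return traffic

acts    = [64e6] * 32    # assumed 64 MB of activations per layer, 32 layers
weights = [128e6] * 32   # assumed 128 MB of weights per layer

print(offchip_traffic_bytes(acts, weights, keep_activations_on_chip=False) / 1e9, "GB")  # ~8.2 GB
print(offchip_traffic_bytes(acts, weights, keep_activations_on_chip=True) / 1e9, "GB")   # ~4.1 GB
```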


He also said the company is designing its products to be “cost-effective” whenever possible. He added that many other companies are also looking for better memory solutions, but he was careful to admit that disrupting the existing massive HBM industry will take years.


Keller predicts that more new players will emerge to fill various AI markets that Nvidia fails to serve, rather than one company replacing Nvidia.


It is worth mentioning that Tenstorrent's chief CPU architect, Wei-Han Lien, has previously expressed a similar view, arguing that the company's more pragmatic, economical approach makes its system designs more cost-effective than Nvidia's while still delivering the necessary compute.


“Customers don’t need to pay Porsche or Ferrari prices to run their generative AI models, they just need the car that’s the most cost-effective and fastest,” he said. Lien expects the current trend of expensive hardware to fade and the market to eventually stabilize.


In Jim Keller's view, overly expensive HBM has become a drag on AI development: large companies have the financial resources to absorb the cost, but smaller ones simply cannot sustain it. The chips he is responsible for at Tenstorrent are designed to solve exactly this problem.




AI chips without HBM?




In May 2020, Tenstorrent launched its first product, Grayskull, a roughly 620 mm² processor built on GlobalFoundries' 12nm process and originally designed as an inference accelerator. It contains 120 custom cores arranged in a 2D bidirectional mesh, delivers 368 TOPS of 8-bit compute, and consumes only 65 watts. Each custom core combines a data-controlled packet management engine, a packet compute engine built around Tenstorrent's Tensix core, and five RISC-V cores for non-standard operations such as conditionals. The chip focuses on sparse tensor operations, packing matrix operations into compressed data packets and pipelining the compute steps through a graph compiler and packet manager. This also enables dynamic graph execution: compute and data transfer proceed asynchronously rather than in fixed compute/transfer phases, unlike some other AI chips.


In March this year, Tenstorrent began selling two development boards based on Grayskull. According to Tenstorrent, the Grayskull e75 and e150 are its entry-level, inference-only AI graph processors. Each is built from Tensix cores, which combine a compute unit, an on-chip network interface, local cache and small RISC-V cores to move data around the chip efficiently, and the boards are aimed at adventurous ML developers seeking a cost-effective, customizable alternative to traditional GPUs.


The Grayskull e75 is a 75-watt PCIe Gen 4 card priced at $600; its NPU runs at 1GHz with 96 Tensix cores and 96MB of SRAM, and the board carries 8GB of standard LPDDR4 DRAM. The Grayskull e150 raises the clock to 1.2GHz, the core count to 120 and the on-chip SRAM to 120MB, while keeping 8GB of LPDDR4 off-chip; its power consumption rises to 200 watts and it costs $800.
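As a quick cross-check of these board figures (all numbers from the article), the SRAM per Tensix core works out to 1MB on both boards, consistent with the per-core figure mentioned just below.

```python
# Cross-check of the Grayskull board figures quoted above (all numbers from the article).
boards = {
    #                 cores  SRAM_MB  watts  USD
    "Grayskull e75":  (96,    96,      75,   600),
    "Grayskull e150": (120,  120,     200,   800),
}

for name, (cores, sram_mb, watts, price) in boards.items():
    # Both boards work out to 1 MB of SRAM per Tensix core.
    print(f"{name}: {sram_mb / cores:.1f} MB SRAM per core, {watts} W, ${price}")
```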


The Grayskull architecture differs from other data-center AI accelerators (GPUs and NPUs): each Tensix core in the array pairs several small CPUs with its compute units, which include vector and matrix engines. This fine-grained structure raises the utilization of the math units and thereby improves performance per watt. Each Tensix core also carries 1MB of SRAM, giving ample total on-chip memory, and unlike other NPUs that rely solely on large on-chip memories, Grayskull can also attach external memory.


Most importantly, Grayskull uses standard DRAM instead of expensive HBM. This alone saves more than half of the memory cost, in line with the cost-effectiveness goal Jim Keller describes.


Software is a weak link for NPUs and other processor challengers, but it is a relative strength for Grayskull. Tenstorrent offers two software flows: TT-Buda maps models from standard AI frameworks such as PyTorch and TensorFlow onto Tenstorrent hardware, while TT-Metalium gives developers direct, low-level access to the hardware and lets them build libraries for use by higher-level frameworks. Built on the Grayskull architecture, Metalium stands out for its general-purpose, computer-like programming model and may attract customers with their own low-level programming resources.


Tenstorrent has also treated power efficiency as a differentiator from the start: the e75's modest 75 watts fits within standard PCIe and OCP power envelopes, making it a sensible server add-in card for inference. Beyond the Grayskull chips and boards, Tenstorrent has begun licensing its high-performance RISC-V CPUs and Tensix cores, and is working with partners on chiplets.


Of course, this is just the beginning. After Jim Keller joined, Tenstorrent's ambitions began to grow.


In July, Tenstorrent launched a new generation of Wormhole processors designed for AI workloads, promising good performance at a low price. The company currently offers two add-on PCIe cards, each with one or two Wormhole processors, as well as TT-LoudBox and TT-QuietBox workstations designed for software developers. This launch is mainly aimed at developers, not those who use Wormhole boards for commercial workloads.


“It’s always satisfying to get more products into developers’ hands,” said Jim Keller, CEO of Tenstorrent. “The release of a development system with Wormhole cards will help developers scale and develop multi-chip AI software. In addition to this announcement, we are also pleased that the tape-out and bring-up of our second-generation product, Blackhole, is going well.”


Each Wormhole processor carries 72 Tensix cores (each containing five RISC-V cores and supporting a range of data formats), has 108MB of SRAM, and delivers 262 FP8 TFLOPS at 1GHz with a power consumption of 160W. The single-chip Wormhole n150 card is equipped with 12GB of GDDR6 memory with a bandwidth of 288GB/s.


The Wormhole processor scales flexibly to suit a wide range of workloads. In a standard workstation setup, four Wormhole n300 cards can be combined into a single unit that appears to software as one unified, extensive mesh of Tensix cores. Such a machine can be devoted to a single workload, shared across four developers, or used to run up to eight different AI models simultaneously. A key feature of this scalability is that it works natively, without virtualization. In a data-center environment, Wormhole processors scale within a single machine over PCIe and across machines over Ethernet.


From a performance perspective, Tenstorrent's single-chip Wormhole n150 card (72 Tensix cores at 1GHz, 108MB SRAM, 12GB GDDR6, and a bandwidth of 288GB/s) delivers 262 FP8 TFLOPS at 160W power consumption, while the dual-chip Wormhole n300 board (128 Tensix cores at 1GHz, 192MB SRAM, 24GB GDDR6, and a bandwidth of 576GB/s) delivers up to 466 FP8 TFLOPS at 300W power consumption (according to Tom's Hardware).


For comparison, NVIDIA's A100 does not support FP8 but does support INT8, with a peak of 624 TOPS (1,248 TOPS with sparsity), while the H100 supports FP8 with a peak of up to 1,670 TFLOPS (3,341 TFLOPS with sparsity), leaving a large gap over Tenstorrent's Wormhole n300.


What they lack in performance they make up for in price, with Tenstorrent's Wormhole n150 costing $999 and the n300 costing $1,399. By comparison, an Nvidia H100 card can cost as much as $30,000.
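Put side by side, the article's own numbers give a rough sense of performance per dollar and per watt. The sketch below uses the FP8 figures and prices quoted in this article; the H100 power figure (~700 W for the SXM module) is my own assumption, not from the text.

```python
# Rough perf-per-dollar and perf-per-watt comparison using the figures quoted in this article.
# The H100 power number (~700 W, SXM) is an assumption; all other values come from the text.

cards = {
    #                FP8 TFLOPS   watts     USD
    "Wormhole n150": (262,        160,      999),
    "Wormhole n300": (466,        300,     1399),
    "NVIDIA H100":   (1670,       700,    30000),
}

for name, (tflops, watts, price) in cards.items():
    print(f"{name:14s}  {tflops / price:6.3f} TFLOPS/$   {tflops / watts:5.2f} TFLOPS/W")
```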


In addition to the boards, Tenstorrent also offers developers workstations with four n300 cards pre-installed, including the lower-priced Xeon-based TT-LoudBox and the high-end EPYC-based TT-QuietBox.
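For a rough sense of scale, simply multiplying the per-card n300 figures quoted above across a four-card workstation gives the totals below; real delivered performance will of course be lower than this naive aggregate.

```python
# Naive aggregate resources of a four-card n300 workstation, using the per-card figures
# quoted in this article; actual delivered performance will not scale perfectly linearly.
n300  = {"tensix_cores": 128, "sram_mb": 192, "gddr6_gb": 24, "fp8_tflops": 466, "watts": 300}
cards = 4

workstation = {key: value * cards for key, value in n300.items()}
print(workstation)   # ~512 cores, ~768 MB SRAM, 96 GB GDDR6, ~1,864 FP8 TFLOPS, ~1,200 W
```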


Grayskull and Wormhole are only the first steps on Tenstorrent's roadmap; the real highlight is yet to come.


(Image from Tenstorrent)


According to the roadmap disclosed by Tenstorrent, the second-generation Blackhole chip has 140 Tensix cores, more DRAM and faster Ethernet. It also carries 16 RISC-V cores, independent of the Tensix cores, that can run an operating system without needing an x86 host CPU. The chip has taped out on TSMC N6 and bring-up is progressing smoothly.


Tenstorrent's third-generation architecture, covering Quasar and Grendel, will move to a chiplet-based design on Samsung's SF4 process and adopt an updated Tensix core that clusters four cores around a shared L2 cache to better reuse weights already held in memory. These parts are expected to launch in 2025.


Notably, none of the three subsequent chips on the roadmap uses HBM; all of them opt for GDDR6 instead. Tenstorrent and Jim Keller share one goal: to break the notion that AI silicon must be built on expensive HBM.




Finding another way around HBM




Tenstorrent isn't the only company looking to replace HBM with other memory options.


In February 2024, Groq, founded by Jonathan Ross, one of the designers of Google's first-generation TPU, announced that its new-generation LPU had beaten GPUs on inference speed in multiple public benchmarks at close to the lowest price. Subsequent third-party tests showed the chip is particularly well optimized for large language model inference, running up to 10 times faster than NVIDIA GPUs.


According to people familiar with the matter, the LPU works very differently from a GPU: it uses a temporal instruction set computer architecture, so it does not need to reload data from memory as frequently as a GPU fed by HBM. Groq instead chose SRAM, which is roughly 20 times faster than the memory used by GPUs. This choice both sidesteps the HBM shortage and helps cut costs.


One AI developer praised Groq as a "game changer" for low-latency products, latency being the time from receiving a request to returning a response. Another said Groq's LPU could bring a "revolutionary improvement" in how future AI applications consume GPUs, and saw it as a credible alternative to the "high-performance hardware" of Nvidia's A100 and H100 chips.


But Groq is not without shortcomings. It can choose SRAM largely because the LPU targets inference, not training, and inference needs far less memory; even so, a single Groq card carries only 230MB. SRAM is indeed faster than DRAM, but it is expensive and small, so the LPU, like any compute chip that uses SRAM in bulk, has to make trade-offs.
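The capacity trade-off is easy to quantify. The sketch below assumes a 70B-parameter model stored in INT8 purely for illustration; only the 230MB-per-card figure comes from the article.

```python
# Rough arithmetic for the trade-off described above: SRAM is fast but small, so holding a
# large model entirely on-chip takes many cards. Model size and precision are assumptions.

sram_per_card_bytes = 230e6   # 230 MB per Groq card, per the article
model_params        = 70e9    # assumed 70B-parameter model
bytes_per_param     = 1       # assumed INT8 weights

weight_bytes = model_params * bytes_per_param
cards_needed = weight_bytes / sram_per_card_bytes
print(f"~{cards_needed:.0f} cards just to hold the weights")   # ~304, before activations or KV cache
```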


NEO Semiconductor, founded in San Jose, California in 2012, has proposed its own HBM alternative. The company recently announced that it has developed a three-dimensional DRAM with built-in neuron circuits that accelerates AI processing by avoiding data transfers between high-bandwidth memory and the GPU.


Neo's 3D DRAM technology underpins its 3D X-AI chip, a 300-layer, 128 Gbit DRAM die with 8,000 neurons and 10 TBps of AI processing throughput. Capacity and performance can be scaled twelve-fold by stacking up to twelve 3D X-AI dies, much like HBM, yielding 192 GB (1,536 Gbit) of capacity and 120 TBps of processing throughput.
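The twelve-high scaling claim can be sanity-checked directly from the per-die figures quoted above.

```python
# Sanity check of the stacking claim above, using only the per-die figures quoted in the article.
gbits_per_die  = 128   # Gbit of 3D DRAM per 3D X-AI die
tbps_per_die   = 10    # TB/s of AI processing throughput per die
dies_per_stack = 12    # maximum stack height, per the article

capacity_gbit = gbits_per_die * dies_per_stack   # 1,536 Gbit
capacity_gb   = capacity_gbit / 8                # 192 GB
throughput    = tbps_per_die * dies_per_stack    # 120 TB/s
print(capacity_gbit, capacity_gb, throughput)
```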


“Typical AI chips use a processor-based neural network. This involves combining high-bandwidth memory to simulate synapses to store weight data, and a graphics processing unit (GPU) to simulate neurons to perform mathematical calculations,” said Andy Hsu, founder and CEO of NEO Semiconductor, in a statement. “Performance is limited by the data transfer between the HBM and the GPU, and the back-and-forth data transfer reduces the performance of the AI chip and increases power consumption.”


3D X-AI simulates an artificial neural network (ANN), including synapses for storing weight data and neurons for processing data, which Neo says makes it ideal for accelerating next-generation AI chips and applications, with Hsu adding: "AI chips with 3D X-AI use memory-based neural networks. These chips have neural network capabilities, with synapses and neurons in each 3D X-AI chip. They are used to significantly reduce the heavy workload of data transfer between the GPU and HBM when performing AI operations. Our invention significantly improves the performance and sustainability of AI chips."


Memory suppliers such as SK Hynix and Samsung have previously experimented with compute-in-memory, but the use cases were too niche to justify mass production. Neo hopes AI processing will become widespread enough to break out of that niche, and says the 3D X-AI chip can be paired with standard GPUs to deliver faster AI processing at lower cost.




HBM, not so solid?




For memory manufacturers, and SK Hynix in particular, HBM is the payoff for years of persistence. Even Samsung, which has led the industry for more than 30 years, misjudged the market and missed the opportunity on the eve of the AI wave.


HBM has flourished on the back of AI and plays an indispensable role in large models; that much is beyond doubt. But it also faces mounting challenges, especially as more cost-effective alternatives keep emerging. If HBM cannot find other ways to bring its costs down, its future position may not be as secure as it looks.

