
Power consumption has become a big problem for chips!

Latest update: 2024-05-01

Source: Semiconductor Industry Observation (ID: icbank), compiled from fierceelectronics.




Nvidia's latest giant chip, Blackwell, is a modern marvel. It packs 200 billion transistors, and when thousands of these GPUs are combined in large data centers, it is expected to provide enough processing power to handle the largest AI models.


But Blackwell and the other powerful accelerator chips coming to market have people nervous, especially data center operators and power companies, and even regulators around the world. One version of a single Blackwell chip used in data centers consumes 1,200 watts of power, an insane amount compared with chips from just a few years ago. Largely because of the growth of accelerator chips, some data centers are building their own power plants to handle the load, while regulators in Amsterdam and other European cities have told data centers they cannot expand because of limited power supplies.


It's not just Nvidia's GPUs that are huge. Blackwell is part of a trend among all chip design companies; even hyperscalers and automakers like Tesla are designing their own custom chips, often pushing the laws of physics with 3D designs and chiplets to improve energy efficiency. Tesla's Dojo chip has 25 chiplets. These design methods help efficiency, but data centers continue to evolve to support artificial intelligence, including GenAI. Today, 1.5% to 2% of the world's electricity goes to data centers, with the vast majority of that energy used in the chips and the circuit boards that support them. Data center energy consumption is growing like a hockey stick.


"This trend is unsustainable"


“The chip industry has been on an unsustainable trend,” said Henri Richard, a chip industry veteran and president of Rapidus Americas. The company is building a 2nm process node chip factory in northern Japan and has received billions of dollars in support from the Japanese government.


"A few years ago we said we couldn't go to 150 watts, and now we're at 1,200 watts! Something needs to change. If you think about taking this growth curve and projecting into the future, we can't have 3-kilowatt chips," Li said. Chad said in an interview at his U.S. offices in Santa Clara, California.


He said that shrinking the process node from 10nm to 5nm to 2nm is part of the solution. But as the benefits of Moore's Law diminish, "systems and chips need to be built differently to handle the concentration of power and the amount of cooling that can be done," he added. "Even immersion cooling struggles with chips like these. Chiplets will be a way to balance the front end and the back end."


Arm CEO Rene Haas recently wrote in a blog post that future artificial intelligence workloads will only get larger, demanding ever more compute and capability, which has woken up parts of the AI world. "Finding ways to reduce the power demands of these large data centers is critical to achieving breakthroughs in society and realizing the promise of artificial intelligence," he said. "In other words, without electricity, there is no artificial intelligence."


What challenges do power-hungry data center chips pose?


In data centers running thousands of Blackwell chips and other processors, the power load becomes so huge that engineers have to find power where there isn't enough of it, even with help from renewable sources such as solar, wind, and hydropower, along with nuclear and geothermal. And once enough power has been delivered to developable land in areas like Loudoun County, Virginia, west of Washington, D.C., the anxiety shifts to what happens inside dozens of hot server racks.


Engineers are coming up with new ways to keep circuit boards and chips cool enough to prevent them from catching fire or melting, which would spell disaster for critical data, expensive equipment, and corporate profits.


An entire industry has emerged around cooling data centers to manage the heat generated by servers and their power-hungry chips. Liquid cooling of server racks has become an art form; one of the newest approaches is full immersion throughout the data center, which raises the delicate proposition of running electrical power underwater near the humans who work around it. Meanwhile, hyperscalers are planning to build small nuclear reactors or other generators near their data center hubs to ensure reliable and abundant energy supplies.


Investors are frantically chasing more power for data centers: OpenAI CEO Sam Altman just invested $20 million in Exowatt, an energy startup focused on AI data centers. Keeping chips cool enough for optimal operation may also require air-cooling technology, which itself consumes more power, exacerbating the problem. Even so, as a rule of thumb, half of the power needed in a data center goes to lighting up the processors, from GPUs to CPUs to NPUs and whatever three-letter acronym (TLA) the next chip carries; the associated circuits and circuit boards add to the energy draw, as the sketch below illustrates.
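As a minimal sketch of that power-budget rule of thumb (the total facility load below is a hypothetical placeholder, not a figure from the article):

```python
# Hypothetical data center power budget, illustrating the rule of thumb
# that roughly half of a facility's power lights up the processors.
total_mw = 100.0          # assumed facility load, for illustration only
processor_share = 0.5     # rule of thumb cited in the article

processor_mw = total_mw * processor_share
other_mw = total_mw - processor_mw   # cooling, boards, power delivery, networking

print(f"Processors: {processor_mw:.0f} MW")
print(f"Everything else: {other_mw:.0f} MW")
```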


Nvidia's Jensen Huang defines a long-term vision for AI accelerators


Nvidia CEO Jensen Huang and many other semiconductor leaders have argued that the enormous power appetite of modern accelerator chips like Blackwell may be justified by the vast computing power AI and GenAI deliver, and by how useful these technologies will be for generations to come. The impacts on companies and customers include research and development of new pharmaceuticals, climate analysis, self-driving cars, robots, and more. Huang and his engineering team often talk about the laws of physics, and about which metals, other materials, and chip architectures can dissipate the heat that electricity generates in a server rack and then across acres of server racks.


Modern chip design has Nvidia, Intel, AMD, Qualcomm, cloud providers, and a growing number of small design companies increasing the density of circuit boards so that servers and server racks take up less floor space while each produces many teraflops of floating-point performance, and there are more servers than a year ago. Performance-per-watt metrics are often expressed as TFLOPS/watt to make it easier to compare systems and chips from different vendors.
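As a minimal sketch of how such a comparison is computed (the chip names and figures below are hypothetical, not published specifications):

```python
# Comparing accelerators by performance per watt (TFLOPS/watt).
# Names and numbers are illustrative placeholders only.
chips = {
    "Accelerator A": {"tflops": 2000.0, "watts": 1200.0},
    "Accelerator B": {"tflops": 1300.0, "watts": 650.0},
}

for name, spec in chips.items():
    tflops_per_watt = spec["tflops"] / spec["watts"]
    print(f"{name}: {tflops_per_watt:.2f} TFLOPS/W")
```

On these made-up numbers, Accelerator B is the more efficient part (2.00 vs. 1.67 TFLOPS/W) even though Accelerator A has the higher peak throughput, which is exactly the distinction the metric is designed to surface.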


Jensen Huang's CadenceLIVE talk on verticality


Huang spoke about this density and its associated power consumption at CadenceLIVE Silicon Valley in April, arguing in broad terms that the benefits of AI across the user base justify the computing density. "Remember, you design a chip once, but you ship it a trillion times," he said during the fireside chat. "You design a data center once, but you can save 6% on electricity...that's what a billion people enjoy." Huang, of course, was talking about the entire ecosystem of the broader accelerated computing category, far beyond the wattage of a single Blackwell or any other GPU. It took him a few sentences to make his point, but it is worth reading:


"The power consumption of accelerated computing is very high because the density of computers is very high," Huang said. "Whatever optimizations we do on power utilization can directly translate into higher performance, higher productivity, generate revenue or directly translate into savings. For the same performance, you can get something smaller. In accelerated computing Power management translates directly to everything you care about.

"Accelerated computing takes the tens of thousands of general-purpose servers that consume 10 or 20 times the cost and 20 or 30 times the energy, and shrinks that down into something extremely dense. That density is why people assume accelerated computing is power-hungry and expensive. But if you look at it on an iso basis, the same work done or the same throughput, you actually save a lot of money. That's why, as CPU scaling slows, we have to move to accelerated computing: you're not going to keep scaling the traditional way no matter what. Accelerated computing is essential."


Later, in the same conversation with Cadence CEO Anirudh Devgan, Huang added: "AI can actually help people save energy... Without AI, how could we achieve that 6% energy saving (in one of Cadence's examples), or a 10x saving? You make a one-time investment in training a model, and then millions of engineers can benefit from it, and billions of people will enjoy the savings for decades.


"That's how to think about cost and investment: not only case by case but, as with healthcare, vertically. You have to think vertically about saving money and saving energy, considering not just the full scope of the product you're building, but also how you design it, what you build, and the impact the product has. When you look at it vertically like that, AI is going to be transformative in helping us deal with climate change, use less electricity, and become more energy efficient."


Voices beyond Nvidia


Beyond Huang, other prominent figures in chip design and chip manufacturing have recently weighed in. TSMC CEO C.C. Wei put it this way on the company's latest earnings call: "Almost all AI innovators are working with TSMC to address the insatiable AI-related demand for energy-efficient computing power." Key word: "insatiable."


In his on-stage conversation with Huang, Cadence CEO Devgan noted that AI models can have a trillion parameters, while the human brain has 100 trillion synapses, or connections. He predicted it is only a matter of time before someone builds a very large AI model on the order of the human brain. Doing so, he said, will require "a lot of software computation, an entire data-search infrastructure, and an entire energy infrastructure."


Cadence develops and supports a variety of methods for designing more energy-efficient accelerators (Nvidia used accelerators to develop Blackwell), and it has built digital-twin systems to help data centers design their operations more efficiently.


AMD has set a goal of a 30x improvement in the energy efficiency of its products by 2025, measured against a 2020 accelerated-computing node as the baseline. The MI300X accelerator launched last year moved the company closer to that goal, progress that Sam Naffziger, AMD senior vice president and product technology architect, described in a blog post last year.


Naffziger cautioned that the industry cannot rely on smaller transistors alone; it needs a holistic design perspective spanning packaging, architecture, memory, software, and more.


Intel's neuromorphic push


Intel is also pushing energy efficiency aggressively, recently announcing that it has built the world's largest neuromorphic system for sustainable AI. Codenamed Hala Point, it uses Intel's Loihi 2 processors and supports up to 20 quadrillion operations per second, comparable to GPUs and CPUs. So far its applications are apparently limited to research.



Intel's description of Hala Point puts the whole system's maximum power consumption at 2,600 watts, a little more than twice that of an Nvidia Blackwell: "Hala Point packages 1,152 Loihi 2 processors, produced on the Intel 4 process node, in a six-rack-unit data-center chassis the size of a microwave oven. The system supports up to 1.15 billion neurons and 128 billion synapses distributed across 140,544 neuromorphic processing cores, with a maximum power consumption of 2,600 watts. It also includes more than 2,300 embedded x86 processors for auxiliary computing."


Jennifer Huffstetler, Intel's chief product sustainability officer, told Fierce Electronics by email: "Intel is looking at future compute technologies, namely neuromorphic, as a solution for AI workloads, with the promise of delivering more compute performance at lower power. Compute demand is only going to keep increasing, especially with new AI workloads, and to deliver the required performance, GPU and CPU power consumption is increasing as well."


Intel has taken a three-pronged approach to efficiency, optimizing AI models, software, and hardware. On the hardware side, Huffstetler estimates that Intel's innovations saved 1,000 terawatt-hours from 2010 to 2020. Gaudi accelerators roughly doubled energy efficiency, while Xeon Scalable processors improved it 2.2x. (Xeon is designed for data center, edge, and workstation workloads.) She claims the upcoming Gaudi 3 accelerator delivers on average 50% better inference and 40% better inference power efficiency. Intel has also moved into liquid cooling, which can cut energy use by 30% compared with air cooling in a data center.


Yes, more "efficiency," but...


Despite all the efforts of the major chip designers, the power dilemma persists. Yes, a data center may get by with fewer racks of the latest accelerators, lowering power consumption, but the growth of AI means companies will simply look to expand their computing capacity: more servers, more racks, more energy consumed. "Yes, the newer chips offer more performance per watt, but AI models keep growing too, so it's not clear that overall power requirements will come down all that much," said Jack Gold, founding analyst at J. Gold Associates.


While a Blackwell in the liquid-cooled GB200 form factor draws 1,200 watts per chip, Gold notes that a typical AI chip uses about half that power, 650 watts. He runs the energy numbers this way: add memory, interconnect, and a CPU controller, and each module's consumption can jump to 1 kilowatt. In a recent example, Meta at one point deployed 10,000 such modules (with more to come), and that alone requires 10 megawatts of power. A city the size of Cleveland, with 3 million people, uses about 5,000 megawatts, so a data center of that scale would in essence draw 0.2% of the city's power. A typical power plant generates about 500 megawatts.
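Gold's back-of-the-envelope math, reproduced as a runnable sketch from the figures quoted above (the same 10 MW is 0.2% of a 5,000 MW city load and 2% of one 500 MW plant's output):

```python
# Jack Gold's module-level power arithmetic, using the article's figures.
module_watts = 1_000    # 650 W chip plus memory, interconnect, CPU controller
modules = 10_000        # the Meta deployment cited above

datacenter_mw = modules * module_watts / 1e6   # 10 MW
city_mw = 5_000         # a Cleveland-sized city of about 3 million people
plant_mw = 500          # output of a typical power plant

print(f"Data center load: {datacenter_mw:.0f} MW")
print(f"Share of the city's draw: {datacenter_mw / city_mw:.1%}")      # 0.2%
print(f"Share of one plant's output: {datacenter_mw / plant_mw:.0%}")  # 2%
```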


"The bottom line is that AI data centers really are [up against it], trying to find regions with enough power, at a low enough cost, to supply the consumption they need," Gold said. Electricity is a data center's biggest expense after the capital cost of the equipment.


Bob O'Donnell, founding analyst at TECHnalysis Research, said he buys, to a degree, the "vertical" argument Huang made at the Cadence event in defense of AI chips' power consumption. "Accelerator chips do require more energy, but in the long run there are more positive benefits for the environment, pharmaceuticals, and other fields because of everything you learn," he told Fierce. "They are very dense, but compared with the other options they are more energy efficient."


"To summarize, the capabilities of AI chips are getting a lot of attention and attention from many different players. It's not going to be solved or go away with the huge need for more power. But GenAI's capabilities are so powerful that People feel the need to pursue it.”


Reference link:


https://www.fierceelectronics.com/ai/power-hungry-ai-chips-face-reckoning-chipmakers-promise-efficiency

