Is Nvidia's GB200 chip OK? At least it's not a big problem

Last updated: 2024-08-15

Source: Compiled from The Register

Foxconn executives claim that a small number of GB200 systems will begin shipping in the fourth quarter, but Nvidia's alleged Blackwell supply issues may not be as serious as initially thought.

"We are developing and preparing for production of new AI servers as planned, and expect to start small-batch shipments in the last quarter of 2024 and increase production in the first quarter of next year," Foxconn spokesman Wu Xiaohui said in a note.

However, Wu hinted that the product's timeline may have changed, noting that it is normal for shipping schedules to change when specifications and technology are upgraded. Whether or not this is indeed the case with Nvidia's Blackwell parts, Wu insisted that Foxconn will be the first supplier of the GB200 accelerator.

Announced this spring, the GB200 is the second generation of Nvidia's Grace superchip family, featuring a pair of 1,200W Blackwell GPUs and a 72-core Grace CPU. Thirty-six of these GB200 superchips (72 GPUs in total) are designed to fit into 18 1U servers, all interconnected by a high-speed NVLink switch fabric. The full system, called DGX NVL72, has 13.5TB of HBM3e and 1.44 exaFLOPS of FP4 performance.
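As a rough sanity check, the system totals quoted above can be broken down into per-GPU and per-server figures. This is a minimal sketch using only the numbers in this article; the variable names are illustrative, and treating 1TB as 1,024GB is an assumption made here so that the memory total divides evenly.

```python
# Figures quoted in the article for the 72-GPU rack-scale system.
SUPERCHIPS = 36
GPUS_PER_SUPERCHIP = 2
SERVERS = 18
HBM_TOTAL_TB = 13.5
FP4_TOTAL_EXAFLOPS = 1.44

gpus = SUPERCHIPS * GPUS_PER_SUPERCHIP
superchips_per_server = SUPERCHIPS // SERVERS
gpus_per_server = gpus // SERVERS

# Assumption: the 13.5TB total uses binary units (1 TB = 1,024 GB).
hbm_per_gpu_gb = HBM_TOTAL_TB * 1024 / gpus

# 1 exaFLOPS = 1,000 petaFLOPS.
fp4_per_gpu_petaflops = FP4_TOTAL_EXAFLOPS * 1000 / gpus

print(f"GPUs in the system:    {gpus}")
print(f"Superchips per server: {superchips_per_server}")
print(f"GPUs per server:       {gpus_per_server}")
print(f"HBM3e per GPU:         {hbm_per_gpu_gb:.0f} GB")
print(f"FP4 per GPU:           {fp4_per_gpu_petaflops:.0f} petaFLOPS")
```

Under these assumptions the totals work out to roughly 192GB of HBM3e per GPU, which is consistent with the 96GB figure the article arrives at later when it halves the B200.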

The comments from Foxconn executives come just a week after reports that Nvidia had warned Microsoft that shipments of its Blackwell GPUs would be delayed until the first quarter of 2025.

According to those reports, Nvidia and its manufacturing partner TSMC may have run into challenges with the advanced packaging technology used to stitch together the compute chips and HBM3e memory modules. To make matters worse, CoWoS production capacity remains extremely limited, and TSMC CEO C.C. Wei has warned that AI chip shortages could continue until 2025.

As a result, Nvidia will allegedly prioritize its flagship GB200 parts over the lower-spec HGX B100 and B200 configurations, and will bring to market a stripped-down version of Blackwell dubbed the B200A. That chip will allegedly be monolithic and feature four HBM stacks, making it about half the size of the chip we saw this spring.

In response to The Register's report, an Nvidia spokesperson reiterated that broad sampling of Blackwell has begun and that production is expected to ramp in the second half of the year.

Nvidia had previously promised that Blackwell would begin shipping to customers in the second half of 2024. At the time, that led us to believe that a handful of Blackwell chips would hit the market in the fourth quarter, with the vast majority shipping to customers in 2025.

There’s also the matter of Nvidia’s H200, which only started shipping in volume in the third quarter. These parts are essentially bandwidth-boosted versions of the veteran H100, with 141GB of HBM3e and 4.8TB/s of memory bandwidth. These factors should make the H200 a popular choice for large language model (LLM) inference, where performance is largely limited by memory bandwidth and capacity.

However, the H200 also poses a potential problem for Nvidia's upcoming B200A. If Nvidia cuts the original B200 in half, the resulting part would have 96GB of memory capacity and 4TB/s of memory bandwidth, both lower than the H200's figures.

The performance gains from the B200A are likely to be modest, as the top-spec part only has 2.5x the 8-bit floating point performance of its Hopper counterpart. If that were halved, it would probably only be a 25% improvement. Of course, if Nvidia keeps the 1,000W power target for the B200, the performance gains could be higher, depending on how much higher they can push the clock frequencies.
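The halving argument can be worked through in a few lines. This is only a sketch of the article's back-of-the-envelope reasoning, not a specification of any real part: the B200 figures are inferred by doubling the halved numbers quoted above, the Hopper baseline is normalized to 1.0 for 8-bit floating point performance, and the `halve` and `percent_change` helpers are hypothetical names introduced for illustration.

```python
# H200 figures as quoted in the article; FP8 performance is normalized
# so that the Hopper baseline equals 1.0 (an assumption for illustration).
H200 = {"hbm_gb": 141, "bandwidth_tbs": 4.8, "fp8_relative": 1.0}

# Assumption: full B200 figures inferred by doubling the article's
# halved numbers (96GB, 4TB/s) and using its 2.5x FP8 claim.
B200 = {"hbm_gb": 192, "bandwidth_tbs": 8.0, "fp8_relative": 2.5}


def halve(spec):
    """Naive model of a half-size part: every figure is cut in half."""
    return {key: value / 2 for key, value in spec.items()}


def percent_change(new, old):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


b200a = halve(B200)

for key in H200:
    change = percent_change(b200a[key], H200[key])
    print(f"{key}: {b200a[key]} vs {H200[key]} ({change:+.0f}%)")
```

Under this naive model, the half-size part trails the H200 on both memory capacity and bandwidth while leading by about 25% on FP8 throughput, which is the tension the article describes. Higher clocks or a higher power budget would shift the last figure upward.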

That being said, if Nvidia did run into production challenges and now has a bunch of Blackwell chips that it can't stitch together, scaled-down versions would be a very easy way to salvage existing stock, especially if they could be sold at a lower cost.

Previous rumor: Nvidia delays Blackwell GPU due to packaging issues

As reported by The Information, GPU giant Nvidia recently informed Microsoft that the release of the most advanced model in the Blackwell family will be delayed. We have reached out to Nvidia for confirmation.

The issue could mean delays of three months or more in volume shipments of chips like the Blackwell B200, disrupting plans for customers like Microsoft and Meta, which have reportedly ordered billions of dollars worth of new GPUs to power their AI services.

It also means Nvidia may have to cancel or delay certain products in order to focus available silicon supply on products it considers the highest priority.

According to a report by semiconductor research firm SemiAnalysis, the main reason for the GPU shipment delay is related to Nvidia's physical design of the Blackwell series. Specifically, Blackwell is the first mass-produced design to use the CoWoS-L packaging technology from TSMC, Nvidia's chip manufacturer.

CoWoS (Chip-on-Wafer-on-Substrate) is a method of building more complex and advanced products from interconnected chips, typically a system-on-chip (SoC) and one or more high-bandwidth memory (HBM) chips.

However, the level of complexity of CoWoS-L is completely different from CoWoS-S, in which the chips are mounted on a relatively simple silicon interposer.

CoWoS-L uses an organic interposer as a redistribution layer (RDL) to route signals between the chips on top, leveraging local silicon interconnects (LSIs) and bridge chips embedded in the interposer.

SemiAnalysis said that in order to scale CoWoS packaging to a size larger than the AMD MI300 GPU, an organic interposer is needed because silicon is brittle and handling very thin silicon interposers becomes more difficult as the interposer gets larger. LSI and bridge chips help compensate for the poor electrical performance of organic interposers.

Analysts say the technology has nonetheless seen some issues. One of them is that embedding multiple silicon bridges in the interposer can lead to thermal expansion mismatches between the silicon dies, silicon bridges, organic interposer and substrate, causing the package to warp, which can break the connections.

According to the SemiAnalysis report, though, the main reason for the delay is the bridge chip, which is believed to need a redesign, along with redesigns of the top few global routing metal layers and the bumps of the Blackwell chip itself.

Furthermore, as has been reported many times, TSMC does not have enough CoWoS packaging capacity to meet demand. SemiAnalysis says the problem is that TSMC has built up CoWoS-S capacity over the past few years, mainly to serve Nvidia, but now the GPU maker is shifting its products to CoWoS-L.

While TSMC is building new CoWoS-L production fabs, the semiconductor contract manufacturer urgently needs to convert its older CoWoS-S capacity to meet demand.

Nvidia, meanwhile, has to choose how to use the supply TSMC provides. As a result, SemiAnalysis said it sees the company focusing almost entirely on the GB200 NVL36/72 rack-scale systems, with the HGX form factor for the B100 and B200 "effectively canceled now, except for some initial lower volumes."

To meet demand, Nvidia will also bring to market a Blackwell GPU called the B200A, which is based on the B102 chip that will also be used in Nvidia's "China-exclusive" B20 GPU. According to SemiAnalysis, the B102 is a monolithic chip with four HBM stacks, allowing it to be packaged on CoWoS-S instead of CoWoS-L.

None of this is likely to hurt Nvidia too much. Financial news site Barron’s says the GPU giant could see billions of dollars in revenue slip from late 2024 into early 2025, but customers still can't get all the Hopper chips they want, so the company may simply make more of those.

However, Nvidia may face more problems with the B20. According to the South China Morning Post, Washington is considering further tightening export restrictions to prevent the new GPU from being sold in its target market, China.

Late last year, U.S. Commerce Secretary Gina Raimondo warned that the United States must continue to tighten restrictions to prevent its export controls on artificial intelligence chips from being circumvented.

“If you redesign a chip around a particular cut line that enables them to do AI, I’m going to control it the very next day,” she said at the time.

An Nvidia spokesperson did not deny the reports, but told The Reg: "As we have said before, demand for Hopper is very strong, broad sampling of Blackwell has begun, and production is expected to ramp in the second half of the year. Beyond that, we do not comment on rumors."

We note that in March, Nvidia told us that Blackwell processors would start shipping in the second half of this year, though it was vague about the timeline, and it remains so. Ramping production "as planned" this year could still mean the chips reach customers later than the industry expects, in 2025, as the aforementioned reports claim.

In short, Blackwell is likely to be delayed as rumored, but on the other hand, Nvidia hasn't disclosed when this silicon will be available.

Reference Links

https://www.theregister.com/2024/08/14/nvidia_foxconn_blackwell/

*Disclaimer: This article is the original work of its author and reflects the author's personal views. Semiconductor Industry Observer reprints it only to share a different point of view, which does not mean that Semiconductor Industry Observer agrees with or endorses it. If you have any objections, please contact Semiconductor Industry Observer.
