Chiplet stacking: Intel and AMD's different approaches
Source: compiled from The Register.
Shortly after AMD launched its first-generation Epyc processors, code-named Naples, in 2017, Intel quipped that its rival had been reduced to gluing a bunch of desktop chips together in order to stay relevant.
Unfortunately for Intel, the jibe didn't age well, as just a few years later the x86 giant was shopping for its own glue.
Intel's Xeon 6 processors, which began rolling out in phases this year, are its third generation of multi-chip Xeon processors and its first datacenter chips to use a heterogeneous chiplet architecture similar to AMD's own.
While Intel eventually recognized the wisdom of AMD's chiplet strategy, its approach was radically different.
Breaking through the reticle limit
Why so many CPU designs are moving away from monolithic architectures boils down to two factors: reticle limits and yields.
Generally speaking, without major improvements in process technology, more cores mean more silicon. But there is a practical limit to how big a single chip can actually be, known as the reticle limit, of roughly 800 square millimeters. Once you hit that ceiling, the only way to keep scaling compute is to use more dies.
We're now seeing this technique in many products (not just CPUs) that cram two large dies into a single package. Intel's Gaudi 3 and Emerald Rapids Xeons and Nvidia's Blackwell are just a few examples.
The problem with multiple chips is that the bridges between them are often bottlenecks in terms of bandwidth and can introduce additional latency. This is usually not as bad as spreading the workload across multiple sockets, but it is one reason some chip designers prefer to use fewer, larger chips to scale computation.
However, larger dies are genuinely expensive to make: the bigger the die, the more likely a defect lands on it, and the lower the yield. That makes using lots of smaller dies an attractive proposition, and explains why AMD's designs use so many of them, as many as 17 in the latest Epycs.
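To put numbers on that intuition, here is a minimal sketch using the classic Poisson die-yield model, Y = exp(-A * D0). The defect density below is an illustrative assumption, not a figure from TSMC, Intel, or AMD.

```python
import math

def die_yield(area_mm2: float, d0_per_cm2: float) -> float:
    """Poisson yield model: probability a die of the given area is defect-free."""
    return math.exp(-(area_mm2 / 100.0) * d0_per_cm2)  # 100 mm^2 = 1 cm^2

D0 = 0.1  # assumed defect density in defects per cm^2 (illustrative only)

# One near-reticle-limit monolithic die vs. a small chiplet
print(f"800 mm^2 monolithic die: {die_yield(800, D0):.0%} yield")  # ~45%
print(f" 80 mm^2 chiplet:        {die_yield(80, D0):.0%} yield")   # ~92%
```

Even at this modest assumed defect density, more than half the reticle-sized dies come back bad, while each small die can be tested and binned before packaging. That is the economics driving AMD's chiplet count.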
With those basics out of the way, let’s take a deeper look at the different design philosophies behind Intel and AMD’s latest Xeons and Epyc processors.
AMD's approach
We'll start with AMD's fifth-generation Epyc Turin processor. Specifically, we're looking at the 128-core Zen 5 version of the chip, which features 16 4nm core complex dies (CCDs) surrounding a single I/O die (IOD) built on TSMC's 6nm process technology.
AMD's latest Epycs come with up to 16 compute chips
If this sounds familiar, it's because AMD has used the same basic formula since its second-generation Epyc processors. The first-generation Epyc, for reference, lacked a dedicated I/O die; each of its four dies carried its own memory controllers and I/O.
As we mentioned earlier, using lots of smaller compute dies means AMD can achieve higher yields, and it also lets the company share silicon between its Ryzen and Epyc processors.
If these dies look familiar, that's because AMD's Epyc and Ryzen processors share the same compute die.
Furthermore, with eight-core or sixteen-core CCDs, each carrying 32MB of L3 cache, AMD gains additional flexibility in how core counts scale relative to cache and memory.
For example, if you want a 16-core Epyc (a common SKU for HPC workloads thanks to per-core licensing), the most obvious way is to use two eight-core CCDs with 64MB of L3 between them. But you could also use 16 CCDs with only one core active apiece, giving those same 16 cores 512MB of onboard cache. That may sound crazy, but both chips exist.
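Here is a quick sketch of the arithmetic behind those two configurations, using the figures above (32MB of L3 per CCD, with the active core count per CCD as the variable):

```python
def epyc_config(ccds: int, cores_per_ccd: int, l3_per_ccd_mb: int = 32):
    """L3 scales with CCD count, not core count: each CCD brings its own 32MB."""
    cores = ccds * cores_per_ccd
    l3 = ccds * l3_per_ccd_mb
    return cores, l3, l3 / cores

for ccds, active in ((2, 8), (16, 1)):  # the two 16-core layouts described above
    cores, l3, per_core = epyc_config(ccds, active)
    print(f"{ccds:2} CCDs x {active} active cores: "
          f"{cores} cores, {l3:3}MB L3 ({per_core:.0f}MB per core)")
```

The dense layout yields 16 cores with 4MB of L3 per core; the sparse layout yields the same 16 cores with 32MB per core, which is exactly why the cache-heavy SKU exists for HPC buyers.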
AMD's fifth-generation Epycs follow the familiar pattern of 16 compute chips surrounding a central I/O die.
The I/O die, meanwhile, is responsible for almost everything other than compute, including memory, security, PCIe, CXL, and other I/O (such as SATA), and it serves as the communications backbone between the CCDs and other sockets.
Here's a closer look at AMD's Epyc Turin I/O die.
Placing the memory controllers on the I/O die has its pros and cons. On the plus side, memory bandwidth scales largely independently of core count. The downside is that memory and cache access latencies may be higher for certain workloads. We stress "may" because this kind of thing is highly workload dependent.
Intel's Xeon chiplet journey
Intel's approach to multi-die silicon, meanwhile, is very different from AMD's. While modern Xeon processors use a heterogeneous architecture with separate compute and I/O dies, this wasn't always the case.
Intel's first multi-die Xeon, code-named Sapphire Rapids, came either as a single medium-core-count die or as four extreme-core-count dies, each with its own memory controllers and onboard I/O. Emerald Rapids followed a similar pattern, but opted for two larger dies for its higher-core-count SKUs.
As you can see, between Sapphire and Emerald Rapids Intel switched from four mid-sized dies to a pair of dies approaching the reticle limit.
All that changed with the introduction of Xeon 6, when Intel moved the I/O, UPI links, and accelerators onto a pair of dies built on the Intel 7 process node, which flank between one and three compute dies in the center built on Intel 3.
For reasons we'll get to later, we'll focus primarily on Intel's more mainstream Granite Rapids Xeon 6 processors rather than its many-core Sierra Forest parts.
Looking at Intel's compute dies, we see the first major difference from AMD. Each compute die carries 43 cores, which are fused on or off depending on the SKU. This means Intel needs far fewer dies than AMD to reach 128 cores, but yields are likely lower given the larger die area.
Depending on the SKU, Granite Rapids uses one to three compute chips sandwiched between a pair of I/O chips.
Beyond adding cores, Intel chose to put the memory controllers for these chips on the compute dies, with each die supporting four channels. In theory this should reduce access latency, but it also means that if you want all 12 memory channels, all three dies have to be present, as the sketch below shows.
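A small sketch of how cores and channels compose on Granite Rapids, assuming the per-die figures above (43 cores and four memory channels per compute die); actual enabled core counts vary by SKU:

```python
CORES_PER_DIE = 43     # per the figure above; SKUs fuse cores off from this
CHANNELS_PER_DIE = 4   # memory controllers live on the compute dies, not the I/O dies

for dies in (1, 2, 3):
    print(f"{dies} compute die(s): up to {dies * CORES_PER_DIE} cores, "
          f"{dies * CHANNELS_PER_DIE} memory channels")
# 3 dies: up to 129 cores (a 128-core SKU fuses one off) and the full 12 channels
```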
For the 6900P-series parts we looked at last month, this wasn't a concern, since every SKU comes with three compute dies. It does mean, however, that the 72-core version uses only a fraction of the silicon in the package, much like the 16-core HPC-centric Epyc discussed earlier.
Intel's 6700P-series parts, due early next year, will feature one or two compute dies depending on the memory bandwidth and core count required. That caps memory channels at eight at the high end, and possibly as few as four in configurations with a single compute die on board. We don't yet know the memory configuration of the HCC and LCC dies, so it's possible Intel has beefed up the memory controllers on those parts.
Like AMD's Epyc, Intel's Xeon now uses a heterogeneous chip architecture with compute and I/O chips.
Intel's I/O dies are also fairly slim, containing a mix of PCIe, CXL, and UPI links for communicating with storage, peripherals, and other sockets. Alongside these sit a number of accelerators for data streaming (DSA), in-memory analytics (IAA), encryption/decryption (QAT), and load balancing.
We're told that part of the reason for putting accelerators on I/O chips is to get them closer to the data going in and out of the chip.
Where do we go next?
Intel's next-generation many-core processor, code-named Clearwater Forest, is expected to launch in the first half of next year. On the surface, it follows the Granite Rapids template, with two I/O dies and three compute dies.
It might look like a scaled-down Granite Rapids, but those apparent compute dies are just structural silicon hiding more chiplets.
However, appearances can be deceiving. As far as we can tell, the three compute dies are actually structural silicon concealing many smaller compute chiplets, which themselves sit atop an active silicon interposer.
According to renderings Intel showed earlier this year, Clearwater Forest packs up to 12 compute chiplets per package. Silicon interposers are by no means new, and they offer real benefits, including higher die-to-die bandwidth and lower latency than is typical of organic substrates. It's a very different approach from the pair of 144-core compute dies in Intel's highest-core-count Sierra Forest parts.
If Intel's renderings released earlier this year are anything to go by, Clearwater Forest has far more chips hidden than Granite Rapids.
Of course, renderings of the technology destined for Clearwater Forest are no guarantee we'll get exactly the same thing when the chip arrives next year.
Perhaps the bigger question is where AMD will take its chiplet architecture next. Looking at AMD's 128-core Turin processor, there isn't a lot of room on the package for more silicon, but House of Zen still has some options.
First, AMD could opt for a larger package to make room for extra dies. Alternatively, the chipmaker could pack more cores into each CCD. We suspect, however, that AMD's sixth-generation Epycs may end up looking more like its Instinct MI300-series accelerators.
The MI300A combines 24 Zen 4 cores, 6 CDNA 3 GPU chips and 128GB of HBM3 memory into a single package, designed to meet the demands of HPC workloads
You may recall that the MI300X GPU launched alongside an APU variant, the MI300A, which swaps two of the chip's CDNA 3 GPU dies for three CCDs carrying 24 Zen 4 cores between them. These compute dies are stacked atop four I/O dies and flanked by eight stacks of HBM3 memory.
Now, this is just speculation, but it’s not hard to imagine AMD doing something similar and swapping out all the memory and GPU chips for additional CCDs. Such a design would likely also benefit from higher bandwidth and lower latency for inter-chip communication.
Whether this will actually come to fruition, only time will tell. We expect AMD's 6th Gen Epycs to be available by the end of 2026.
Original link: https://www.theregister.com/2024/10/24/intel_amd_packaging/