How do you put the world's largest chip to work? The Hot Chips excitement continues!
In yesterday's report, "Chip Giants Compete at Hot Chips," we introduced some of the processor giants presenting at Hot Chips. As the conference entered its third day, these leading companies continued to show off more. The presentation that drew the most attention first was from Cerebras, builder of the world's largest chip, on its clusters.
Among AI startups, Cerebras has for years been our frontrunner to make it to the next stage. Now it appears to be breaking away from the pack of startups by scaling its giant wafer-scale engines to AI-supercomputer scale. At Hot Chips 2023, the company detailed the new clusters it plans to use to dwarf what NVIDIA is building.
Cerebras wafer-scale cluster details
Cerebras started its presentation with an update on the company and on how AI/ML models keep getting bigger (roughly 40,000x over five years). It also covered some history of ML acceleration.
Over time, process technology has come a long way.
Architectural techniques, such as moving computation from FP32 to bfloat16, INT8, or other formats, also bring huge benefits.
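As a quick back-of-the-envelope illustration of why formats matter (our own example figures, not numbers from the talk), here is how raw weight storage shrinks as the format narrows:

```python
# Rough illustration of why number formats matter: the same parameter count
# shrinks as you move from FP32 to bfloat16 or INT8. The 175B parameter
# count is an arbitrary example, not a figure from the talk.
BYTES_PER_PARAM = {"FP32": 4, "bfloat16": 2, "INT8": 1}

def weights_gb(num_params: float, fmt: str) -> float:
    """Raw weight storage in gigabytes for a given format."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9

for fmt in BYTES_PER_PARAM:
    print(f"{fmt:>9}: {weights_gb(175e9, fmt):,.0f} GB")
```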
Still, which models are practical depends not only on gains at the chip level, but also on what can be achieved at the cluster level.
Some of the current challenges with scaling out are simply the communication requirements to keep data moving to smaller compute and memory nodes.
Cerebras built a giant chip to get orders of magnitude improvements, but it also needed to scale to clusters because one chip wasn't enough.
Traditional horizontal scaling faces challenges as it attempts to spread problems, data, and computation across so many devices.
On GPUs, this means using different types of parallelism to scale to more compute and memory devices.
Cerebras wants to scale cluster-level memory and cluster-level compute independently, decoupling the compute and memory scaling that are tied together on GPUs.
The Cerebras WSE-2 has 850,000 cores.
Cerebras puts the WSE-2 into the CS-2 system and connects it to MemoryX, which can then stream data onto the big chip.
It then has the SwarmX interconnect, which handles data-parallel scaling.
Weights are never stored on the wafer; they just stream in.
The SwarmX fabric broadcasts the weights out and reduces the gradients on the way back.
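To make that flow concrete, here is a toy, runnable sketch of the weight-streaming pattern as we understand it: weights stay off-wafer, are broadcast one layer at a time, and the data-parallel gradients are summed on the way back. The arrays and the "forward/backward" math are stand-ins, not anything from Cerebras.

```python
import numpy as np

# Toy sketch of weight streaming: weights live off-wafer ("MemoryX"), are
# broadcast to every CS-2 one layer at a time ("SwarmX"), and the
# data-parallel gradients are summed on the return path. Purely illustrative.
rng = np.random.default_rng(0)
layers = [rng.standard_normal((64, 64)) for _ in range(4)]   # "MemoryX" state
shards = [rng.standard_normal((8, 64)) for _ in range(3)]    # one data shard per CS-2
lr = 1e-3

for i, w in enumerate(layers):
    # SwarmX broadcast: every CS-2 works on the same streamed-in weights.
    grads = []
    for x in shards:
        y = x @ w                          # stand-in for forward compute
        grads.append(x.T @ y)              # stand-in for the weight gradient
    layer_grad = np.sum(grads, axis=0)     # SwarmX reduce on the way back
    layers[i] = w - lr * layer_grad        # optimizer step back in "MemoryX"
```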
There are 12 MemoryX nodes per MemoryX unit. State is held in DRAM and flash, with up to 1TB of DRAM and 500TB of flash. Interestingly, the host is only a 32-core CPU.
Finally, it connects to the cluster over 100GbE: one port goes to the CS-2 and the other connects to the rest of the MemoryX modules.
MemoryX has to shard the weights thoughtfully for this to work. Ordering the stream makes transposition nearly free.
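Here is a minimal sketch of that idea: rather than materializing a transposed copy of the weights, the streamer simply changes the order in which it reads them out. The matrix and shapes are invented for the example.

```python
import numpy as np

# Minimal illustration of the "nearly free transpose" idea: change the read
# order of the stream instead of building a transposed copy.
w = np.arange(12).reshape(3, 4)

def stream(weights, transposed=False):
    """Yield the matrix one vector at a time, in the requested orientation."""
    view = weights.T if transposed else weights   # .T is just a view, no copy
    for vec in view:
        yield vec

assert np.array_equal(np.stack(list(stream(w))), w)
assert np.array_equal(np.stack(list(stream(w, transposed=True))), w.T)
```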
In MemoryX, there is a high-performance runtime to transfer data and perform calculations.
The SwarmX fabric uses 100GbE and RoCE RDMA to provide connectivity and to perform the broadcasts and reductions on CPUs.
Each broadcast/reduce unit has 12 nodes, each with six 100GbE links. Five of these are used for the 1:4 broadcast plus a redundant link, which adds up to 150Tbps of broadcast/reduce bandwidth.
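Taking the stated 1:4 fan-out at face value, here is a small back-of-the-envelope sketch (ours, not Cerebras math) of how a broadcast/reduce tree grows with the number of CS-2 systems:

```python
# Back-of-the-envelope sketch using the 1:4 fan-out mentioned above: how many
# tree levels and interior broadcast/reduce nodes it takes to fan a weight
# stream out to N CS-2 systems. Purely illustrative.
def broadcast_tree(num_cs2: int, fanout: int = 4):
    """Levels and interior nodes of a fanout-ary broadcast/reduce tree."""
    levels, reach = 0, 1
    while reach < num_cs2:
        levels += 1
        reach *= fanout
    interior = sum(fanout ** i for i in range(levels))
    return levels, interior

for n in (16, 64):
    levels, nodes = broadcast_tree(n)
    print(f"{n} CS-2s: {levels} levels, {nodes} broadcast/reduce nodes")
```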
100GbE is interesting because it is a very commoditized interconnect now compared to NVLink/NVSwitch and InfiniBand.
Cerebras is doing these operations on CS-2/WSE, which helps achieve this scale.
This is the SwarmX topology.
The flexibility of the fabric can be used to efficiently configure work across clusters while supporting sub-cluster partitioning.
Cerebras uses 16 CS-2s in Andromeda to train large models quickly. It then went even bigger with the Condor Galaxy-1 wafer-scale cluster, which it used to train BTLM, currently a leading 3B-parameter model.
Next, Cerebras is scaling to larger clusters.
Google’s crazy optically reconfigurable artificial intelligence network
At Hot Chips 2023, Google showed off its crazy optically reconfigurable AI network. The company uses optical circuit switching to get better performance, lower power consumption, and more flexibility in its AI training clusters. What is even more remarkable is that it has been doing this for years.
According to Google, the main goal is to link its TPU chips together.
This is the 7nm Google TPUv4. The TPU v4i is the inference version, but the discussion here focuses on the TPUv4.
Google says it provisions power well above typical draw so it can meet a 5ms service-time SLA. The chip's TDP is therefore much higher than its typical power, but that headroom exists so it can burst to meet the SLA.
This is the TPUv4 architecture diagram. Google built these TPU chips not just to be single accelerators, but to scale out and run as part of a larger infrastructure.
This is Google's TPUv4 vs. TPUv3 statistics, and it's one of the clearest tables we've ever seen.
Google more than doubled peak FLOPS but reduced power consumption between TPUv3 and TPUv4.
Google built a SparseCore accelerator into the TPUv4.
This is Google's TPUv4 SparseCore performance.
The board itself has four TPUv4 chips and is liquid cooled. Google said they had to redesign their data centers and operations to switch to liquid cooling, but the power savings were worth it. The valve on the right controls flow through the liquid cooling tube. Google says it's like a fan speed controller, but for liquids.
Google also said these go back to the hosts over PCIe Gen3 x16, since this is a 2020-era design.
Like many data centers, power comes in at the top of Google's racks, but there is also a lot of interconnect. Inside the rack, Google can use electrical DACs; outside the rack, it needs to use optics.
Each system has 64 racks with 4,096 interconnected chips. For some context, NVIDIA's 256-node AI clusters have only half as many GPUs (2,048).
At the end of the racks we can also see a CDU rack. Each rack is a 4x4x4 cube (64 nodes) with optical circuit switching (OCS) between the TPUs. Within the rack the connections are DACs; the faces of the cube are all optical.
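To picture the wiring, here is an illustrative sketch (our reading of the slide, not Google documentation) that classifies each chip's six neighbor links in a 4x4x4 cube: in-cube links stay on copper DACs, while links that cross a cube face leave the rack optically through the OCS.

```python
from itertools import product

# Classify each chip's six neighbor links in a 4x4x4 cube: links that stay
# inside the cube use electrical DACs, links that cross a cube face go out
# optically through the OCS (which decides which other cube they reach).
DIM = 4

def classify_links(x, y, z):
    for axis in range(3):
        for step in (-1, 1):
            coord = [x, y, z]
            coord[axis] += step
            if 0 <= coord[axis] < DIM:
                yield tuple(coord), "DAC inside the rack"
            else:
                yield "cube face", "optical, routed by the OCS"

# A corner chip has three copper neighbors and three face links going optical.
for target, kind in classify_links(0, 0, 0):
    print(target, kind)

# Across the whole cube, count the links that leave through the faces:
# 6 faces x 16 chips per face = 96 optical ports.
optical = sum(kind.startswith("optical")
              for x, y, z in product(range(DIM), repeat=3)
              for _, kind in classify_links(x, y, z))
print(optical, "face crossings")
```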
Here is an introduction to the OCS. With OCS there are no electrical switches; chips are connected to one another directly. Google has in-house 2D MEMS arrays, lenses, cameras, and so on. Avoiding all of the network overhead lets data be shared more efficiently. As an aside, this feels a bit like DLP TVs in some ways.
Google says it has more than 16,000 connections and enough fiber optic distance within its super pod to circle the state of Rhode Island.
Since there is so much point-to-point communication, a large number of fiber optic bundles are needed.
Beyond this, each pod can be connected into a larger pod.
OCS can improve node utilization because it is reconfigurable.
Google can then change the topology by adjusting the optical routing.
Google demonstrates the benefits of different topologies here.
This is important because Google says changes in model requirements can drive changes to the system.
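As an illustration of what "reconfigurable" buys (our own example, using a simple hop-count diameter as the figure of merit), the same pool of 4x4x4 building blocks can be wired into differently shaped tori depending on the job:

```python
from itertools import product

# Illustrative only: enumerate 3D torus shapes (in multiples of the 4x4x4
# building block) for a requested chip count and compare hop-count diameters.
# The chip count and metric are our example; the point from the talk is that
# the OCS turns the shape into a wiring configuration, not a physical rebuild.
def torus_shapes(num_chips, step=4, max_dim=16):
    dims = range(step, max_dim + 1, step)
    seen = set()
    for x, y, z in product(dims, repeat=3):
        shape = tuple(sorted((x, y, z)))
        if x * y * z == num_chips and shape not in seen:
            seen.add(shape)
            yield shape, sum(d // 2 for d in shape)   # torus hop diameter

for shape, diameter in torus_shapes(512):
    print(f"{shape}: diameter {diameter} hops")
```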
Here is Google's scaling, which is linear (shown on a log scale) up to 3,072 chips.
Google also increased the on-chip memory to 128MB to maintain local data access.
Here's how Google compares to the NVIDIA A100 on a performance-per-watt basis.
This is a PaLM model trained on 6144 TPUs in two Pods.
Intel demonstrates 8-core, 528-thread processor using silicon photonics technology
Intel showed off a cool piece of technology at Hot Chips 2023 that is not just another server chip. It features a direct mesh-to-mesh optical fabric. Also interesting: the 8-core processor runs 66 threads per core.
The key motivation behind this is the ultra-sparse data from the DARPA HIVE project.
When Intel analyzed the workloads DARPA was studying, it found they were massively parallel, yet with poor cache-line utilization, and features like big, long out-of-order pipelines go largely unused.
This is an interesting one: Intel has a processor with 66 threads per core and 8 cores per socket, for 528 threads. Caches are clearly not well utilized by this kind of workload. It is a RISC ISA, not x86.
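A rough latency-hiding model, which is our assumption rather than Intel's stated math, shows why such an extreme thread count makes sense for sparse, pointer-chasing workloads:

```python
# Rough latency-hiding model (our assumption, not figures from the talk):
# when a thread does a few cycles of useful work and then stalls on a long,
# cache-unfriendly memory access, a core needs roughly
# (memory_latency + compute) / compute threads in flight to stay busy.
def threads_to_hide_latency(memory_latency_cycles: int, compute_cycles: int) -> int:
    total = memory_latency_cycles + compute_cycles
    return -(-total // compute_cycles)   # ceiling division

# Example: ~5 cycles of work per ~320-cycle irregular access wants about
# 65 threads in flight, the same ballpark as 66 threads per core.
print(threads_to_hide_latency(320, 5))
```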
Intel packages these into 16 sockets in a single OCP compute sled and uses optical networking.
This is the die architecture. Each core has a multi-threaded pipeline.
The high-speed I/O connects the chip's electrical and optical functions.
This is the 10-port pass-through router in use.
This is the network-on-chip on which the routers sit. Half of the 16 routers are there just to provide more bandwidth to the high-speed I/O. EMIB is used for the physical connection layer.
Each chip uses silicon photonics to drive its optical network. That way, even when chips are not in the same chassis, cores can connect directly chip to chip without adding switches and network cards.
These chips are multi-chip packages using EMIB packaging. Silicon photonic engines add several other challenges, from packaging to fiber optic bundles.
This is optical performance.
In terms of power consumption, this was done on an 8-core 75W CPU. More than half of the power here is used by silicon photonics.
Below is the simulated to measured workload performance scaling.
Here is a photo of the actual chip and confirmation that this is done on TSMC 7nm.
The package and test board are shown below:
This was done in a 7nm process, and work is still ongoing in the lab.
Interestingly, Intel did not use the pluggable connector it showed off at Innovation 2022; it looks like this chip may have been built before that was ready. The optics were done with help from Ayar Labs.
Perhaps most importantly: 66 threads per core is a huge number. We suspect readers will enjoy that stat.
Lightelligence Hummingbird low-latency optical connection engine
At Hot Chips 2023, Lightelligence demonstrated its Hummingbird low-latency optical connectivity engine.
Lightelligence says AI compute demand far outpaces transistor scaling.
Lightelligence focuses on using optics to boost computing performance. Today, the company is talking about optical NoCs.
Creating domain-specific architectures to address specific number formats, mathematics and parallelism is one of the areas of performance improvement, the company said.
Another area of improved performance is adding more silicon to the package.
Interconnects using electrical signals are considered inefficient.
Here Lightelligence is talking about its optical network-on-chip (oNOC). The idea is to use light rather than electrical signals across the package to improve efficiency.
It also allows for different types of topologies, as optical waveguides can span longer distances.
It also scales better as the distance between chiplets grows. That matters because chiplets are cheaper to make, but they need connections across the package.
Hummingbird is an example of the oNOC in use. It has an FPGA, an external laser, and a third-party SiP packaged onto the card. The oNOC enables things like all-to-all broadcast.
Hummingbird's specialty is putting optical connectivity into the package using a photonic integrated circuit (PIC) alongside an electronic silicon chip.
The interposer delivers power to the chip. The PIC on the bottom is the optical component, with the EIC stacked on top. The SiP is a SIMD architecture with a custom ISA, built around eight-core clusters.
The biggest differentiator is the U-shaped layout of the optical broadcast network. With it, each core can update all the other cores without any waveguide crossings in the PIC.
This is the core microarchitecture, an AI DSA inference core. Each core receives data from every other core over the oNOC receive path, and each core can quantize and send its own data out over the oNOC transmit path.
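As a toy illustration of that all-to-all pattern (the core count and turn-taking schedule below are placeholders, not Hummingbird specifics), every core's update ends up at every other core after one pass over the broadcast network:

```python
# Toy sketch of the all-to-all broadcast pattern described above: every core
# puts its update on the shared optical broadcast network and every other
# core receives it. Core count and scheduling are placeholders.
NUM_CORES = 64   # placeholder count for illustration

outbox = {core: f"update_from_core_{core}" for core in range(NUM_CORES)}
inbox = {core: {} for core in range(NUM_CORES)}

for sender in range(NUM_CORES):            # each core takes a turn transmitting
    for receiver in range(NUM_CORES):      # ...and every other core receives it
        if receiver != sender:
            inbox[receiver][sender] = outbox[sender]

# After one pass over the broadcast network, every core has every update.
assert all(len(received) == NUM_CORES - 1 for received in inbox.values())
```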
Below are Hummingbird’s metrics.
This is a system with Hummingbird in a PCIe GPU chassis.
Below are the performance metrics.
Lightelligence says it can stitch together dies of different sizes. The focus seems to be on building something beyond a reticle-limited chip like the NVIDIA H100.
Here is a recap of the solution:
Utilizing 3D packaging will make this technology even more interesting. The next question, of course, is whether other companies will pursue this kind of chip-to-chip communication, or whether other types of technology will take over. It's an interesting technology, but a lot of the big vendors are also looking at optics.
SiFive P870 RISC-V processor unveiled
SiFive has been a major player in RISC-V for the past few years. At Hot Chips 2023, the company detailed the SiFive P870 processor.
There are standards that new RISC-V CPUs adhere to, and this is a big part of SiFive's messaging.
SiFive launched its first out-of-order core, the P550, in 2022. The company now has the P650/P670 and P450/P470, and now there are the P870 and P870-A, where the "A" stands for automotive.
SiFive is now building larger, more complex cores. One difference is that this design uses a shared L2 cache, whereas many Arm CPUs today offer a dedicated L2 cache for cloud workloads.
This is the pipeline.
This is the microarchitecture of the chip. This is more of an instruction flow chart.
This is the beginning of a walkthrough that starts at the top.
Here SiFive covers instruction fusion; the ROB size of 1,120 is the outlier figure. From the sound of it, that number counts fused instruction bundles (so it may be comparable to around 280 entries in other architectures).
The vector sequencer is presented as a feature unique to RISC-V.
Please note at this point that this is more complex than SiFive's old solution.
There are more microarchitectural details here for those who want to read further.
Below are the load/store specifications.
The L2 cache is non-inclusive, but not exclusive.
This is what the cluster topology looks like. There is a 16-cycle latency from the L1 to the larger L2, which is designed for data sharing between cores within a cluster. The example shown is a 32-core chip built from eight 4-core clusters.
This is a consumer-style topology: two P870 high-performance cores, four smaller and more efficient P470 cores in the cluster, and then a low-power, always-on E6 in-order core.
Here is a slide on the new SiFive P870-A's automotive safety features. The bigger focus here is on fault detection, reliability, and safety. Examples of the differences include parity on the register file and ECC on the caches.
SiFive has many different types of IP. Here is the complete list.
The next generation we will hear about is the Napa core.