Tesla's Dojo chip is an order of magnitude ahead of its competitors

Latest update: 2021-09-01 14:01

Source: compiled by Semiconductor Industry Observer (ID: icbank) from SemiAnalysis. Thank you.


Tesla just held its AI Day and revealed the inner workings of its software and hardware infrastructure. Among the disclosures was the previously announced Dojo AI training chip. Tesla claims that its D1 Dojo chip offers GPU-level compute, CPU-level flexibility, and network-switch-level IO.

A few weeks ago, we speculated that the package for this system was TSMC's Integrated Fan-Out System on Wafer (InFO_SoW). We explained the benefits of this type of packaging and the cooling and power delivery involved in scaling training chips to this level, and we estimated that the package would outperform Nvidia's systems. All of that was educated guesswork; today, we can dig into the semiconductor details.


Before we dive into the hardware, let's talk about the evaluation infrastructure. Tesla is constantly retraining and improving its neural networks, and every code change is evaluated to see whether it brings an improvement. Thousands of the same chips that ship in cars are deployed in servers for this purpose, running millions of evaluations per week.


Tesla has been scaling up its GPU clusters for years. If Tesla stopped all real workloads, ran Linpack, and submitted the result to the Top500 list, its current training cluster would rank as the fifth-largest supercomputer. However, this level of performance is not enough for Tesla and its ambitions, so it started developing the Dojo chip project a few years ago. Tesla needs higher performance to train larger and more complex neural networks in an energy-efficient and cost-effective way.


Tesla's architectural solution is a distributed compute architecture. As they walked through the details, the architecture looked a lot like Cerebras, whose wafer-scale engine and architecture we have analyzed before. Every AI training architecture is arranged this way at a high level, but the details of the compute elements, networks, and fabrics vary greatly. The biggest problem with these kinds of designs is scaling bandwidth while keeping latency low. Tesla is particularly focused on scaling the network, and that focus shows in every part of the design, from chip materials to packaging.


The functional unit is designed to be small enough to traverse in one clock cycle, yet large enough that synchronization overhead and software do not become a major issue. As a result, they arrived at a design very similar to Cerebras: a mesh of individual units connected by a high-speed fabric that routes communication between units within a single clock. Each unit has a large 1.25 MB SRAM scratchpad and multiple superscalar CPU cores with SIMD capabilities, as well as matrix multiplication units that support all common data types. In addition, Tesla introduced a new data type called CFP8, Configurable Floating Point 8. Each unit can deliver 1 TFlop of BF16/CFP8, 64 GFlops of FP32, and 512 GB/s of bandwidth in each direction.
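Tesla did not publish the exact bit layout of CFP8 at AI Day. As a rough illustration of what a "configurable" 8-bit float might look like, here is a minimal Python sketch assuming one sign bit plus a selectable exponent/mantissa split; the function, its defaults, and the encoding are hypothetical, not Tesla's definition.

```python
# Purely illustrative: Tesla has not published the CFP8 bit layout. This
# sketch shows the general idea of a "configurable" 8-bit float, where the
# split between exponent and mantissa bits (and the exponent bias) is a
# parameter rather than being fixed by the format. NaN/Inf handling omitted.

def decode_cfp8(byte, exp_bits=4, bias=None):
    """Decode one hypothetical CFP8 value: 1 sign bit, exp_bits exponent
    bits, and (7 - exp_bits) mantissa bits."""
    man_bits = 7 - exp_bits
    if bias is None:
        bias = (1 << (exp_bits - 1)) - 1           # IEEE-style default bias
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    if exp == 0:                                   # subnormal range
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1.0 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

# The same bit pattern decodes differently depending on the configuration:
print(decode_cfp8(0b01011010, exp_bits=4))   # 20.0  (more mantissa, finer steps)
print(decode_cfp8(0b01011010, exp_bits=5))   # 192.0 (more exponent, wider range)
```

The appeal of such a format for training is that different tensors (weights, gradients, activations) can trade precision against dynamic range without leaving an 8-bit container.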


The CPU is no slouch either: it is 4 wide, with a 2-wide vector pipeline, and each core can host 4 threads to maximize utilization. Unfortunately, Tesla uses a custom ISA instead of a leading open-source ISA such as RISC-V. The custom ISA introduces instructions for transpose, gather, broadcast, and link traversal.

A full chip with 354 of these functional units reaches 362 TFlops of BF16 or CFP8 and 22.6 TFlops of FP32. The die measures 645 mm² and contains 50 billion transistors, and each chip has an amazing 400 W TDP, which means the power density is higher than most configurations of the Nvidia A100 GPU. Interestingly, Tesla achieved an effective density of 77.5 million transistors per mm², second only to mobile chips and the Apple M1, and higher than all other high-performance chips.
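As a quick back-of-the-envelope check (my arithmetic, not Tesla's), the per-chip figures above are internally consistent:

```python
# Sanity check of the quoted D1 figures: 354 functional units, 645 mm²,
# 50 billion transistors, 362 TFlops BF16/CFP8, 1.25 MB SRAM per unit.
units_per_chip   = 354
chip_bf16_tflops = 362
die_area_mm2     = 645
transistors      = 50e9

print(chip_bf16_tflops / units_per_chip)    # ~1.02 TFlops per functional unit
print(transistors / die_area_mm2 / 1e6)     # ~77.5 M transistors per mm²
print(units_per_chip * 1.25)                # ~442.5 MB of SRAM scratchpad per chip
```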


Another interesting aspect of the basic functional unit is the NoC router. Its approach to scaling within and between chips is very similar to Tenstorrent's, so it is no surprise that Tesla adopted an architecture resembling that of another well-respected AI startup. The Tenstorrent architecture is great for scaling out training, and scale-out is exactly where Tesla is focusing.

On chip, Tesla quotes an amazing 10 TB/s of directional bandwidth, although this number doesn't mean much for real workloads. A huge advantage Tesla has over Tenstorrent is far higher chip-to-chip bandwidth: there are 576 SerDes lanes running at 112 GT/s, which yields a total of 64 Tb/s, or 8 TB/s, of off-chip bandwidth.

We're not sure where Tesla's figure of 4 TB/s per edge comes from; it more likely refers to the totals along the X and Y axes. Confusing slide aside, the bandwidth of this chip is insane. The highest-external-bandwidth chip we know of is a 32 Tb/s network switch chip, and Tesla was able to double that with a huge number of SerDes and advanced packaging.
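A rough reconstruction of those numbers, assuming roughly one bit per transfer per lane and an even split of lanes across the four die edges (both assumptions mine), reproduces the headline figure and suggests the 4 TB/s number is indeed per axis rather than per edge:

```python
# Back-of-the-envelope check of the chip-to-chip SerDes bandwidth.
lanes = 576
gt_per_s = 112                       # GT/s per lane, ~1 bit per transfer assumed

total_tbps = lanes * gt_per_s / 1000 # ≈ 64.5 Tb/s aggregate
total_TBps = total_tbps / 8          # ≈ 8 TB/s aggregate
per_edge   = total_TBps / 4          # ≈ 2 TB/s if split evenly over four edges
per_axis   = total_TBps / 2          # ≈ 4 TB/s per axis (two opposing edges)

print(total_tbps, total_TBps, per_edge, per_axis)
```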


Tesla connects the compute plane of the Dojo chip to interface processors that connect to the host system via PCIe 4.0. These interface processors also support higher radix network connections to supplement the existing compute plane mesh.


The 25 D1 chips are packaged with a "fan-out wafer process." Tesla didn't confirm that the package is TSMC's Integrated Fan-Out System on Wafer (InFO_SoW) as we speculated a few weeks ago, but given the crazy inter-die bandwidth and the specific mention of a fan-out wafer, it seems likely.

Tesla developed a proprietary high-bandwidth connector that preserves off-chip bandwidth between these tiles. Each training tile delivers an impressive 9 PFlops of BF16/CFP8 and 36 TB/s of off-tile bandwidth. This far exceeds Cerebras' off-wafer bandwidth, and it lets the Tesla system scale out better even than architectures designed primarily for scale-out (such as Tenstorrent's).
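Multiplying out the per-chip numbers reproduces the tile-level figures (again my arithmetic, not Tesla's, and consistent with the power discussion below):

```python
# Tile-level check: one training tile is 25 D1 dies.
D1_PER_TILE = 25
print(D1_PER_TILE * 362 / 1000)   # ≈ 9.05 PFlops BF16/CFP8 per tile
print(D1_PER_TILE * 400 / 1000)   # ≈ 10 kW of die power per tile
```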


The power delivery is unique, custom, and very impressive. With such enormous bandwidth and over 10 kW of power consumption, Tesla had to innovate, and it delivers power vertically: custom voltage-regulator modules are attached directly onto the fan-out wafer. Power, thermal, and mechanical structures all connect directly to the tile.


Even though the compute dies themselves draw only 10 kW, the total power of the tile is 15 kW; power delivery, IO, and the fan-out wafer wiring consume the rest. Power comes in from the bottom and heat comes out of the top. Tesla's unit of scale is not the individual chip but the 25-chip tile, and this tile far exceeds the per-unit performance and scaling capabilities of Nvidia, Graphcore, Cerebras, Groq, Tenstorrent, SambaNova, or any other AI training project.

All of this may sound like far-off technology, but Tesla claims it has already run the chip at 2 GHz on real AI networks in the lab.


The next step in scaling to thousands of chips happens at the cabinet level. Dojo scales up in 2 x 3 tile trays, with two trays per cabinet. For those of you counting at home, that's 12 tiles per cabinet: 108 PFlops, more than 100,000 functional units, 400,000 custom cores, and 132 GB of SRAM, which are staggering numbers.


Tesla keeps scaling beyond the cabinet level in its grid. There are no bandwidth cliffs between chips; it is a homogeneous grid of chips with enormous bandwidth throughout. They plan to scale up to 10 cabinets, reaching 1.1 Exaflops, 1,062,000 functional units, 4,248,000 cores, and 1.33 TB of SRAM.
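The cabinet and ExaPOD figures also fall straight out of the per-chip numbers. A small sketch of the scaling ladder (my arithmetic, with the tile/tray/cabinet groupings as described in the keynote):

```python
# Scaling ladder from one D1 die to the planned 10-cabinet ExaPOD, using
# only the per-chip figures quoted above.
UNITS_PER_D1        = 354
TFLOPS_PER_D1       = 362      # BF16 / CFP8
SRAM_MB_PER_UNIT    = 1.25
D1_PER_TILE         = 25
TILES_PER_CABINET   = 12       # two 2 x 3 trays
CABINETS_PER_EXAPOD = 10

for label, dies in [("tile", D1_PER_TILE),
                    ("cabinet", D1_PER_TILE * TILES_PER_CABINET),
                    ("ExaPOD", D1_PER_TILE * TILES_PER_CABINET * CABINETS_PER_EXAPOD)]:
    units = dies * UNITS_PER_D1
    print(label, dies, "dies |", units, "units |",
          dies * TFLOPS_PER_D1 / 1000, "PFlops |",
          units * SRAM_MB_PER_UNIT / 1000, "GB SRAM |",
          units * 4, "cores at 4 per unit")

# cabinet: 300 dies, 106,200 units, ~108.6 PFlops, ~133 GB SRAM
# ExaPOD:  3,000 dies, 1,062,000 units, ~1,086 PFlops (1.1 EFlops),
#          ~1,328 GB (1.33 TB) SRAM, 4,248,000 cores
```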


The software side is interesting, but we won't go into it too deeply today. Tesla claims it can partition the machine virtually, and that the software scales seamlessly across Dojo Processing Units (DPUs) regardless of cluster size. The Dojo compiler handles fine-grained parallelism and maps networks onto the hardware compute plane. It can do this through data, model, and graph parallelism, and it can also apply optimizations to reduce the memory footprint.

Model parallelism can scale across chip boundaries, even without large batches, easily unlocking next-level AI models with trillions of parameters or more. They don’t need to rely on handwritten code to run the model on this massive cluster.

Overall, the cost is comparable to Nvidia's GPUs, but Tesla claims 4x the performance, 1.3x better performance per watt, and 5x less area. That puts Tesla's claimed TCO advantage at almost an order of magnitude over Nvidia's AI solutions. If the claims are true, Tesla has surpassed everyone in the AI hardware and software space. I doubt it, but it's a hardware geek's dream.

We should all temper our excitement and wait to see when it is actually deployed to production.

