Tesla is facing setbacks on multiple fronts: it is laying off employees worldwide while its stock is caught up in a "Black Friday" rout.
On April 19, U.S. AI stocks experienced a tragic “Black Friday”.
Nvidia plunged 10%, shedding nearly $85 per share, its largest single-day percentage drop since March 16, 2020, and a record single-day loss in market value. Competitor AMD fell 5.4%, chip designer Arm dropped nearly 17%, and foundry leader TSMC fared slightly better, falling more than 3%.
As a customer of these chip suppliers, Tesla was not spared: it posted the largest weekly decline among them, down more than 14%, and on April 15 alone its market value evaporated by $30.433 billion (roughly RMB 220 billion).
Jordan Klein, an analyst at Mizuho Securities, described a "sector-wide pullback" in chips, one that had accelerated over the past week or so.
This is bad news for Tesla, which is increasing its investment in AI.
In September 2023, Morgan Stanley predicted that Dojo, the supercomputer Tesla uses to train AI models for self-driving, could give the electric-car maker an "asymmetric advantage" and add nearly $600 billion to its market value.
AI applications have exploded since the end of 2022, but the boom is now running into a pullback in institutional investment in AI. Meanwhile, chip development for Dojo, the supercomputer project Tesla has invested in heavily, has not gone as planned. Unable to go all in on either path, Musk has hedged his bets: he keeps developing his own chips while stockpiling Nvidia GPUs, a hoard said to be second only to Zuckerberg's Meta.
An industry insider told Automotive Business Review that he had been pessimistic about Musk's self-developed chips from the beginning.
"Dojo uses a server chip for large model training, which is different from the software running on the car. Secondly, it (Tesla) is not ready yet. It is not easy to manufacture chips and it takes time to accumulate. I think the best way is to buy ready-made chips like everyone else." The person said.
Self-built "dojo"
Tesla released Dojo on AI Day in 2021 and announced its self-developed chip D1.
This is the supercomputer Tesla uses to train AI models in the cloud. The name comes from the Japanese word for a martial-arts training hall, signifying a place where AI is trained.
Dojo is designed to be one of the fastest computers in the world, capable of processing massive amounts of video data to accelerate the learning and improvement of Tesla's Autopilot and Full Self-Driving (FSD) systems, and to provide computing support for Tesla's humanoid robot Optimus.
At the core of Dojo is the D1 neural-network training chip designed by Tesla, along with the training modules, system trays, and ExaPOD clusters built around it.
The D1 is manufactured on TSMC's 7 nm process. It integrates 50 billion transistors and contains 354 training nodes, each with its own processor core, cache, high-bandwidth memory, and high-speed interconnect. The chip's peak compute is 362 TFLOPS, with a bandwidth of 36 TB/s.
To scale up further, Tesla connects 25 D1 chips into a training module, each with a peak compute of 9 PFLOPS and a bandwidth of 900 GB/s.
The training modules are then assembled into high-density, high-performance system trays, each holding 10 modules along with the corresponding power, cooling, and networking hardware. A system tray delivers a peak of 90 PFLOPS and a bandwidth of 9 TB/s.
Finally, the system trays are combined into an ExaPOD cluster: each cluster consists of 10 trays installed in a cabinet, with a peak compute of up to 900 PFLOPS and a bandwidth of 90 TB/s.
The ExaPOD, Dojo's deployed form, comprises 3,000 D1 chips and delivers a claimed 1.1 EFLOPS of compute.
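The scaling arithmetic is easy to check. The sketch below is a rough back-of-the-envelope illustration using only the figures quoted above (the variable names are my own, not Tesla's); it multiplies the per-chip peak up through the module, tray, and cabinet levels.

```python
# Rough sketch: aggregate the peak-FLOPS figures quoted above for each level
# of the Dojo hierarchy. These are the article's claimed values, not measured
# or official Tesla numbers; real sustained performance will differ.

D1_TFLOPS = 362            # claimed peak compute of one D1 chip, in TFLOPS
CHIPS_PER_MODULE = 25      # D1 chips per training module
MODULES_PER_TRAY = 10      # training modules per system tray
TRAYS_PER_EXAPOD = 10      # system trays per ExaPOD cabinet

module_pflops = D1_TFLOPS * CHIPS_PER_MODULE / 1_000        # ~9 PFLOPS
tray_pflops = module_pflops * MODULES_PER_TRAY              # ~90 PFLOPS
exapod_pflops = tray_pflops * TRAYS_PER_EXAPOD              # ~900 PFLOPS
exapod_chips = CHIPS_PER_MODULE * MODULES_PER_TRAY * TRAYS_PER_EXAPOD  # 2,500

print(f"training module: {module_pflops:.1f} PFLOPS")
print(f"system tray:     {tray_pflops:.1f} PFLOPS")
print(f"ExaPOD cabinet:  {exapod_pflops:.1f} PFLOPS from {exapod_chips} chips")
# Note: Tesla's headline ExaPOD figure (3,000 D1 chips, 1.1 EFLOPS) is somewhat
# higher than what this per-cabinet breakdown multiplies out to.
```

As the final comment notes, the per-cabinet breakdown and the headline ExaPOD figure do not line up exactly, which is a good reminder that these are marketing-level numbers.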
According to Tesla's public disclosures, Dojo is built on the self-developed D1 chip and is intended to replace its Nvidia A100-based data center. As of September 2022, that data center housed 14,000 A100s, making it the seventh-largest of its kind in the world.
Tesla planned to ship roughly 40,000 to 50,000 D1 chips in fiscal 2023. The first ExaPOD went into operation in July 2023, and in the near term Tesla expects to deploy six ExaPODs at its Palo Alto data center, for a total of 7.7 EFLOPS. By the fourth quarter of this year, Dojo's computing target is 100 EFLOPS (roughly 91 clusters).
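The parenthetical "roughly 91 clusters" follows directly from the claimed per-ExaPOD figure; a quick check using only the numbers above:

```python
# How many ExaPOD-class cabinets would be needed to hit the 100 EFLOPS target,
# assuming the claimed 1.1 EFLOPS per ExaPOD. Illustrative arithmetic only.
import math

EXAPOD_EFLOPS = 1.1        # claimed compute per ExaPOD
TARGET_EFLOPS = 100        # Dojo's stated fourth-quarter target

print(math.ceil(TARGET_EFLOPS / EXAPOD_EFLOPS))   # 91
```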
On the second-quarter earnings call in late July 2023, Musk framed the in-house chip effort as a matter of necessity: "If Nvidia could give us enough GPUs, maybe we wouldn't need Dojo, but they can't meet our needs."
Then, at a critical point in the self-development effort, Ganesh Venkataramanan, head of the Dojo supercomputing project and Tesla's senior director of autonomous driving hardware, left in November 2023 and was replaced by former Apple executive Peter Bannon. Reports at the time suggested Venkataramanan was most likely pushed out because Dojo's second-generation chip fell short of expectations.
Venkataramanan had led Tesla's Dojo supercomputing project for five years. Before joining Tesla, he spent nearly 15 years at the American semiconductor company AMD.
His departure is widely read as a sign that Tesla's chip self-development effort has struggled, or at least has not gone as smoothly as hoped.
For Musk, the only workable course is to keep pushing in-house chip development while continuing to buy suitable chips on the market.
Vision and reality
A deep-learning researcher who posts under the name "whydoesthisitch" has long studied AI chips, and has analyzed why Musk's Dojo cannot rely on self-developed silicon.
He believes Dojo is likely still at a relatively early stage of development, and that even on an optimistic timeline it will lag Nvidia's performance by more than four years.
On March 20 this year, Nvidia dropped the Blackwell B200 bombshell: a next-generation data-center and AI GPU promising a huge generational leap in computing power.
The Blackwell lineup consists of three products: the B100, the B200, and the GB200 Grace-Blackwell superchip.
The new B200 GPU packs 208 billion transistors and delivers up to 20 petaflops of FP4 compute; the GB200 combines two B200 GPUs with a Grace CPU, delivering 30 times the performance on LLM inference workloads while greatly improving efficiency.
Nvidia says cost and power consumption are "up to 25x lower" than with the H100. Training a 1.8-trillion-parameter model that previously required 8,000 Hopper GPUs and 15 megawatts of power can now be done with 2,000 Blackwell GPUs drawing just 4 megawatts.
On the 175-billion-parameter GPT-3 benchmark, the GB200 delivers 7 times the performance of the H100, and Nvidia claims it trains 4 times faster.
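For clarity, the ratios implied by Nvidia's training example are simple to work out; the sketch below uses only the GPU counts and power figures quoted above (the "up to 25x" figure is a separate claim about inference cost and power, not this training comparison).

```python
# Illustrative check of Nvidia's quoted training example for a
# 1.8-trillion-parameter model: GPU count and facility power before (Hopper)
# and after (Blackwell), as relayed above. Marketing figures, not measurements.

hopper_gpus, hopper_megawatts = 8_000, 15
blackwell_gpus, blackwell_megawatts = 2_000, 4

print(f"GPUs:  {hopper_gpus / blackwell_gpus:.2f}x fewer")             # 4.00x
print(f"Power: {hopper_megawatts / blackwell_megawatts:.2f}x less")    # 3.75x
```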
"And Tesla did exaggerate the chips themselves and their development progress," "whydoesthisitch" believes that, for example, when Tesla promoted Dojo's breakthrough in exaflop computing power and Dojo became one of the most powerful computing centers in the world, Google's data center in Mayes County, Oklahoma, had installed 8 TPUv4 system Pods, which is providing Google Cloud Department with a total computing power of nearly 9 exaflops; Amazon's AWS uses Trainium chips to achieve a computing power of 6 exaflops and uses Nvidia's H100 GPU to achieve a computing power of 20 exaflops.
In his view, Dojo could justify replacing Nvidia only if it were cheap enough, and Tesla's scale of operations cannot support that kind of R&D investment.
On January 16, Rohan Patel, Tesla's since-departed vice president of public policy and business development, posted on X: "Had a back-and-forth with Elon Musk on Friday night regarding a large AI data center investment. He decided to approve the plan, which had been in the works for several months. It's hard to think of a CEO more involved in the most important details of the company."
Hans Nelson, a veteran technology blogger who has followed Musk and Tesla for years, commented afterward in Wired that Dojo is certainly meant to be an important part of these large AI data centers, yet Patel's post did not mention it. His reading is that the Dojo chip project is somewhat behind where Tesla hoped it would be, which may mean Dojo will rely more heavily on Nvidia chips in the short term.
The original plan called for Dojo to rank among the world's top five in computing power by February of this year and to reach a total of 100 exaflops by October, equivalent to the combined compute of 300,000 Nvidia A100s.
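For scale, the 300,000-A100 comparison roughly checks out if one assumes Nvidia's published dense BF16 tensor peak of about 312 TFLOPS per A100; that per-chip figure is my assumption, since the article does not say which metric it used.

```python
# Sanity check on the "equivalent to 300,000 A100s" comparison.
# ASSUMPTION: ~312 TFLOPS per A100 (Nvidia's published dense BF16 tensor peak);
# the article does not state which per-chip figure underlies the comparison.

A100_TFLOPS = 312
NUM_A100 = 300_000

total_eflops = A100_TFLOPS * NUM_A100 / 1_000_000
print(f"{total_eflops:.1f} EFLOPS")   # ~93.6 EFLOPS, close to the 100 EFLOPS target
```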
Nelson estimates Dojo's current computing power at around 33 exaflops. How it will get to 100 exaflops by October, and how much of today's capacity comes from self-developed chips versus Nvidia chips, is unknown. What is certain is that, whether or not Dojo hits its targets on schedule, Musk has stockpiled enough H100 GPUs.
(Image source: screenshot from Hans Nelson's online conversation video)
The H100 GPU outperforms the earlier A100, especially in AI training and inference. The H100 is built on Nvidia's Hopper architecture, the successor to Ampere for AI and HPC workloads, while the A100 is an Ampere-based product.
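To put the generational gap in rough numbers, the sketch below compares approximate publicly quoted peak specs for the two parts; these figures are my assumption (they vary by SKU and precision mode) and are not taken from the article.

```python
# Approximate publicly quoted peak specs for the two generations (dense tensor
# throughput, flagship SXM parts). ASSUMED figures for illustration only.
specs = {
    "A100 (Ampere)": {"bf16_tflops": 312, "memory": "80 GB HBM2e"},
    "H100 (Hopper)": {"bf16_tflops": 989, "memory": "80 GB HBM3"},
}

ratio = specs["H100 (Hopper)"]["bf16_tflops"] / specs["A100 (Ampere)"]["bf16_tflops"]
print(f"H100 peak BF16 throughput is roughly {ratio:.1f}x that of the A100")  # ~3.2x
```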
Ten days later, on January 26, New York Governor Kathy Hochul announced that Tesla would invest $500 million to build a Dojo supercomputer in Buffalo, New York.