Behind ChatGPT is Microsoft's super expensive supercomputer

Publisher: 大树下的大白菜y | Last updated: 2023-03-21 | Source: 新智元 | Author: Lemontree

[Introduction] Behind ChatGPT is Microsoft's super expensive supercomputer, which cost hundreds of millions of dollars and uses tens of thousands of chips.

ChatGPT could become the world-famous, top-tier model it is today thanks to the enormous computing power behind it.

Available estimates put ChatGPT's total training compute at about 3,640 PF-days (petaflop/s-days): at one quadrillion floating-point operations per second, the computation would take 3,640 days.
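To put that figure in perspective, here is a quick back-of-the-envelope conversion in Python (the conversion is ours, not the article's):

```python
# Convert the quoted 3,640 PF-days into total floating-point operations.
PFLOP_PER_SEC = 1e15      # one petaflop/s = 10^15 operations per second
SECONDS_PER_DAY = 86_400

total_flops = 3640 * PFLOP_PER_SEC * SECONDS_PER_DAY
print(f"{total_flops:.2e} FLOPs")  # ≈ 3.14e+23 operations in total
```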

So, how did the supercomputer that Microsoft built specifically for OpenAI come into being?

On Monday, Microsoft published two posts on its official blog detailing this super expensive supercomputer and a major upgrade to Azure: the addition of thousands of Nvidia's most powerful H100 GPUs and faster InfiniBand interconnect technology.

Alongside this, Microsoft officially announced the ND H100 v5 virtual machine, with the following specifications:

Eight NVIDIA H100 Tensor Core GPUs connected via next-generation NVSwitch and NVLink 4.0

400 Gb/s NVIDIA Quantum-2 CX7 InfiniBand per GPU and 3.2 Tb/s non-blocking fat-tree networking per VM

NVSwitch and NVLink 4.0 with 3.6TB/s bidirectional bandwidth between 8 local GPUs per VM

4th Gen Intel Xeon Scalable processors

PCIe Gen5 host-to-GPU interconnect with 64 GB/s of bandwidth per GPU

16-channel 4800 MHz DDR5 DIMMs
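As a rough illustration of what that eight-GPU, NVSwitch-connected topology looks like from inside the VM, the PyTorch sketch below enumerates the local GPUs and checks pairwise peer access. This snippet is ours, assuming PyTorch and the NVIDIA driver are installed; it is not from Microsoft's announcement:

```python
import torch

# Count the local GPUs exposed to the VM (eight H100s on ND H100 v5).
n = torch.cuda.device_count()
print(f"visible GPUs: {n}")

# Check pairwise peer access; on an NVSwitch-connected node every GPU
# pair should report True (traffic stays on NVLink rather than PCIe).
for i in range(n):
    peers = [torch.cuda.can_device_access_peer(i, j) for j in range(n) if j != i]
    print(f"GPU {i} can reach all peers: {all(peers)}")
```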

Hundreds of millions of dollars of computing power

About five years ago, OpenAI approached Microsoft with a bold idea — to build a system that could forever change the way humans and machines interact.

At the time, no one could have imagined what this would mean: an AI that could create any picture a human described in plain language, and that would let people write poems, lyrics, essays, emails, menus... simply by chatting.

To build this system, OpenAI needs a lot of computing power—the kind that can truly support ultra-large-scale computing.

But the question is, can Microsoft do it?

After all, nothing that could meet OpenAI's needs existed at the time, and it was far from certain that building such a large supercomputer inside the Azure cloud service would not simply crash the system.

Then, Microsoft began a difficult period of exploration.

To build the supercomputer supporting the OpenAI project, it spent hundreds of millions of dollars connecting tens of thousands of Nvidia A100 chips together on the Azure platform and retrofitting server racks.

In addition, to tailor this supercomputing platform to OpenAI, Microsoft stayed in close contact with the company so it always knew OpenAI's most critical requirements for training AI.

How much does such a large project cost? Scott Guthrie, Microsoft's executive vice president for cloud computing and artificial intelligence, would not reveal the exact amount, but he said it was "probably more than" a few hundred million dollars.

The problem posed by OpenAI

Phil Waymouth, Microsoft's executive in charge of strategic partnerships, pointed out that the scale of cloud computing infrastructure required for OpenAI training models is unprecedented in the industry.

It would require exponentially growing clusters of networked GPUs, beyond anything anyone in the industry had attempted to build.

Microsoft was determined to work with OpenAI because it firmly believed that infrastructure of this unprecedented scale would change history, create a new kind of AI and a new platform, and deliver products and services that genuinely benefit customers.

Now it seems those hundreds of millions of dollars were clearly not spent in vain: the bet paid off.

On this supercomputer, OpenAI was able to train increasingly powerful models and unlock the astonishing capabilities of AI tools. ChatGPT, which has all but kicked off a fourth industrial revolution for humanity, was born.

Microsoft was very satisfied and invested another $10 billion in OpenAI in early January.

It can be said that Microsoft's ambition to break through the boundaries of AI supercomputing has paid off. What lies behind this is the transformation from laboratory research to AI industrialization.

At present, Microsoft's AI-powered software empire has begun to take shape.

The ChatGPT version of Bing can help us plan a holiday; the chatbot in Viva Sales can help salespeople write emails; GitHub Copilot can help complete code; and the Azure OpenAI Service gives us access to OpenAI's large language models along with Azure's enterprise-grade features.

Joining forces with NVIDIA

In fact, back in November last year, Microsoft officially announced that it would partner with Nvidia to build "one of the most powerful AI supercomputers in the world" to handle the enormous computing load required to train and scale AI.

The supercomputer is based on Microsoft's Azure cloud infrastructure and uses tens of thousands of Nvidia H100 and A100 Tensor Core GPUs together with Nvidia's Quantum-2 InfiniBand networking platform.

Nvidia said in a statement that the supercomputer can be used to research and accelerate generative AI models such as DALL-E and Stable Diffusion.

As AI researchers begin to use more powerful GPUs to handle more complex AI workloads, they see greater potential for AI models that can understand nuance well enough to handle many different language tasks simultaneously.

In simple terms, the bigger the model, the more data you have, and the longer you can train it, the better the accuracy of the model will be.
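A widely used rule of thumb from the scaling-law literature (an approximation we are adding here, not a figure from the article) is that training cost is roughly six FLOPs per parameter per training token, which makes the model/data/compute trade-off concrete:

```python
# Rough training-compute estimate: C ≈ 6 * N * D FLOPs
# (N = parameter count, D = training tokens; a standard approximation,
# not a number published by Microsoft or OpenAI).
def training_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

# Example: a 175-billion-parameter model trained on 300 billion tokens.
print(f"{training_flops(175e9, 300e9):.2e} FLOPs")  # ≈ 3.15e+23
```

Note that this illustrative example lands very close to the 3.14e+23 FLOPs implied by the 3,640 PF-day figure quoted earlier.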

But these larger models quickly reach the limits of existing computing resources, and Microsoft understands what kind of supercomputer OpenAI needs and how big it needs to be.

This is obviously not something you can achieve by simply buying a pile of GPUs, wiring them together, and letting them run.

“We need to be able to train bigger models for longer periods of time, which means not only do you need to have the biggest infrastructure, you also have to have it run reliably for a long time,” said Nidhi Chappell, head of Microsoft Azure high-performance computing and artificial intelligence products.

Microsoft has to make sure it can cool all those machines and chips, using outside air in cooler climates and evaporative coolers in hotter ones, said Alistair Speirs, Azure's director of global infrastructure.

And because all the machines switch on at the same time, Microsoft had to think carefully about where to place them, much as you would in your kitchen before running the microwave, the toaster, and the vacuum cleaner at once, only at data-center scale.

Large-scale AI training

What is the key to achieving these breakthroughs?

The challenge is how to build, operate, and maintain tens of thousands of co-located GPUs interconnected on a high-throughput, low-latency InfiniBand network.

This scale far exceeded anything GPU and network equipment vendors had ever tested; it was completely uncharted territory, and no one knew whether the hardware would hold up.

Chappell explained that during LLM training, the large-scale computation involved is typically divided among the thousands of GPUs in a cluster.

In the phase known as allreduce, the GPUs exchange information about the work they have done; this exchange is accelerated over the InfiniBand network so that the GPUs can finish before the next chunk of computation begins.
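As a minimal sketch of what that allreduce step looks like in practice, the snippet below uses PyTorch's torch.distributed with the NCCL backend, one common way to drive NVLink and InfiniBand. The script name and launch command are illustrative assumptions, not OpenAI's actual training code:

```python
import torch
import torch.distributed as dist

# Minimal allreduce sketch; launch with, e.g.:
#   torchrun --nproc_per_node=8 allreduce_demo.py
dist.init_process_group(backend="nccl")  # NCCL runs over NVLink/InfiniBand
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

# Each rank holds a local gradient; all_reduce sums it across every GPU
# so that all ranks end up with the same averaged result.
grad = torch.full((1024,), float(rank), device="cuda")
dist.all_reduce(grad, op=dist.ReduceOp.SUM)
grad /= dist.get_world_size()  # average over all ranks

if rank == 0:
    print(grad[0].item())
dist.destroy_process_group()
```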

Chappell said that since these tasks span thousands of GPUs, achieving peak performance requires not only a reliable infrastructure but also a great deal of system-level optimization, the product of many generations of experience.

So-called system-level optimization includes software that can effectively utilize GPUs and network devices.

Over the past few years, Microsoft has developed this technology to grow the ability to train models with tens of trillions of parameters while reducing the resource requirements and time to train and serve these models in production.
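The article does not name this software, but Microsoft's open-source DeepSpeed library, with its ZeRO optimizer sharding, is a published example of exactly this kind of system-level optimization. Below is a minimal sketch assuming DeepSpeed is installed and the script is started with the deepspeed launcher; all configuration values are illustrative:

```python
import deepspeed
import torch.nn as nn

# Illustrative ZeRO stage-2 config: optimizer states and gradients are
# sharded across GPUs to cut per-device memory. Values are examples only.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "bf16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 2},
}

model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

# deepspeed.initialize wraps the model in a training engine that handles
# sharding, mixed precision, and the distributed backend.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```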

Waymouth noted that Microsoft and partners have also been gradually adding capacity to GPU clusters, evolving InfiniBand networks and seeing how far they can push the data center infrastructure needed to keep GPU clusters running, including cooling systems, uninterruptible power systems and backup generators.

Eric Boyd, corporate vice president of Microsoft AI Platform, said this supercomputing capability, optimized for large-scale language model training and the next wave of AI innovation, is now directly available in the Azure cloud service.

Through its cooperation with OpenAI, Microsoft has accumulated a great deal of experience, so when other partners arrive wanting the same infrastructure, Microsoft can provide that too.

Microsoft's Azure data centers now cover more than 60 regions around the world.

New virtual machine: ND H100 v5

Microsoft has continued to improve on the above infrastructure.

Today, Microsoft announced new massively scalable virtual machines that integrate the latest NVIDIA H100 Tensor Core GPUs and NVIDIA Quantum-2 InfiniBand networks.

With these virtual machines, Microsoft can offer customers infrastructure that scales to the size of any AI task. According to Microsoft, Azure's new ND H100 v5 virtual machines give developers excellent performance while calling on thousands of GPUs at once.

Reviewing Editor: Li Qian
