NVIDIA: Cracks in the Empire
The outside world often labors under the illusion that Intel is a successful hardware company simply because its CPUs sell well. In fact, what underpins Intel's dominance in desktop processors is the x86 architecture, born in 1978.
The same illusion exists with Nvidia.
Behind NVIDIA's near-monopoly of the artificial intelligence training chip market, the CUDA architecture is unquestionably one of the heroes working behind the scenes.
Born in 2006, this architecture has worked its way into nearly every corner of computing and has, in effect, been molded in NVIDIA's image: some 80% of research in aerospace, the life sciences, mechanical and fluid simulation, and energy exploration is conducted on top of CUDA.
In the hottest field of AI, almost all major manufacturers are preparing Plan B: Google, Amazon, Huawei, Microsoft, OpenAI, Baidu... No one wants to let their future be in the hands of others.
Dealroom.co, a startup advisory and data firm, has released a set of figures: in this wave of generative AI, the United States has taken 89% of global investment and financing, while in AI chip investment and financing, China ranks first in the world, with more than twice the total of the United States.
In other words, although Chinese and American companies differ considerably in how they develop large models and how far along they are, they are remarkably consistent about one thing: keeping computing power in their own hands.
In 2003, to compete with Intel, which had launched a 4-core CPU, NVIDIA began developing its Compute Unified Device Architecture technology, better known as CUDA.
CUDA's original purpose was to give the GPU an easy-to-use programming interface so that developers would no longer need to learn complex shading languages or graphics-processing primitives. NVIDIA's initial idea was to give game developers a tool for graphics computing, which is what Jensen Huang calls "make graphics programmable."
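To make that concrete, here is a minimal sketch, not from the original article, of what CUDA's general-purpose interface looks like from the developer's side, written here in Python via the Numba CUDA bindings rather than in CUDA C. The kernel name and array sizes are illustrative, and it assumes an NVIDIA GPU with the CUDA toolkit and Numba installed.

```python
# Minimal illustration of CUDA's general-purpose programming model using
# Numba's CUDA bindings (illustrative; assumes an NVIDIA GPU, the CUDA
# toolkit, and `pip install numba numpy`).
import numpy as np
from numba import cuda

@cuda.jit
def vector_add(a, b, out):
    i = cuda.grid(1)          # global thread index across the whole grid
    if i < out.size:          # guard against threads past the end of the array
        out[i] = a[i] + b[i]

n = 1_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = np.zeros_like(a)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
vector_add[blocks, threads_per_block](a, b, out)  # Numba copies the arrays to and from the GPU

assert np.allclose(out, a + b)
```

The point is the absence of any graphics machinery: the developer writes an ordinary array computation and only decides how to split it across threads and blocks.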
However, after CUDA launched, no killer application emerged and support from major customers was lacking, while NVIDIA had to spend heavily on developing applications, maintaining services, and marketing the platform. When the 2008 financial crisis hit, NVIDIA's revenue plummeted on weak graphics card sales, and its stock price at one point fell to just $1.50, an even worse showing than AMD's.
It was not until 2012 that two of Hinton's students entered the ImageNet image recognition competition using NVIDIA GPUs. Training with GTX 580 graphics cards and CUDA, they finished dozens of times faster than the runner-up, with accuracy more than 10 percentage points higher.
What shocked the industry was not just the ImageNet result itself. This neural network, whose training consumed 14 million images and a total of 262 petaFLOPs of compute, used only four GTX 580s over roughly a week of training. For comparison, Google's "cat" experiment had used 10 million images, 16,000 CPU cores, and 1,000 machines.
That competition was not only a historic turning point for AI; it also opened a breakthrough for NVIDIA. NVIDIA began working with the industry to build out the AI ecosystem, promoting open-source AI frameworks and cooperating with companies such as Google and Facebook to advance AI technologies such as TensorFlow.
This amounted to completing the second step in Jensen Huang's words: "open up GPU for programmability for all kinds of things."
Once the value of GPU computing power was recognized, the major players suddenly realized that CUDA, which NVIDIA had spent years iterating and laying down, had become a high wall that AI could not go around.
To build the CUDA ecosystem, NVIDIA provides developers with a rich set of libraries and tools, such as cuDNN, cuBLAS, and TensorRT, for tasks such as deep learning, linear algebra, and inference acceleration. NVIDIA also supplies a complete development toolchain, including the CUDA compiler and optimizer, making GPU programming and performance tuning far more convenient.
At the same time, NVIDIA works closely with the popular deep learning frameworks (such as TensorFlow, PyTorch, and MXNet), giving CUDA a significant advantage in deep learning workloads.
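To show what that integration feels like in practice, here is a hedged sketch of a framework-level workflow in PyTorch, where the CUDA, cuBLAS, and cuDNN calls stay hidden behind a single device switch. It assumes a CUDA-enabled PyTorch build; the model and tensor sizes are arbitrary.

```python
# Sketch: how CUDA, cuBLAS, and cuDNN sit invisibly under a deep learning
# framework. Assumes a CUDA-enabled PyTorch build; sizes are arbitrary.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
print(torch.backends.cudnn.is_available())   # cuDNN is picked up automatically
torch.backends.cudnn.benchmark = True        # let cuDNN autotune convolution algorithms

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolutions dispatch to cuDNN
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 32 * 32, 10),                 # matrix multiplies dispatch to cuBLAS
).to(device)

x = torch.randn(8, 3, 32, 32, device=device)
y = model(x)                                     # executes as CUDA kernels on the GPU
print(y.shape)
```

The developer never touches CUDA directly; swapping the hardware underneath would mean re-plumbing exactly this invisible layer, which is where the lock-in lives.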
This "help them onto the horse, then walk the first mile with them" dedication enabled NVIDIA to double the number of developers in the CUDA ecosystem in just two and a half years.
And that was not all. Over the past decade, NVIDIA has brought CUDA teaching courses to more than 350 universities, and the platform hosts professional developers and domain experts who, by sharing their experience and solving hard problems, provide rich support for CUDA applications.
More importantly, NVIDIA understands that hardware alone makes a weak moat because it has no user stickiness, so it bundles hardware with software: GPU rendering requires CUDA, AI denoising requires OptiX, autonomous-driving computation requires CUDA, and so on.
Although NVIDIA currently holds roughly 90% of the AI computing power market with GPU + NVLink + CUDA, more than one crack has already appeared in the empire.
To say that AI companies have long suffered under CUDA is no exaggeration.
CUDA's magic lies in its position at the junction of software and hardware. On the software side, it is the cornerstone of the entire ecosystem, and it is hard for competitors to be compatible with NVIDIA's ecosystem while bypassing CUDA. On the hardware side, CUDA is essentially NVIDIA's software abstraction of its own hardware: each core CUDA concept corresponds closely to a concept in the GPU hardware.
That leaves competitors with only two options:
1 Bypass CUDA and rebuild a software ecosystem from scratch, which means confronting the enormous challenge of NVIDIA's user stickiness;
2 Be compatible with CUDA, which brings two problems of its own. First, if your hardware design differs from NVIDIA's, the implementation may be inefficient and awkward. Second, CUDA evolves along with NVIDIA's hardware features, so a compatible implementation can only follow behind.
But in order to escape NVIDIA's grip, some have tried both options.
In 2016, AMD launched ROCm, an open-source GPU computing ecosystem whose HIP toolchain aims for compatibility with CUDA, which is the "follow" route.
However, constrained by a thin toolchain and library ecosystem and the high cost of developing and iterating compatibility layers, ROCm has struggled to grow. On GitHub, more than 32,600 developers contribute to CUDA software package repositories, versus fewer than 600 for ROCm.
The difficulty of the CUDA-compatible route is that its updates can never keep pace with CUDA, and full compatibility is hard to achieve:
1 Iteration is always a step behind: NVIDIA GPUs iterate quickly on microarchitecture and instruction sets, and much of the upper software stack must be updated with matching features. But AMD cannot know NVIDIA's product roadmap, so its software updates will always lag NVIDIA's. For example, AMD may have just announced support for CUDA 11 when NVIDIA releases CUDA 12.
2 Full compatibility multiplies the developers' workload: a large software architecture like CUDA's is extremely complex, and AMD would need to invest enormous manpower and resources over years, even a decade or more, to catch up. Because functional differences are unavoidable, imperfect compatibility hurts performance (even at 99% similarity, resolving the remaining 1% of differences may consume 99% of a developer's time).
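To illustrate what the "follow" route looks like from a user's side, here is a hedged sketch based on PyTorch's ROCm builds, which reuse the familiar torch.cuda API with HIP substituted underneath so that most CUDA-targeted Python code runs unchanged. It assumes a ROCm build of PyTorch running on an AMD GPU; it is an illustration of the compatibility approach, not a claim about AMD's internals or roadmap.

```python
# Sketch of the "compatibility" route: on a ROCm build of PyTorch,
# the familiar torch.cuda API is backed by HIP instead of CUDA.
# Assumes PyTorch built for ROCm and an AMD GPU.
import torch

print(torch.version.hip)          # set on ROCm builds, None on CUDA builds
print(torch.version.cuda)         # None on ROCm builds
print(torch.cuda.is_available())  # True even though the backend is HIP

device = "cuda"                   # same device string as on NVIDIA hardware
a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)
c = a @ b                         # dispatched to rocBLAS instead of cuBLAS
print(c.device)
```

The code stays the same, which is exactly the attraction and the trap: the compatible vendor inherits NVIDIA's interface and must chase every change NVIDIA makes to it.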
Other companies choose to bypass CUDA altogether, such as Modular, founded in January 2022.
Modular's idea is to lower the barrier to entry as far as possible, though it is more of a flanking move: it proposes an AI engine "for improving the performance of artificial intelligence models" and uses a "modular" approach to solve the problem that "current AI application stacks are often coupled to specific hardware and software."
To go with this AI engine, Modular also developed the open-source programming language Mojo. Think of it as a programming language "designed for AI": Modular uses it to build the tools that plug into the aforementioned engine, and it interoperates seamlessly with Python, which lowers the learning cost.
Modular's problem, however, is that the "full-platform development tool" it envisions is too idealistic.
Even carrying the banner of "beyond Python" and backed by Chris Lattner's reputation, Mojo is a new language that still has to prove itself to a large body of developers as it is promoted.
The AI engine faces even more problems: it must reach agreements with many hardware vendors and handle compatibility across platforms, work that will take a long period of polishing to complete. By then, who knows what NVIDIA will have evolved into.
On October 17, the United States updated its export controls on AI chips, barring companies such as NVIDIA from exporting advanced AI chips to China. Under the new rules, NVIDIA's chip exports to China, including the A800 and H800, will be affected.
After NVIDIA's A100 and H100 were restricted from export to China, the cut-down A800 and H800 supplied exclusively to China were designed to comply with those rules, and Intel likewise launched the Gaudi2 AI chip for the Chinese market. Now companies will have to adjust again in response to the new round of export bans.
In August this year, the Mate 60 Pro, equipped with Huawei's self-developed Kirin 9000S chip, suddenly went on sale and instantly set off a wave of public attention, drowning out another piece of news that broke at almost the same time.
Liu Qingfeng, chairman of iFLYTEK, made a rare statement at a public event: Huawei's GPU can now be benchmarked against NVIDIA's A100, but only because Huawei dispatched a dedicated working group to iFLYTEK to carry out optimization work on site.
Such an abrupt statement usually carries deeper intent. Even if nothing could be predicted at the time, its effect was to answer, in advance, the chip ban that arrived two months later.
What is called "Huawei's GPU" is really the Ascend AI full-stack software and hardware platform. From bottom to top, the stack has five layers: Atlas series hardware, the heterogeneous compute architecture, the AI framework, application enablement, and industry applications.
In essence, Huawei has built a full set of substitutes for NVIDIA's offering: the chip layer is the Ascend 910 and Ascend 310, and the heterogeneous compute architecture CANN is benchmarked against NVIDIA's CUDA + cuDNN core software layer.
Of course, gaps remain. Practitioners in the field summarize two:
1 Single-card performance lags. There is still a gap between the Ascend 910 and the A100, but Ascend's advantage is that it is cheap and can be stacked; at cluster scale the overall gap is not large;
2 The ecosystem disadvantage is real, but Huawei is working hard to catch up. For example, through cooperation between the PyTorch community and Ascend, PyTorch 2.1 now supports the Ascend NPU out of the box, meaning developers can build models for Ascend directly on PyTorch 2.1 (see the sketch below).
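A minimal sketch of what that support looks like in practice, assuming Huawei's torch_npu adapter package, the CANN toolkit, and Ascend hardware are installed; exact API details may vary by torch_npu version.

```python
# Hedged sketch: running a PyTorch 2.1 model on an Ascend NPU through
# Huawei's torch_npu adapter. Assumes Ascend hardware, CANN, and the
# torch_npu package are installed; details may differ by version.
import torch
import torch_npu  # registers the "npu" device backend with PyTorch

device = "npu:0" if torch.npu.is_available() else "cpu"

model = torch.nn.Linear(128, 10).to(device)  # moves the weights onto the NPU
x = torch.randn(4, 128, device=device)
y = model(x)                                 # executed by CANN operators
print(y.device)
```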
At present, Huawei Ascend mainly runs Huawei's own closed-loop large-model products. Any public model must first undergo deep optimization by Huawei before it can run on Huawei's platform, and that optimization work depends heavily on Huawei itself.
In the current environment, Ascend carries a special significance.
In May this year, Zhang Dixuan, president of Huawei's Ascend Computing business, revealed that the Ascend AI base software and hardware platform had incubated and adapted more than 30 mainstream large models, and that more than half of China's home-grown large models are built on the Ascend AI platform, including the Pengcheng, Zidong, and Huawei Cloud Pangu series. In August, Baidu also officially announced that it is adapting the PaddlePaddle + Wenxin (ERNIE) large-model stack to Ascend AI.
And according to a chart circulating online, China's AI supercomputing centers run essentially on Ascend, apart from those whose hardware has not been disclosed. It is said that after the new round of chip restrictions, 30-40% of Huawei's chip production capacity will be reserved for Ascend clusters, with the rest going to Kirin.
In 2006, when NVIDIA launched this grand narrative, no one thought CUDA would be a revolutionary product. Jensen Huang had to persuade the board to invest US$500 million a year in a bet with a payback period of more than ten years, at a time when NVIDIA's annual revenue was only about US$3 billion.
But in every business story whose keywords are technology and innovation, there are always those who achieve great success by persisting in long-term goals, and NVIDIA and Huawei are among the best examples.
-END-
The content of this article is for communication and learning purposes only. If you have any questions, please contact us at info@gsi24.com.