NVIDIA's CUDA monopoly is hard to defend: PyTorch keeps tearing down its towers, and OpenAI is already raiding its base

Updated: 2023-01-28 12:20
James, Alex | Posted from Aofei Temple
Qubit | Official account QbitAI

Nvidia's software moat is disappearing.

As PyTorch adds support for more GPU makers and OpenAI's Triton shakes up the market, CUDA, the most powerful weapon in Nvidia's hands, is gradually losing its edge.

That argument comes from Dylan Patel, chief analyst at SemiAnalysis, and his article has drawn a wave of attention across the industry.

Some netizens commented after reading it:

Nvidia ended up in this position because it gave up on innovation for the sake of immediate profits.

Sasank Chilamkurthy, one of the PyTorch authors, also chimed in:

When Nvidia offered to acquire Arm, I was deeply uneasy about the potential monopoly. So I started doing what any sane person would do: get CUDA removed from the leading AI framework.

Let's walk through the reasons Patel gives.

PyTorch wins the AI framework contest and will support more GPUs

First, a quick look back at CUDA's glory days.

CUDA is a parallel computing framework launched by NVIDIA.

CUDA was a turning point in NVIDIA's history: its arrival let the company take off rapidly in the AI chip field.

Before CUDA, NVIDIA's GPU was just a "graphics processing unit" responsible for drawing images on the screen.

CUDA lets developers harness the GPU not just for graphics but for hardware-accelerated general-purpose computation, giving the GPU the ability to tackle complex computing problems and letting customers program the processor for different tasks.

Beyond ordinary PCs, GPUs sit in many popular devices: autonomous vehicles, robots, supercomputers, VR headsets and more. And for a long time, only NVIDIA's GPUs could handle all these complex AI tasks quickly.

So how did the once-celebrated CUDA begin to lose its standing?

The story starts with the contest between AI development frameworks, above all PyTorch vs. TensorFlow.

If PyTorch and the other frameworks are cars, then CUDA is the gearbox: it accelerates the framework's computation, so that when PyTorch runs on an NVIDIA GPU, deep learning models train and run faster.

TensorFlow arrived earlier and is Google's weapon of choice, but over the past two years its momentum has gradually been overtaken by PyTorch. At several major conferences, the share of papers using PyTorch has risen markedly:

Image source: The Gradient; the proportion of papers specifically mentioning PyTorch at several top conferences

There are testimonials from heavy TensorFlow users too: "I've switched to PyTorch now."

A key factor in PyTorch's win is that it is more flexible and easier to use than TensorFlow.

On the one hand, this is down to PyTorch's eager mode, which lets you modify the model at run time and see the result of each operation immediately. TensorFlow now has an eager mode of its own, but most large tech companies are already building their solutions around PyTorch. (It hurts my heart...)
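
For instance, in eager mode every line executes immediately, so intermediate results can be printed and ordinary Python control flow can branch on tensor values. A minimal illustrative sketch (not code from the article):

```python
import torch

x = torch.randn(3, 3)
y = x @ x.T                 # runs right away; no graph to build, no session to launch
print(y)                    # intermediate results can be inspected immediately

if y.diagonal().sum() > 0:  # ordinary Python control flow can depend on tensor values
    y = torch.relu(y)
```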

On the other hand, although both are driven from Python, PyTorch is the more comfortable to write.

In addition, PyTorch has more models available and a richer ecosystem. By one count, 85% of the large models on Hugging Face are implemented in PyTorch.

In the past, however fierce the competition between AI development frameworks got, the parallel computing architecture beneath them, CUDA, could be considered the undisputed ruler.

But times have changed. PyTorch has finally beaten the former leader TensorFlow in the framework contest, and with its position temporarily secure, it has started to stir up trouble.

In recent years, PyTorch has been expanding its GPU support. The soon-to-be-released first stable version of PyTorch 2.0 will further improve support for other GPUs and accelerators, including those from AMD, Intel, Tesla, Google, Amazon, Microsoft, Meta and others.

In other words, NVIDIA's GPUs are no longer the only game in town.

But behind this there are actually problems with CUDA itself.

The memory wall is a problem

As mentioned above, the rise of CUDA and the machine learning wave fed off each other and grew together. But one phenomenon deserves attention:

In recent years, NVIDIA's leading hardware has kept adding FLOPS, but its memory has improved far less. Take the V100, the most advanced GPU when it trained BERT in 2018: since then, FLOPS has grown by an order of magnitude, while memory capacity has barely moved.

Source: semianalysis

In actual AI model training, as the model gets larger, the memory requirements also increase.

For example, Baidu and Meta require tens of terabytes of memory to store massive embedding tables when deploying production recommendation networks.

During training and inference, a lot of time is not actually spent on matrix multiplication calculations, but on waiting for data to arrive at computing resources.

So why not get more memory?

In short, the "cash power" isn't there: it simply costs too much.

Generally speaking, a memory system arranges its resources in a hierarchy from "near and fast" to "far, slow and cheap", according to how the data is used. The nearest shared memory pool usually sits on the same die and is typically built from SRAM.

In machine learning, some ASICs try to hold the model weights in one huge SRAM. That approach falls short once weights reach the 100B+ scale: even a wafer-scale chip costing around $5 million offers only 40GB of SRAM.

On Nvidia's GPUs the on-chip memory is smaller still: just 40MB on the A100 and 50MB on the next-generation H100. At mass-production prices, on-chip SRAM costs as much as $100 per GB.

And the bill doesn't end there. The cost of on-chip SRAM is no longer falling much as Moore's Law advances; on TSMC's next-generation 3nm process, the same 1GB will actually cost more.

DRAM costs far less than SRAM, but its latency is an order of magnitude higher, and DRAM prices have barely fallen since 2012.

As AI continues to develop, the demand for memory will increase, and this is how the memory wall problem was born.

DRAM now accounts for around 50% of a server's total cost. Take Nvidia's 2016 P100 versus the latest H100: compute performance has grown 46-fold, but memory capacity only 5-fold.

NVIDIA H100 Tensor Core GPU

Another issue also related to memory is bandwidth.

In computation, higher memory bandwidth is obtained through parallelism. To get it, NVIDIA uses HBM (High Bandwidth Memory), a structure built from 3D-stacked DRAM layers, but its packaging is more expensive, leaving buyers with tight budgets able to do little more than look on.

As mentioned earlier, one of PyTorch's major advantages is that eager mode makes AI training and inference more flexible and easier to use. But it also puts heavy demands on memory bandwidth.

Operator fusion is the main way to attack these problems. As the name suggests, intermediate results are not written out to memory one by one; instead, multiple functions are computed in a single pass, cutting the amount of memory reading and writing.

Operator fusion. Image source: horace.io/brrr_intro.html
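
To make the idea concrete, here is a toy sketch of the memory-traffic argument (the functions are illustrative, not from the article):

```python
import torch

x = torch.randn(10_000_000, device="cuda" if torch.cuda.is_available() else "cpu")

# Unfused eager execution: three separate kernels, each round-tripping through memory.
def unfused(x):
    a = torch.sin(x)   # read x, write a
    b = torch.cos(a)   # read a, write b
    return b * 2.0     # read b, write result  -> roughly 6N elements of memory traffic

out = unfused(x)

# A fused kernel would read x once and write the result once (about 2N elements of
# traffic), keeping a and b in registers. Writing that kernel by hand means CUDA/C++;
# compilers such as TorchInductor or Triton (covered below) can generate it instead.
```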

To implement "operator fusion", you need to write a custom CUDA kernel and use C++ language.

This is where CUDA's drawback shows: for many people, writing CUDA is far harder than writing Python scripts...

By contrast, PyTorch 2.0 lowers this bar considerably. It comes with NVIDIA's and third-party libraries built in, so there is no need to learn CUDA specifically: operators can be added directly in PyTorch, which is naturally far friendlier to "alchemists" (as model trainers jokingly call themselves).

Of course, this has also meant PyTorch accumulating a large number of operators over the years, at one point more than 2,000 (tongue firmly in cheek).

PyTorch 2.0, the upgrade unveiled at the end of 2022, pours its effort into compilation.

Thanks to the addition of a compiled graph-execution mode, the framework's training performance on the A100 improves by 86%, and CPU inference performance by 26%.

In addition, PyTorch 2.0 relies on PrimTorch to shrink the original 2,000-plus operators down to roughly 250, making it easier for non-NVIDIA backends to target the framework, and on TorchInductor to automatically generate fast code for multiple accelerators and backends.
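
In practice, opting into this compile path is a one-liner. A minimal sketch, assuming PyTorch 2.0 or later (the model here is just a placeholder):

```python
import torch

# Placeholder model; torch.compile traces it with TorchDynamo, decomposes it into
# the smaller PrimTorch/ATen operator set, and lets TorchInductor (or another
# backend) emit fused kernels for the target hardware.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.GELU(),
    torch.nn.Linear(1024, 1024),
)
compiled_model = torch.compile(model)  # one-line opt-in; eager code stays unchanged

x = torch.randn(16, 1024)
y = compiled_model(x)  # first call triggers compilation; later calls reuse the compiled graph
```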

Moreover, PyTorch 2.0 can better support data parallelism, sharding, pipeline parallelism and tensor parallelism, making distributed training smoother.
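
As one example of the distributed side, the classic data-parallel setup looks roughly like this. A minimal sketch: it assumes a multi-GPU NVIDIA machine launched with `torchrun --nproc_per_node=N`, and the model and sizes are placeholders:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")          # one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])       # set by torchrun
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda(local_rank)
ddp_model = DDP(model, device_ids=[local_rank])  # wrap the model for data parallelism
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)

x = torch.randn(32, 1024, device=local_rank)
loss = ddp_model(x).sum()
loss.backward()                                  # gradients are all-reduced across ranks
optimizer.step()

dist.destroy_process_group()
```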

It is these technologies, together with support for GPUs and accelerators from vendors other than NVIDIA, that make the software wall CUDA once built for NVIDIA look far less unassailable.

A replacement is creeping up from behind

Nvidia's memory improvements haven't kept pace, and PyTorch 2.0 is already making trouble, but that's still not the end of it:

OpenAI has launched a "simplified version of CUDA": Triton. (A raid straight on the home base.)

Triton is a new language and compiler: it is easier to work with than CUDA, yet its performance is comparable.

OpenAI claims:

With only 25 lines of code, Triton can achieve performance comparable to cuBLAS in FP16 matrix multiplication.

OpenAI researchers have already used Triton to produce kernels up to twice as efficient as equivalent Torch implementations.
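
To give a feel for the language, here is a minimal vector-addition kernel written in the style of Triton's own tutorials (a sketch assuming the `triton` package and an NVIDIA GPU, not OpenAI's matmul example):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                    # each program instance handles one block
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                    # guard the ragged last block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n_elements = out.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out

x = torch.rand(98432, device="cuda")
y = torch.rand(98432, device="cuda")
print(torch.allclose(add(x, y), x + y))  # matches the eager PyTorch result
```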

Although Triton currently only officially supports NVIDIA GPUs, this architecture will also support multiple hardware vendors in the future.

It is also worth mentioning that Triton is open source. Compared with closed-source CUDA, other hardware accelerators can be directly integrated into Triton, greatly reducing the time to build an AI compiler stack for new hardware.

But having said that, some people feel that CUDA's monopoly is far from being broken. For example, Soumith Chintala, another author of PyTorch and a distinguished engineer at Meta, feels:

This article (by analyst Patel) exaggerates the situation; in reality, CUDA will remain a key architecture that PyTorch relies on.

Triton is not the first (optimizing) compiler. Currently, most attention is still paid to the XLA compiler.

He said that it is still unclear whether Triton will be gradually accepted by everyone. This will have to be verified by time. In short, Triton does not pose much threat to CUDA.

The author of the article, Patel himself, also saw this comment and replied:

I didn't say that (CUDA's monopoly position) has been broken, but that it is breaking.

Moreover, Triton currently only officially supports NVIDIA GPUs (its performance on other GPUs is untested). And if XLA has no performance advantage on NVIDIA GPUs, it may well be worse than Triton.

But Soumith Chintala countered that it is premature to say CUDA's position is slipping: promoting Triton across hardware still carries plenty of risk, and there is a long way to go.

Some netizens are on the same side as the PyTorch author:

I'd like to see the monopoly broken too, but for now CUDA is still on top. Without it, the software and systems many people have built simply wouldn't run.

So, what do you think of the current situation of CUDA?

Reference links:
[1]https://www.semianalysis.com/p/nvidiaopenaitritonpytorch
[2]https://analyticsindiamag.com/how-is-openais-triton-different-from-nvidia-cuda/
[3]https://pytorch.org/features/
[4]https://news.ycombinator.com/item?id=34398791
[5]https://twitter.com/soumithchintala/status/1615371866503352321
[6]https://twitter.com/sasank51/status/1615065801639489539

-over-
