The performance truth behind GPU utilization
This article is reproduced from | OneFlow
Author | Roanak Baviskar
In general, a common metric machine learning teams use to understand GPU usage is GPU utilization, usually checked by running nvidia-smi in a terminal. Many integrated observability tools also track GPU utilization as their primary performance metric.
However, the AI infrastructure team at Trainy found in practice that GPU utilization is not always the best indicator of GPU performance. In fact, you can reach 100% GPU utilization by reading and writing memory without doing any computation! Here, Roanak Baviskar explains how they discovered this, along with other findings made along the way.
(This article is compiled and published by OneFlow. Please contact OneFlow for authorization before reprinting. Source: https://trainy.ai/blog/gpu-utilization-misleading)
By Roanak Baviskar | Compiled by OneFlow
At Trainy, we manage infrastructure for GPU clusters, so we spend a lot of time thinking about these issues. Last year, we worked with a foundation model company to scale up and improve the efficiency of their LLM training. We followed almost all of the basic steps in the PyTorch performance tuning guide, namely (a rough code sketch of these changes follows the list):
- Take full advantage of the GPU by changing dataloader defaults (num_workers, batch_size, pin_memory, prefetch_factor, etc.)
- Maximize the use of Tensor Cores by using mixed precision (fp16, bf16)
- Use fused optimizers from apex/DeepSpeed (e.g. FusedAdam, FusedAdamW)
- Use instances/networking designed for training (H100 SXM, A100 SXM); also, if possible, prefer newer instances: H100 > A100 > V100
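As a rough, hedged sketch of what those changes look like in code (assuming `model` and `train_dataset` are already defined, that the model returns its loss directly, and picking illustrative values for the dataloader settings):

```python
import torch
from torch.utils.data import DataLoader

# Dataloader defaults tuned so the GPU is not starved for data.
train_loader = DataLoader(
    train_dataset,          # assumed to be defined elsewhere
    batch_size=64,          # raise until GPU memory is nearly full
    num_workers=8,          # parallel CPU-side loading
    pin_memory=True,        # faster host-to-device copies
    prefetch_factor=4,      # keep batches queued ahead of the GPU
)

# Fused optimizer (apex's FusedAdamW or DeepSpeed's equivalent work similarly).
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, fused=True)

model.cuda()
for batch in train_loader:
    batch = batch.cuda(non_blocking=True)
    optimizer.zero_grad(set_to_none=True)
    # Mixed precision so the matmuls run on Tensor Cores.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(batch)
    loss.backward()
    optimizer.step()
```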
These simple changes got us to 100% GPU utilization with excellent power efficiency, which is great! To check whether there was more room for improvement, we calculated the MFU (Model FLOPs Utilization) of our training workload.
MFU, or Model FLOPs Utilization, is one of the best metrics for understanding GPU performance; it was introduced in Google's PaLM paper. It is "the ratio of the observed throughput (tokens per second) to the theoretical maximum throughput of the system running at peak FLOPS". In simpler terms, it tells you how many floating point operations per second your workload actually performs compared to the maximum capability of the GPU. The only real downside of MFU is that it can be a bit harder to calculate than a metric like GPU utilization, because it is parameter- and framework-dependent.
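To make the arithmetic concrete, here is a minimal sketch of the MFU calculation, using the common approximation from the PaLM paper that a decoder-only transformer performs roughly 6 FLOPs per parameter per token; the model size, throughput, and peak FLOPS below are illustrative assumptions, not numbers from this article:

```python
def model_flops_utilization(tokens_per_second: float,
                            n_params: float,
                            peak_flops_per_second: float) -> float:
    """Observed FLOPs throughput divided by the hardware's theoretical peak."""
    # ~6 FLOPs per parameter per token (forward + backward), ignoring attention FLOPs.
    observed_flops_per_second = 6.0 * n_params * tokens_per_second
    return observed_flops_per_second / peak_flops_per_second

# Hypothetical example: a 7B-parameter model training at 2,500 tokens/s per GPU
# on an A100 with ~312 TFLOPS of bf16 peak compute.
mfu = model_flops_utilization(tokens_per_second=2_500,
                              n_params=7e9,
                              peak_flops_per_second=312e12)
print(f"MFU: {mfu:.1%}")   # ~33.7%
```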
Unfortunately, the above model training only achieved about 20% MFU. For reference, most current LLM training achieves about 35%-45% MFU. So the question becomes: how can we use only 20% of the theoretical maximum GPU computing power while achieving 100% GPU utilization?
To answer this question, we need to better understand what GPU utilization is actually tracking.
1. What exactly is GPU utilization?
GPU utilization is loosely defined in Nvidia documentation as "The utilization currently reported is for the GPU's compute resources and memory interfaces." This is very vague.
A better definition can (surprisingly) be found in Datadog's NVML documentation (https://docs.datadoghq.com/integrations/nvml/#metrics): "The percentage of time that one or more kernels were executing on the GPU over the past sample period." To see why this definition is misleading, we need to take a quick look at how GPUs work.
GPUs have cores and multiprocessing managers (https://cvw.cac.cornell.edu/gpu-architecture/gpu-characteristics/kernel_sm). In Nvidia GPUs, these multiprocessing managers are called streaming multiprocessors (SMs), while in AMD hardware they are called compute units (CUs). Below is a diagram of a GH100 GPU, which has a total of 144 SMs.
These multiprocessing managers can be thought of as foremen for a group of workers, in this case, cores. When you launch a CUDA kernel, the work is executed on CUDA cores by one or more SMs. As shown below, a single SM on a GH100 chip contains many CUDA cores.
This means that the metric GPU utilization only measures whether there is a kernel executing at a particular moment. It cannot indicate whether your kernel is fully utilizing all available cores, or whether you are parallelizing the workload to the maximum capacity of the GPU. In the most extreme case, you can get 100% GPU utilization by reading and writing to memory while only performing 0 FLOPS.
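As an illustration of that extreme case, here is a small, self-contained experiment (not from the article): a loop that only shuffles memory around on the GPU. While it runs, nvidia-smi will report utilization near 100%, because some kernel is resident during almost every sampling period, even though essentially no floating point math is happening.

```python
import torch

# Two large buffers on the GPU; the sizes are arbitrary, just big enough
# that each copy takes a noticeable amount of time.
src = torch.empty((8192, 8192), dtype=torch.float16, device="cuda")
dst = torch.empty_like(src)

# A strided (transposed) copy is dispatched as an elementwise CUDA kernel:
# pure memory reads and writes, zero FLOPs of useful math.
while True:
    dst.copy_(src.t())
```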
Now, to be clear: this is only misleading to people without a systems background (like many machine learning engineers). As mentioned here, the definition of GPU utilization does make sense under the "USE" (Utilization, Saturation, Errors) methodology.
But back to the problem at hand: this definition does explain the gap we saw between GPU utilization and MFU! There was indeed more performance to be had; we just needed to find it.
2. Digging Deeper
The next step in the search for more performance was, of course, to profile the model's training loop. We used the PyTorch Profiler to examine the training loop and get a better picture of what was happening.
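A minimal sketch of what that profiling setup can look like, assuming a `train_step(batch)` function and a `train_loader` exist; the schedule values are illustrative. The resulting trace can be opened in TensorBoard (or chrome://tracing), where per-kernel statistics such as estimated SM efficiency and achieved occupancy appear alongside the timeline:

```python
import torch
from torch.profiler import (
    ProfilerActivity, profile, schedule, tensorboard_trace_handler,
)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./profiler_logs"),
    record_shapes=True,
    with_stack=True,
) as prof:
    for step, batch in enumerate(train_loader):
        train_step(batch)       # one forward/backward/optimizer step
        prof.step()             # advance the profiler schedule
        if step >= 5:           # a handful of steps is enough for a trace
            break
```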
As shown in the figure below, the Softmax kernel showed high GPU utilization but a low value for a metric called SM efficiency. That was a red flag for us: the naive Softmax is a notorious bottleneck in LLMs, and many fused kernels such as FlashAttention have emerged to address its memory-bound nature. Given this, the SM efficiency statistic likely pointed to inefficiencies in our model's execution.
3. But what does SM efficiency mean?
SM efficiency (also known as SM activity) is a metric that describes what percentage of an Nvidia GPU's SMs are active during a given time interval. As mentioned before, an SM can be thought of as the foreman for a group of CUDA cores. For example, the Nvidia H100 GPU has 132 SMs with 128 cores each, for a total of 16,896 cores. By measuring SM efficiency, we can determine whether our CUDA kernels are actually using the streaming multiprocessors available to them. For example, a CUDA kernel that runs continuously for 10 seconds but uses only 1 SM will register 100% GPU utilization on an H100, yet its SM efficiency will be 1/132 ≈ 0.7%.
Great, that's exactly what we were looking for! We can monitor SM efficiency layer by layer to identify the low-hanging fruit for optimization.
4. Optimization
Now that we can easily identify which kernels underutilize the GPU, we can start optimizing those layers. Since this is a transformer stack, most of the gains come from fusing the layers in the transformer block definition. The following chart summarizes what we optimized.
Fusion here means that instead of using a set of layers defined natively in PyTorch, we replace them with a single GPU kernel, written in CUDA or Triton, that combines all of those layers into one. The speedup comes from the fused kernel spending less time reading and writing GPU memory relative to its math, compared with memory-bound layers such as Softmax. Flash Attention is one example of such a fused kernel. Other operations worth fusing include the MLP, dropout, layer normalization, and the residual-add.
Did we write these kernels ourselves? No. Most of them are already implemented in libraries like Flash Attention, which provide nn.Module implementations of the relevant layers so that you don't have to write a kernel from scratch with torch.autograd.Function. These implementations are also usually hardware-optimized, so they are not only faster but use less memory as well.
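As a hedged sketch of what such a replacement looks like, here is a naive attention implementation next to a fused drop-in. The fused call shown is PyTorch's scaled_dot_product_attention, which can dispatch to a FlashAttention-style kernel on supported hardware; the flash-attn library's own nn.Module wrappers can be swapped in the same way. Shapes and names are illustrative:

```python
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # Materializes the full (seq, seq) score matrix in GPU memory and runs
    # softmax as a separate kernel, so it is dominated by memory traffic.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

def fused_attention(q, k, v):
    # One fused kernel; the intermediate score matrix never hits HBM.
    return F.scaled_dot_product_attention(q, k, v)

# (batch, heads, seq, head_dim), bf16 so Tensor Cores are used.
q = torch.randn(4, 16, 2048, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)
out = fused_attention(q, k, v)
```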
The biggest challenge is figuring out where in your code the appropriate layers need to be replaced. Although torch.compile tries to do this automatically, as of this writing it is not compatible with newer distribution strategies such as FSDP (https://dev-discuss.pytorch.org/t/torch-compile-fsdp-dec-8th/1718), and in practice it does not deliver much of the promised speedup because of graph breaks. Hopefully the torch compiler will eventually handle this for us automatically, but for now we still have to add the fused implementations by hand.
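For comparison, the automatic route mentioned above is a one-line change; whether it actually helps depends on graph breaks and on compatibility with the distribution strategy in use (a hypothetical `model` and `batch` are assumed here):

```python
import torch

compiled_model = torch.compile(model)   # default "inductor" backend
loss = compiled_model(batch)            # the first call triggers compilation
```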
In terms of results, we achieved a 4x speedup in training time, and the customer's MFU rose from the initial 20% to 38%. Most of the gains came from these fused kernels and from finding the right "level" of model parallelism for their model size and the available 3.2 Tbps InfiniBand bandwidth.
5. Conclusion
We strongly recommend that AI teams track SM efficiency on their GPU clusters in addition to GPU utilization. It gives a much truer picture of how much performance you are squeezing out of your GPUs, whereas GPU utilization only tells you whether the machine is doing anything at all. Calculating MFU is great too, of course, but it is not a metric you can monitor continuously. Meanwhile, Nvidia's DCGM (Data Center GPU Manager) exposes SM activity by default.
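As a hedged sketch of what continuous monitoring can look like, the snippet below shells out to DCGM's dcgmi tool (assuming DCGM is installed on the node). Fields 1002 and 1003 are DCGM's profiling fields for SM activity and SM occupancy in current releases; `dcgmi dmon -l` lists the fields available on a given install:

```python
import subprocess

# Stream one sample per second for every GPU on the node; each row reports
# the requested profiling fields (SM activity, SM occupancy).
subprocess.run(
    ["dcgmi", "dmon", "-e", "1002,1003", "-d", "1000"],
    check=True,
)
```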
There are also more granular metrics, such as SM occupancy (reported as "Achieved Occupancy" in the PyTorch Profiler), which tells you how much work each SM is doing. Interpreting these metrics, however, is more involved than simply trying to push SM efficiency as high as possible. If you want to learn more, I recommend the PyTorch Profiler blog, the DCGM documentation, Nsight's Kernel Profiling Guide, and the rest of the Nsight documentation.
Good luck squeezing the most out of your GPU!
-END-