A 2080 Ti can run 70B models: new framework speeds up LLM inference by 11x
Submission from IPADS Laboratory of Shanghai Jiao Tong University
Qubits | Public account QbitAI
Work that used to require an 80GB A100 costing 160,000 yuan can now be done on a 24GB RTX 4090 costing less than 20,000 yuan!
PowerInfer, an open-source inference framework released by the IPADS Laboratory of Shanghai Jiao Tong University, speeds up large-model inference by up to 11 times.
Without quantization, at plain FP16 precision, a 40B model can run on a personal computer; with quantization, even a 2080 Ti can run a 70B model smoothly.
By exploiting characteristics unique to large models and combining CPU and GPU computation, PowerInfer achieves fast inference on personal computers with limited GPU memory.
Compared with llama.cpp, PowerInfer achieves up to 11x acceleration, allowing a 40B model to output ten tokens per second on a personal computer.
ChatGPT, the model most of us are familiar with, sometimes goes down under heavy traffic, and it also raises data security concerns.
Open-source models address both problems, but without a high-performance graphics card, their running speed is often painfully slow:
The emergence of PowerInfer just solves this pain point.
PowerInfer drew an enthusiastic response as soon as it was released, picking up 500+ GitHub stars in less than 24 hours, including one from Georgi Gerganov, the author of llama.cpp.
The source code and paper of PowerInfer are now public. Let's take a look at just how strong its acceleration is.
Inference up to 11x faster
On consumer-grade hardware with an x86 CPU and an NVIDIA GPU, the team measured PowerInfer's end-to-end inference speed on a series of LLMs ranging from 7B to 175B parameters, and compared it against llama.cpp, the best-performing inference framework on the same platform.
For FP16-precision models, PowerInfer achieved an average speedup of 7.23x on a high-end PC (PC-High) equipped with a 13th-generation Intel Core i9 and a single RTX 4090, with the largest speedup, 11.69x, on Falcon-40B.
Across all test cases, PowerInfer averaged 8.32 tokens/s, peaking at 16.06 tokens/s on OPT-30B and 12.94 tokens/s on Falcon-40B.
With PowerInfer, today's consumer-grade platforms can run 30-40B-parameter LLMs smoothly and 70B-parameter LLMs at an acceptable speed.
△ PowerInfer's average token generation speed across models and output lengths. The vertical axis is the speedup ratio; the number on each bar is the tokens generated per second.
Model quantization is a common technique for on-device LLM inference, and PowerInfer also supports inference with INT4-quantized models.
The team tested the inference speed of a series of INT4-quantized models on the high-end PC (PC-High) and on a mid-to-low-end PC (PC-Low) equipped with a single RTX 2080 Ti.
On PC-High, PowerInfer can run 40-70B-parameter models at high speed, reaching a maximum inference speed of 29.09 tokens/s, with an average speedup of 2.89x and a maximum of 4.28x.
It can even run a model as large as OPT-175B on consumer-grade hardware.
On a mid-to-low-end PC such as PC-Low, PowerInfer can smoothly run 30-70B-parameter models, with an average speedup of 5.01x and a maximum of 8.00x, mainly because after INT4 quantization most of the model's hot neurons fit in GPU memory.
△ PowerInfer's inference speed on INT4-quantized models. The vertical axis is the speedup ratio; the number on each bar is the tokens generated per second.
Finally, the team compared the end-to-end inference speed of PowerInfer on PC-High against the state-of-the-art framework vLLM running on a top-tier A100 cloud accelerator. The test models were OPT-30B and Falcon-40B (ReLU) at FP16 precision.
With an input length of 64, the speed gap between PowerInfer and the A100 shrinks from 93%-94% to 28%-29%; in a pure-generation scenario with an input length of 1, the gap narrows further, to as little as 18%.
This means that, with sparse activation and CPU/GPU hybrid inference, PowerInfer has largely closed the inference-speed gap between consumer-grade graphics cards and top server-grade accelerators.
△ Performance comparison of PowerInfer on 4090 and vLLM on A100
So, how does PowerInfer achieve high-speed inference on consumer-grade hardware?
Take full advantage of model and hardware features
PowerInfer's secret to high-speed inference is exploiting the high locality of sparse activation in dense models and matching it to the respective computing strengths of the CPU and GPU.
What is "sparse activation"?
Recently, the Mixtral MoE model has taken the AI community by storm, bringing sparse models back into everyone's field of view.
An interesting fact is that LLMs usually regarded as dense models, such as OPT and LLaMA (ReLU), also exhibit sparse activation.
What is sparse activation of dense models?
Just as an input token in an MoE model only needs to activate one or two expert modules in the FFN layer, in the dense FFN layers of the OPT model only a small fraction of neurons (experiments show about 10%) need to be activated to guarantee correct output.
The other neurons take part in the computation but contribute little to the output.
In other words, every neuron in a dense model is an expert!
△ The figure on the left is from the Switch Transformers paper (arXiv: 2101.03961)
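To make "sparse activation" concrete, here is a minimal sketch (not PowerInfer's code; the layer sizes are illustrative) of how one can measure, for a single token, what fraction of a ReLU FFN layer's intermediate neurons actually fire. With random weights the active fraction is around half; profiling trained ReLU models such as OPT is what yields the roughly 10% figure quoted above.

```python
import torch
import torch.nn as nn

hidden_dim, ffn_dim = 4096, 16384   # illustrative sizes, not any specific model

class ReluFFN(nn.Module):
    def __init__(self):
        super().__init__()
        self.up = nn.Linear(hidden_dim, ffn_dim)
        self.down = nn.Linear(ffn_dim, hidden_dim)

    def forward(self, x):
        act = torch.relu(self.up(x))      # ReLU zeroes out the inactive neurons
        return self.down(act), act

ffn = ReluFFN()
x = torch.randn(1, hidden_dim)
with torch.no_grad():
    _, act = ffn(x)

# With random weights this is roughly 50%; in trained ReLU LLMs the measured
# per-token active fraction is reported to be around 10%.
active_fraction = (act > 0).float().mean().item()
print(f"fraction of active neurons: {active_fraction:.1%}")
```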
An MoE model uses a routing module before the expert FFN layers to dispatch each input to one or two experts. But in a dense model, how can sparse activation be routed; that is, how can we know before the computation which "expert" neurons will contribute to the result?
The answer is to add a routing predictor module to the dense model.
Before the model goes into service, PowerInfer first analyzes it offline: it runs inference on a general dataset to record the correspondence between each layer's inputs and the neurons they activate, and then trains a small predictive routing module for each layer of the dense model. At inference time, this module predicts which neurons each input will activate, and only those (expert) neurons are computed.
In tests on multiple downstream tasks, PowerInfer's routing module introduced almost no additional accuracy loss.
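To illustrate the idea, the following is a minimal sketch of such a per-layer activation predictor; the low-rank MLP shape, the training loop, and the placeholder profiling data are assumptions for illustration rather than PowerInfer's actual design.

```python
import torch
import torch.nn as nn

hidden_dim, ffn_dim, rank = 4096, 16384, 256   # "rank" bottleneck keeps the predictor cheap

class ActivationPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        # A small low-rank two-layer MLP, far cheaper than the FFN it guards.
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, rank),
            nn.ReLU(),
            nn.Linear(rank, ffn_dim),
        )

    def forward(self, x):
        # One logit per FFN neuron; > 0 means "predicted to activate".
        return self.net(x)

predictor = ActivationPredictor()

# Offline training target: for each recorded layer input, a 0/1 mask of which
# neurons actually fired (random placeholders stand in for profiling data here).
inputs = torch.randn(1024, hidden_dim)
true_mask = (torch.rand(1024, ffn_dim) < 0.1).float()   # ~10% active, as observed empirically

loss_fn = nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)
for _ in range(10):                                     # a few illustrative steps
    opt.zero_grad()
    loss = loss_fn(predictor(inputs), true_mask)
    loss.backward()
    opt.step()

# Online use: only the neurons the predictor selects need to be computed.
x = torch.randn(1, hidden_dim)
predicted_active = predictor(x) > 0
```

In practice such a predictor only pays off if running it costs far less than the FFN computation it lets the system skip, which is why it is kept deliberately small.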
Inference locality brought about by sparse activation
Another interesting fact about sparse activation: although different input tokens activate different sets of neurons, if inference is run over enough data and the activation distributions are aggregated, PowerInfer finds that a small number of neurons have a much higher overall probability of being activated.
In other words, statistically, neuron activation in large models follows a power-law distribution (a statistical pattern in which a small number of events occur far more frequently than the many others).
As shown in figure (a) below, for one FFN layer of the OPT-30B and LLaMA(ReGLU)-70B models, 26% and 43% of the neurons, respectively, account for 80% of the activations.
At the scale of the whole model, as shown in figure (b), 17% and 26% of the neurons account for 80% of the activations.
Therefore, when only the computations that contribute to the final output are considered, LLM inference exhibits locality: weight accesses tend to concentrate on a subset of neurons rather than being spread evenly across all of them.
In the inference computation, this shows up as program locality: memory accesses concentrate in certain regions rather than being distributed evenly across the whole memory space.
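A minimal sketch of this aggregation, using synthetic heavy-tailed activation counts in place of real profiling data, shows how a statistic like "x% of neurons account for 80% of activations" can be computed:

```python
import numpy as np

ffn_dim = 16384
rng = np.random.default_rng(0)

# Synthetic per-neuron activation counts drawn from a heavy-tailed (Pareto)
# distribution, mimicking the power-law behaviour reported for real models.
activation_counts = rng.pareto(a=1.5, size=ffn_dim) + 1.0

order = np.argsort(activation_counts)[::-1]                 # hottest neurons first
cumulative = np.cumsum(activation_counts[order]) / activation_counts.sum()

# Smallest prefix of neurons that covers 80% of all observed activations.
n_hot = int(np.searchsorted(cumulative, 0.8)) + 1
hot_neurons = order[:n_hot]
print(f"{n_hot / ffn_dim:.1%} of neurons account for 80% of activations")
```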
In a typical personal computer, the GPU has less memory but stronger compute, which suits frequently accessed, compute-intensive work; the CPU has a larger memory capacity but relatively weak compute, which suits infrequently accessed, lighter work.
Ideally, therefore, the small set of frequently accessed neurons should live in GPU memory, while the larger set of rarely accessed neurons is better kept in main memory and computed by the CPU.
This inspired PowerInfer to design a CPU/GPU hybrid inference system based on locality.
CPU/GPU hybrid inference design
Based on the power-law behaviour of neuron activation and the locality it produces, PowerInfer statically analyzes the hot/cold status of every neuron in advance, loads the small number of hot neurons into GPU memory, and keeps the remaining cold neurons in CPU memory.
Loading the model at neuron granularity means that, within a single layer, some neurons live on the GPU and others on the CPU.
To this end, PowerInfer designed a fine-grained CPU/GPU hybrid inference engine.
For example, in the figure below, for the input to a certain layer, PowerInfer first predicts that the input will activate neurons 3, 4, and 5.
The CPU and GPU then each compute the activated neurons that reside in their own memory, guided by this prediction.
Specifically, in this example, neuron 4 is computed on the CPU, neurons 3 and 5 are computed on the GPU, and the two partial results are then merged on the GPU.
△ PowerInfer hybrid computing method
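The sketch below illustrates this neuron-granularity split for one simplified FFN layer. The shapes, the 20% hot ratio, the bias-free up/down projections, and the placeholder predictor output are assumptions for illustration only, not PowerInfer's implementation.

```python
import torch

hidden_dim, ffn_dim = 4096, 16384
device_gpu = "cuda" if torch.cuda.is_available() else "cpu"   # falls back on CPU-only machines

# Row i of W_up together with column i of W_down forms FFN "neuron" i.
W_up = torch.randn(ffn_dim, hidden_dim)
W_down = torch.randn(hidden_dim, ffn_dim)

hot = torch.rand(ffn_dim) < 0.2                 # placeholder hot/cold split from offline profiling
W_up_gpu, W_down_gpu = W_up[hot].to(device_gpu), W_down[:, hot].to(device_gpu)
W_up_cpu, W_down_cpu = W_up[~hot], W_down[:, ~hot]

def hybrid_ffn(x_cpu, active_mask):
    """x_cpu: (hidden_dim,) layer input; active_mask: (ffn_dim,) bool from the predictor."""
    # GPU side: hot neurons that the predictor marked active.
    gpu_sel = active_mask[hot]
    x_gpu = x_cpu.to(device_gpu)
    act_gpu = torch.relu(W_up_gpu[gpu_sel] @ x_gpu)
    out_gpu = W_down_gpu[:, gpu_sel] @ act_gpu

    # CPU side: cold neurons that the predictor marked active.
    cpu_sel = active_mask[~hot]
    act_cpu = torch.relu(W_up_cpu[cpu_sel] @ x_cpu)
    out_cpu = W_down_cpu[:, cpu_sel] @ act_cpu

    # Merge the two partial sums on the GPU.
    return out_gpu + out_cpu.to(device_gpu)

x = torch.randn(hidden_dim)
predicted_active = torch.rand(ffn_dim) < 0.1    # stand-in for the routing predictor's output
y = hybrid_ffn(x, predicted_active)
```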
The overall architecture of PowerInfer
Overall, PowerInfer builds an innovative CPU/GPU hybrid inference engine on top of the sparse activation of dense models and the locality it brings.
When a large language model (LLM) is first brought in, PowerInfer trains the model's predictive routing modules in an offline stage and analyzes the model's activation characteristics in depth.
At the same time, it computes an optimal neuron placement plan from key properties of the target hardware, such as memory bandwidth and capacity.
On this basis, PowerInfer distributes neurons between main memory and GPU memory according to that plan.
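As a rough illustration of this placement step, the sketch below ranks neurons by profiled activation frequency and greedily fills an assumed GPU memory budget with the hottest ones; the budget, per-neuron size, and synthetic counts are placeholders, and the real policy also has to weigh hardware bandwidth.

```python
import numpy as np

ffn_dim = 16384
bytes_per_neuron = 2 * 4096 * 2            # one up row + one down column of 4096 FP16 values each
gpu_budget_bytes = 96 * 1024 * 1024        # assumed budget left after other weights and KV cache

rng = np.random.default_rng(0)
activation_counts = rng.pareto(a=1.5, size=ffn_dim) + 1.0   # synthetic profiling counts

order = np.argsort(activation_counts)[::-1]                 # hottest first
max_gpu_neurons = gpu_budget_bytes // bytes_per_neuron

gpu_neurons = set(order[:max_gpu_neurons].tolist())         # hot: kept in GPU memory
cpu_neurons = set(order[max_gpu_neurons:].tolist())         # cold: kept in CPU memory
print(f"{len(gpu_neurons)} neurons on GPU, {len(cpu_neurons)} on CPU")
```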
During online inference, the CPU and GPU each process the neurons stored in their own memory, and these independently computed partial results are then merged efficiently on the GPU.
△ PowerInfer overall architecture diagram
Summary and Outlook
For end users, PowerInfer's efficient inference framework opens up new possibilities.
First, it enables PC users to run advanced large-scale language models locally without the need for expensive specialized hardware.
This not only promotes the popularization of artificial intelligence applications, but also provides unprecedented opportunities for enthusiasts, researchers, and small businesses.
In terms of cloud deployment, PowerInfer also has huge potential.
Today's cloud CPUs also come with powerful AMX compute units. By exploiting the heterogeneity between CPUs and GPUs, PowerInfer can be expected to deliver higher serving throughput with fewer high-end accelerator cards.
Paper address:
https://ipads.se.sjtu.edu.cn/_media/publications/powerinfer-20231219.pdf
-over-