
7B model quantization completed in 50 seconds, 4-bit reaches new SOTA: new tricks for low-bit quantization of large models | NeurIPS 2024 Oral

Latest update: 2024-11-07
DuQuant team contribution
Quantum Bit | Public Account QbitAI

Eliminating outliers, a new method for low-bit quantization of large language models

A recent paper by a team from the Institute of Automation, Tsinghua University, and City University of Hong Kong has been accepted to NeurIPS 2024 as an Oral Presentation. They propose two orthogonal transformations for LLM weight-activation quantization that effectively suppress the outlier phenomenon and achieve a new 4-bit SOTA.

Simply put, in a large language model (LLM), some of the values output by the intermediate layers (activation values) become very large. These are called "outliers", and they pose challenges for model quantization.

Quantization converts the numerical values in the model from floating-point numbers to integers, reducing the model's size and computational requirements.

When a large number of outliers are present during quantization, the performance of the quantized model deteriorates.

Now that we understand this, let's take a look at a new study by their team called DuQuant.

First, they found that there were obvious Massive Outliers (very large activation values) in the down_proj layer of the LLM feedforward network (FFN) module.

This type of outlier differs from the Normal Outliers observed in previous work: their absolute values exceed several hundred and they are confined to individual tokens.

They cause existing quantization algorithms (such as SmoothQuant and OmniQuant) to perform poorly when quantizing the model's weights and activations to 4 bits.

In response, the team proposed a new quantization method called DuQuant.

DuQuant transfers outliers to other channels within the Activation matrix by learning rotation and permutation transformation matrices, and finally obtains a smooth activation matrix, which greatly reduces the difficulty of quantization.

Experiments show that using the DuQuant method, the model reaches SOTA under the setting of 4-bit weight and activation quantization .

At the same time, DuQuant's training is very fast: quantization of a 7B model completes within 50 seconds, and the method is plug-and-play.

Background

In each Transformer block, the common modules, Multi-head Self-Attention (MSA) and Feed-forward Network (FFN), are basically composed of linear layers, which can be expressed as:

Y = X · W

where X is the activation input and W is the weight matrix.

Model quantization reduces memory usage by converting model weights or activations represented in floating point (FP16, BF16) into low-bit floating-point numbers or integers. Weight-activation quantization can further speed up model inference by using low-bit multiplication kernels.

This work focuses on low-bit integer quantization, aiming for better hardware support.

Specifically, the b-bit quantization process maps an FP16 tensor X to low-bit integers X_q:

X_q = clamp( ⌊X / Δ⌉ + z, 0, 2^b − 1 ),  with Δ = (max(X) − min(X)) / (2^b − 1) and z = −⌊min(X) / Δ⌉

where ⌊ · ⌉ denotes rounding to the nearest integer, Δ is the quantization step size, and z is the zero point.

Following mainstream quantization methods, the authors adopt per-token quantization for the activation X and per-channel quantization for the weight W, i.e., a separate step size Δ_X ∈ ℝ^{T×1} is assigned to each token of X and a separate step size is assigned to each output channel of W.
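To make the formulas above concrete, here is a minimal NumPy sketch of asymmetric b-bit quantization with one step size per token; the per-channel case for weights is analogous, computed along the output-channel axis. The function name and the epsilon guard are illustrative, not from the paper.

```python
import numpy as np

def quantize_per_token(X, n_bits=4):
    """Asymmetric b-bit quantization with one step size per token (row).
    Returns the integer codes and the dequantized tensor."""
    qmax = 2 ** n_bits - 1
    x_min = X.min(axis=1, keepdims=True)
    x_max = X.max(axis=1, keepdims=True)
    delta = np.maximum((x_max - x_min) / qmax, 1e-8)   # per-token step size
    zero = np.round(-x_min / delta)                    # per-token zero point
    Xq = np.clip(np.round(X / delta) + zero, 0, qmax)  # low-bit integer codes
    X_dq = (Xq - zero) * delta                         # values the model actually sees
    return Xq.astype(np.int32), X_dq
```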

Motivation

According to the authors, their experiments were the first to find that the down_proj layer of the LLM FFN module exhibits obvious Massive Outliers (extremely large activation values), whose absolute values exceed several hundred and which are confined to individual tokens.

PS: Previous work found that Massive Outliers exist in the output of each transformer block, and the authors of DuQuant further located them in the FFN module.

Paper: https://eric-mingjie.github.io/massive-activations/index.html

These Massive Outliers cause algorithms such as SmoothQuant and OmniQuant to perform poorly in 4-bit WA quantization.

Figure 1: Massive Outliers significantly increase the difficulty of low-bit weight-activation quantization

Figure 1(a)(b) compares the commonly seen Normal Outliers with the Massive Outliers that appear in the FFN.

SmoothQuant attempts to transfer the quantization difficulty from activations to weights by dividing the activations by a per-channel smoothing factor and multiplying it back into the weight matrix.

Specifically, SmoothQuant uses a per-channel smoothing diagonal matrix Λ to reformulate the original linear layer as Y = X · W = (X · Λ⁻¹)(Λ · W), and the j-th diagonal element Λ_j is calculated as:

Λ_j = max(|X_j|)^α / max(|W_j|)^(1−α)

where α is a hyperparameter representing the migration strength.
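As a reference point, the smoothing step described above can be sketched in a few lines of NumPy; this follows the published SmoothQuant recipe rather than DuQuant, and the function name is illustrative.

```python
import numpy as np

def smoothquant_smooth(X, W, alpha=0.5):
    """Per-channel smoothing: divide activations by lambda_j and fold the
    scale into the weights, leaving the product X @ W unchanged.
    X: (tokens, C_in), W: (C_in, C_out)."""
    lam = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)
    X_smooth = X / lam                # activations become flatter, easier to quantize
    W_smooth = W * lam[:, None]       # weights absorb the scale (and may gain outliers)
    return X_smooth, W_smooth, lam
```

When a channel contains a Massive Outlier, max(|X_j|) and hence Λ_j explode, which is exactly the failure mode discussed next.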

However, the authors observed that performing such a shift on the input side may cause the weight matrix itself to develop obvious outliers that are hard to quantize (as shown in Figure 1(d)). The root cause is that Massive Outliers make the smoothing factor Λ_j abnormally large.

In addition, extremely large outliers can also cause gradient explosion in optimization-based methods, so gradient-optimized methods such as OmniQuant and AffineQuant simply skip the down_proj layer and fall back to SmoothQuant's handling for it.

These preliminary experiments all indicate that a new and better way is needed to handle both types of outliers, especially to smooth out the Massive Outliers at the down_proj input.

Method

DuQuant proposes to transfer outliers to other channels within the activation matrix by learning rotation and permutation transformation matrices, finally obtaining a smooth activation matrix and thereby greatly reducing the difficulty of quantization.

Figure 2: Overview of the DuQuant algorithm. (a) shows step by step how DuQuant handles Normal Outliers; (b) DuQuant significantly reduces Massive Outliers; (c) a toy example shows that DuQuant effectively reduces the difficulty of quantizing the activation matrix.

In simple terms, the DuQuant algorithm consists of three steps:

1) Rotation matrix construction makes effective use of the positions (indices) of specific outlier channels. The authors use a block-diagonal rotation matrix and, within each block, spread the outliers to other channels via a greedy algorithm.

2) Due to the limited block size, some blocks may still have a larger average magnitude than others after rotation. The authors therefore further use a channel permutation technique to redistribute activation channels, using a zigzag order to significantly reduce the variance of the per-block means.

3) A further rotation transformation is then applied to achieve a more uniform activation distribution, which greatly reduces the difficulty of quantization.

Rotation matrix: The authors want to apply a rotation matrix R to transform the rows or columns so as to reduce the impact of both Normal Outliers and Massive Outliers.

Since Massive Outliers are usually randomly distributed in the activation space, it is challenging to directly find the optimal rotation matrix that alleviates the outliers with a single rotation transformation.

To solve this problem, the authors use a greedy search with prior knowledge to calculate an approximate rotation matrix R̂.

Specifically, the calculation of R̂ includes the following steps:

1. Identify the feature dimension where outliers are mainly concentrated: d^(1) = argmax_j ( max_i |X_ij| ), where X_ij denotes the element in the i-th row and j-th column of X.

2. Based on the searched dimension d^(1), construct the rotation matrix R^(1) from two components: a swap matrix E_{1,d^(1)} that exchanges the 1st and d^(1)-th columns of the activations, and an orthogonally initialized rotation matrix R̃ whose first row is uniform.

The purpose of this is to mitigate outliers in column 1 after the transformation.

To further increase randomness, the first column (whose outliers have just been mitigated) is kept, and the other columns are randomly rotated by multiplying them with a random orthogonal matrix R'.

3. Let N be the number of greedy search steps; the approximate rotation matrix is then R̂ = R^(1) R^(2) ⋯ R^(N), where each R^(i) is constructed as described above (formula (2) in the paper) with its identified feature dimension d^(i).

This construction ensures that the approximate optimal rotation matrix can effectively mitigate outliers with larger magnitudes, rather than just using a randomly selected orthogonal rotation matrix.
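The following NumPy sketch illustrates the flavor of this greedy construction applied to a slice of channels (matching the block-diagonal form introduced just below). The helper names, the Householder-based way of building the uniform-first-row rotation, and the QR-based random orthogonal factor are illustrative assumptions; the paper's exact formulation may differ in details.

```python
import numpy as np

def uniform_first_row_orthogonal(n):
    # Householder reflection sending e1 to the uniform unit vector; the result
    # is a symmetric orthogonal matrix whose first row is uniform (1/sqrt(n)).
    u = np.full(n, 1.0 / np.sqrt(n))
    v = np.eye(n)[0] - u
    v /= np.linalg.norm(v)
    return np.eye(n) - 2.0 * np.outer(v, v)

def greedy_rotation(X, n_steps=4, seed=0):
    """Greedily build an orthogonal matrix: at each step, find the column holding
    the largest outlier, swap it to position 0, spread it over all columns via the
    uniform-first-row rotation, then randomly rotate the remaining columns."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    R = np.eye(n)
    for _ in range(n_steps):
        d = int(np.abs(X @ R).max(axis=0).argmax())  # current outlier column
        E = np.eye(n)
        E[[0, d]] = E[[d, 0]]                        # swap matrix for columns 0 and d
        Q = uniform_first_row_orthogonal(n)
        Q_rest = np.eye(n)
        q, _ = np.linalg.qr(rng.normal(size=(n - 1, n - 1)))
        Q_rest[1:, 1:] = q                           # random rotation of the other columns
        R = R @ E @ Q @ Q_rest                       # accumulate R^(1) ... R^(N)
    return R
```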

However, directly constructing the entire rotation matrix is very time-consuming and incurs a large memory overhead.

To enable fast matrix multiplication, and following Training Transformers with 4-bit Integers, the authors choose to approximate the rotation matrix in a block-diagonal manner:

R̂ = BlockDiag(R̂_{b_1}, …, R̂_{b_K}), where R̂_{b_i} denotes the square rotation matrix of the i-th block, constructed according to the three steps above. The number of blocks is K = C_in / 2^n, where 2^n is the block size.
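Under this block-diagonal layout, the transform can be applied without materializing the full C × C matrix; a minimal sketch (assuming the channel axis is split into contiguous equal-size blocks, each rotation produced e.g. by the greedy_rotation sketch above) is:

```python
import numpy as np

def apply_block_rotation(X, block_rotations):
    """X: (tokens, C_in); block_rotations: list of K (block, block) orthogonal
    matrices. Rotates each contiguous group of `block` channels independently."""
    block = block_rotations[0].shape[0]
    T, C = X.shape
    K = C // block
    Xb = X.reshape(T, K, block)
    out = np.einsum('tkb,kbc->tkc', Xb, np.stack(block_rotations))
    return out.reshape(T, C)
```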

Channel permutation matrix: Although the block-diagonal rotation matrix improves time and storage efficiency, its focus on local (within-block) information limits how much further the outliers can be reduced.

Since the rotation transformation performed within each small block cannot integrate information across different blocks, one block may end up with relatively large outliers while another has smaller ones, resulting in high variance between blocks.

Therefore, the authors propose to use the channel permutation matrix to balance the magnitude of outliers between different blocks.

Specifically, within each block, the largest outlier along dimension d_j is recorded as O_{d_j}.

Meanwhile, M_{b_i} denotes the average of all O_{d_j} in the i-th block, where i = 1, 2, …, K, and the variance of activation magnitudes across blocks can be expressed as:

Var = (1/K) · Σ_{i=1}^{K} (M_{b_i} − M̄)², where M̄ is the mean of M_{b_1}, …, M_{b_K}.
The authors introduced the zigzag permutation matrix P.

Specifically, a zigzag sequence is generated: the channel with the highest activation is assigned to the first block, and the channels with the next-highest activations are then assigned to the subsequent blocks in turn, until the K-th block is reached.

After reaching the last block, the order is reversed and assignment proceeds back toward the first block, starting with the channel with the next-highest activation.

This back-and-forth pattern runs through all the blocks, ensuring that no single block consistently receives the channels with the highest or lowest activations.

By using zigzag permutation, DuQuant achieves a balanced distribution of outliers between different blocks, which enables the use of additional rotation transformations to further smooth outliers, as shown in Figure 2.
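A simple NumPy sketch of the zigzag assignment is given below; the function name and the exact handling at the two ends of the sequence (where the direction reverses) are illustrative assumptions.

```python
import numpy as np

def zigzag_permutation(channel_peaks, n_blocks):
    """Rank channels by peak activation magnitude (largest first) and deal them
    to blocks in a zigzag order 0,1,...,K-1,K-1,...,1,0,0,1,... so that no block
    keeps receiving the largest outliers. Returns the new channel ordering."""
    order = np.argsort(-np.asarray(channel_peaks))   # channels, largest peak first
    pattern = list(range(n_blocks)) + list(range(n_blocks - 1, -1, -1))
    buckets = [[] for _ in range(n_blocks)]
    for rank, ch in enumerate(order):
        buckets[pattern[rank % len(pattern)]].append(int(ch))
    return np.concatenate([np.asarray(b, dtype=int) for b in buckets])
```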

It should be noted that :

1. Channel permutation is in fact a very important step, and it is also simple and fast (it has little impact on inference efficiency, as shown in the experiment section below). It avoids a complex training process like SpinQuant's, and it also performs better than QuaRot's Hadamard rotation.

2. Both the rotation matrix and the permutation matrix are orthogonal, which guarantees that the layer output is unchanged. The authors also prove through rigorous theoretical derivation that the two transformations effectively reduce quantization error; see the Appendix of the paper for the proof.
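The invariance claim in point 2 is easy to check numerically: rotations and permutations are orthogonal, so the combined transform T satisfies T⁻¹ = Tᵀ, and folding T into the activations and Tᵀ into the weights leaves the full-precision output unchanged. A toy check with stand-in random matrices (not the learned ones) looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)
C = 8
X = rng.normal(size=(4, C))                      # toy activations
W = rng.normal(size=(C, C))                      # toy weights
R1, _ = np.linalg.qr(rng.normal(size=(C, C)))    # stand-in first rotation (orthogonal)
R2, _ = np.linalg.qr(rng.normal(size=(C, C)))    # stand-in second rotation
P = np.eye(C)[rng.permutation(C)]                # permutation matrix
T = R1 @ P @ R2                                  # combined orthogonal transform
assert np.allclose((X @ T) @ (T.T @ W), X @ W)   # output preserved before quantization
```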

Experiments

DuQuant achieves SOTA results under the 4-bit setting, validated on the LLaMA, Vicuna, and Mistral model families, and significantly improves the performance of quantized models on PPL, QA, MMLU, and MT-Bench evaluations.

In addition, the authors also evaluated the quantized models' long-text generation ability on LongBench, where DuQuant again significantly exceeded the baselines.

DuQuant significantly outperforms the baseline method in low-bit quantization of LLaMA3-8B

The above shows DuQuant's quantization results on the LLaMA3 model; for results on more models and tasks, please refer to the paper.

Hardware speed tests also show that DuQuant achieves a 2.08x speedup in the pre-filling stage and reduces memory overhead by 3.50x in the decoding stage.

At the same time, as shown in the figure on the right, DuQuant adds only about 10% speed overhead compared to plain INT4 inference, slightly more than QuaRot, but delivers larger performance gains.

In addition, DuQuant differs from QuaRot, which uses the Hadamard rotation matrix, in the following two main aspects :

1. The rotation matrix constructed by DuQuant uses prior knowledge (the indices of specific outlier channels), so it smooths the activation space better than QuaRot. The figure below shows the effect of a single DuQuant rotation versus a Hadamard rotation on the LLaMA2-7B attention Key_proj input.

2. QuaRot relies on the time-consuming GPTQ algorithm to improve performance, while the channel permutation matrix introduced by the author can help DuQuant further balance the distribution of outliers in a very short time. The two orthogonal transformations can simultaneously smooth the space of the weight matrix and reduce the difficulty of quantizing the weight matrix, thereby achieving better results.


In summary, DuQuant uses two orthogonal transformations together with prior knowledge of the activations to achieve better quantization results than the Hadamard rotation in QuaRot.

This work received consistently high praise from the reviewers and was eventually selected as an Oral Presentation, with an acceptance rate of 0.4%.

For more details, please refer to the original paper.

Project homepage: https://duquant.github.io
Paper: https://arxiv.org/abs/2406.01721
Code: https://github.com/Hsu1023/DuQuant

-over-

