Building professional large models for a thousand yuan: system optimization plus open-source large models is the key | Luchen Technology's Bian Zhengda @ MEET2024
Edited by the editorial department from MEET2024
Qubits | Public account QbitAI
In the first year of large models, even the base-model vendors at the forefront of the trend cannot escape compute anxiety.
On the one hand, the technical characteristics of large models themselves have multiplied the demand for computing power; on the other hand, the supply of computing power is tight, and "one card is hard to find" became a common complaint across the industry for a while.
But at the same time, trends wait for no one.
Therefore, making efficient use of existing computing resources has become the route many vendors choose, which in turn has made AI acceleration solutions and AI Infra hot topics in the industry.
So what trends have the players who specialize in acceleration solutions observed, and what solutions have they proposed? These are the critical questions.
For example, Bian Zhengda, CTO of Luchen Technology, mentioned:
A low-cost migration solution can use open-source models to quickly build professional large models for vertical domains.
Luchen Technology helps companies reduce the cost of implementing large models and improve training and inference efficiency by building a distributed AI development and deployment platform. The company completed four rounds of financing within 24 months of its founding, the most recent being a Series A+ round of nearly 100 million yuan.
To fully convey Bian Zhengda's thinking on large model acceleration, Qubit has edited the content of his speech without changing its original meaning. We hope it brings you new inspiration as well.
About the MEET Intelligent Future Conference: the MEET Conference is the top business summit in the field of intelligent technology hosted by Qubit, dedicated to exploring the implementation of cutting-edge technology and its industry applications. This year, dozens of mainstream media outlets and livestream platforms covered and broadcast MEET2024 live, attracting more than 3 million industry users to participate online, with total exposure across the web exceeding 20 million.
Key points of the speech
- The cost of training large models is high because of the huge amount of training data and the difficulty of deployment.
- The core goal of Colossal-AI is to help different users implement large AI model applications to the greatest extent while reducing costs and increasing efficiency.
- A low-cost migration solution can use open-source models to quickly build professional large models for vertical domains.
Use distributed algorithms to lower the threshold for large model implementation
Hello everyone, I am Bian Zhengda, CTO of Luchen Technology. I am very honored to be at this conference to discuss the challenges of large AI models and system optimization with you.
Our company has not been around for long and the team is relatively young. Under the leadership of Professor You Yang (Presidential Young Professor at the National University of Singapore) and Professor James Demmel of UC Berkeley, we launched Colossal-AI, a distributed optimization system for developing and deploying large models, with the goal of lowering the threshold and cost of implementing large AI models.
First, let me introduce some background on the large model era and our original motivation for developing the Colossal-AI system.
Looking back at the history of AI development: ResNet, the popular AI model of 2016, could be trained in just a few hours on a single graphics card, and later BERT took only a day or two to train.
But today we are all being bombarded with new large models, and their R&D costs are orders of magnitude higher.
For example, if Google's PaLM model were trained on a single A100 graphics card, it would take roughly 300 years and cost more than 9 million dollars.
The cost is so high because training a high-quality large model requires, first, a very large amount of training data, and second, a huge cluster of hundreds or thousands of graphics cards to deploy the training and inference systems, which is itself a very considerable expense.
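That 300-year figure is easy to sanity-check with a back-of-the-envelope calculation. The sketch below is not from the talk: it assumes the common "6 × parameters × tokens" estimate of training FLOPs, PaLM's published scale of roughly 540B parameters and 780B training tokens, and an A100's peak BF16 throughput of about 312 TFLOPS.

```python
# Back-of-the-envelope training-cost estimate (illustrative assumptions, see above).
params = 540e9                      # PaLM parameters
tokens = 780e9                      # PaLM training tokens
flops_needed = 6 * params * tokens  # ~2.5e24 FLOPs for one training run

a100_flops_per_s = 312e12           # peak BF16 throughput of a single A100
seconds = flops_needed / a100_flops_per_s
print(f"{seconds / (3600 * 24 * 365):.0f} years")   # ~257 years at 100% utilization
# Real utilization is well below peak, so "about 300 years" on one card
# is the right order of magnitude.
```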
Therefore, we launched the Colossal-AI system, hoping to use efficient distributed algorithms to reduce as much as possible both the development and deployment threshold of large AI models and their extremely high cost.
Our framework sits between upper-layer AI applications, such as those built on PyTorch, HuggingFace, and Lightning, and different underlying hardware such as GPUs, TPUs, and NPUs, helping users complete their deployments.
The core goal of Colossal-AI is to help different companies and users implement large-scale AI model applications to the greatest extent, while helping them reduce costs and increase efficiency.
The core technology includes three levels, namely:
- an efficient memory management system
- an N-dimensional parallelism management system
- a low-latency inference system
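For a sense of how these pieces present themselves to a user, here is a rough sketch of wrapping an ordinary PyTorch training setup with Colossal-AI. The names used (colossalai.launch_from_torch, Booster, GeminiPlugin, booster.backward) follow the project's public documentation and may differ between versions; the model and data here are placeholders.

```python
# Rough usage sketch, not an exact recipe; API names may vary by Colossal-AI version.
import torch
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin

colossalai.launch_from_torch()                     # set up the distributed environment
                                                   # (older versions take a config dict)

model = torch.nn.Linear(512, 512)                  # stand-in for a real transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = torch.nn.MSELoss()
data = [(torch.randn(8, 512), torch.randn(8, 512)) for _ in range(4)]  # dummy batches

# The plugin decides parallelization and memory placement
# (Gemini handles heterogeneous GPU/CPU memory management).
booster = Booster(plugin=GeminiPlugin())
model, optimizer, criterion, _, _ = booster.boost(model, optimizer, criterion)

for inputs, targets in data:
    loss = criterion(model(inputs.cuda()), targets.cuda())
    booster.backward(loss, optimizer)              # backward routed through the booster
    optimizer.step()
    optimizer.zero_grad()
```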
Colossal-AI already has a certain influence and recognition in the open-source community and academia. We launched on GitHub more than a year ago and have gained 35,000+ stars, and our core work has been accepted at top academic conferences such as NeurIPS, SC, and PPoPP.
Below I will introduce the core design ideas in detail and explain how Colossal-AI achieves cost reduction and efficiency improvement.
How to train and utilize memory space efficiently
The first is the N-dimensional parallel system.
Before the development of the Colossal-AI system, parallel technologies in various scenarios were already on the market, including tensor parallelism, pipeline parallelism, data parallelism, etc.
We found that when ordinary users face an actual need, it is difficult for them to choose a truly suitable parallel strategy and turn it into a working implementation. The core idea of our system is to integrate the most efficient parallel technologies into one system, and to use our long-term experience in system optimization to help different users choose appropriate parallel strategies while providing the most efficient implementations.
For example, in one-dimensional data parallelism, we successfully used the LARS and LAMB optimization techniques to expand the batch size to 34K and 64K.
You should know that in ordinary training the batch size does not exceed 8K; there is a generalization threshold, and if the batch size is too large, the final generalization will not be ideal.
By adjusting the learning rate layer by layer with optimizers such as LARS and LAMB, we can scale the batch size much further. In other words, as long as there are enough graphics cards, the training time can be shortened accordingly. For example, Professor You Yang at the time compressed BERT's training time to just over an hour, an excellent result that has since been adopted by companies such as Google, Facebook, and NVIDIA.
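To make the layer-wise idea concrete, here is a minimal sketch of a LARS-style update step in plain PyTorch (momentum omitted for brevity); it illustrates the trust-ratio mechanism rather than reproducing the exact optimizer used in Colossal-AI.

```python
# Minimal LARS-style step: each layer's update is rescaled by a "trust ratio"
# ||w|| / ||update||, which is what keeps training stable at very large batch sizes.
import torch

@torch.no_grad()
def lars_step(params, lr=0.1, weight_decay=1e-4, eps=1e-8):
    for p in params:
        if p.grad is None:
            continue
        update = p.grad + weight_decay * p                 # gradient plus weight decay
        w_norm, u_norm = p.norm(), update.norm()
        # layer-wise adaptive scaling: step size proportional to the layer's weight norm
        trust_ratio = (w_norm / (u_norm + eps)).item() if w_norm > 0 and u_norm > 0 else 1.0
        p.add_(update, alpha=-lr * trust_ratio)

# usage: after loss.backward(), call lars_step(model.parameters())
```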
In addition, we can also perform model parallelism on large models, including tensor parallelism, pipeline parallelism, etc.
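As an illustration of what tensor parallelism means at the level of a single layer, here is a simplified column-parallel linear layer in plain PyTorch. It is a sketch of the idea only; a real implementation wraps the all-gather in a custom autograd function so gradients flow back correctly.

```python
# Sketch of 1D tensor parallelism: each rank stores a column slice of the weight,
# computes its slice of the output, and the slices are gathered into the full output.
import torch
import torch.distributed as dist

class ColumnParallelLinear(torch.nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        world = dist.get_world_size()
        assert out_features % world == 0
        # this rank only stores out_features / world output columns
        self.weight = torch.nn.Parameter(
            torch.randn(out_features // world, in_features) * 0.02)

    def forward(self, x):
        local_out = x @ self.weight.t()                         # [batch, out / world]
        gathered = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, local_out)                    # collect every rank's slice
        return torch.cat(gathered, dim=-1)                      # [batch, out]
```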
For long sequences, sequence-parallel optimization can also be used, which not only evenly splits the huge memory overhead but also keeps computation and communication efficient. I would like to highlight one point about sequence parallelism: DeepSpeed also has a notion of sequence parallelism, but if you read their code carefully, you will find that when they compute attention, the sequence dimension is not actually split.
In our system, we split along the sequence dimension from beginning to end. The key difficulty is that the attention computation needs to operate over the complete sequence; using a ring algorithm, the subsequences held on different cards are exchanged and synchronized so that attention is still computed correctly. After this kind of splitting, as long as we have enough cards, the training sequence can be arbitrarily long, which fits the current industry trend of releasing models with ever longer sequence lengths.
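The sketch below illustrates, on a single process, the math that makes this splitting possible: attention over the full sequence can be computed chunk by chunk, combining partial results with running softmax statistics, as if each chunk had arrived from another card around the ring. It is an illustration of the principle, not Colossal-AI's implementation, which overlaps the chunk exchange with computation.

```python
# Chunk-by-chunk attention with running softmax statistics (single-process illustration
# of the idea behind ring / sequence-parallel attention).
import torch

def chunked_attention(q, k, v, num_chunks):
    scale = q.shape[-1] ** -0.5
    acc = torch.zeros_like(q)                              # running weighted sum of values
    denom = torch.zeros(q.shape[0], 1)                     # running softmax denominator
    running_max = torch.full((q.shape[0], 1), float("-inf"))
    for k_c, v_c in zip(k.chunk(num_chunks), v.chunk(num_chunks)):
        scores = (q @ k_c.t()) * scale                     # local queries vs. this kv chunk
        new_max = torch.maximum(running_max, scores.max(dim=-1, keepdim=True).values)
        correction = torch.exp(running_max - new_max)      # rescale earlier partial sums
        p = torch.exp(scores - new_max)
        acc = acc * correction + p @ v_c
        denom = denom * correction + p.sum(dim=-1, keepdim=True)
        running_max = new_max
    return acc / denom

q, k, v = (torch.randn(16, 64) for _ in range(3))
full = torch.softmax((q @ k.t()) / 64 ** 0.5, dim=-1) @ v
assert torch.allclose(chunked_attention(q, k, v, num_chunks=4), full, atol=1e-5)
```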
The second is the efficient memory management system.
During deep learning training, you will find that the compute-heavy parts are concentrated where the storage overhead is relatively small, while the storage-heavy parts are concentrated in the optimizer's parameter updates.
Our idea is to put the relatively redundant storage on cheaper storage devices, for example by caching it in CPU memory, while keeping the storage needed for computation on the GPU, which successfully lowers the memory threshold for large models.
In our system, more efficient parameter storage is achieved through an adaptive management system. If all of the redundant storage were placed on the CPU, it would cause frequent data movement between the CPU and the GPU, and the bandwidth between different levels of storage is still a bottleneck. So we keep as much storage as possible on the GPU, cache only the part that exceeds its capacity on the CPU, and minimize data movement, which yields more efficient results.
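A toy version of this placement strategy looks roughly like the following: fp16 weights and the forward/backward passes stay on the GPU, while the fp32 master weights and Adam states, which dominate memory, live in CPU memory. This is a hand-written illustration of the principle, not the adaptive system described above.

```python
# Toy CPU-offload of optimizer state: compute on GPU, keep optimizer memory on CPU.
import torch

model = torch.nn.Linear(1024, 1024).cuda().half()            # fp16 weights on the GPU
master_weights = [p.detach().float().cpu() for p in model.parameters()]
cpu_optimizer = torch.optim.Adam(master_weights, lr=1e-4)    # Adam moments stay in CPU memory

def training_step(loss):
    loss.backward()
    # move gradients to CPU and attach them to the fp32 master copies
    for master, p in zip(master_weights, model.parameters()):
        master.grad = p.grad.detach().float().cpu()
    cpu_optimizer.step()                                      # parameter update happens on CPU
    cpu_optimizer.zero_grad()
    # copy the updated master weights back into the GPU fp16 parameters
    with torch.no_grad():
        for master, p in zip(master_weights, model.parameters()):
            p.copy_(master.to(device=p.device, dtype=p.dtype))
    model.zero_grad(set_to_none=True)
```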
In addition, we implemented a chunk management system. Some ideas here are borrowed: for example, PyTorch DDP aggregates gradients into buckets so that communication happens in larger batches and its efficiency improves as much as possible. We apply the same idea to ZeRO-style parallelism and tensor parallelism: by aggregating different tensors into chunks, we can also manage heterogeneous storage more flexibly.
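The core trick is simple enough to show in a few lines: instead of launching one collective per small tensor, many tensors are packed into one contiguous buffer, reduced once, and scattered back. The sketch below illustrates that bucketing idea in plain PyTorch, not the actual chunk manager.

```python
# Bucketed/fused all-reduce: one communication call for many small tensors.
import torch
import torch.distributed as dist

def fused_all_reduce(tensors):
    flat = torch.cat([t.reshape(-1) for t in tensors])       # pack into one contiguous chunk
    dist.all_reduce(flat)                                     # a single collective call
    offset = 0
    for t in tensors:
        n = t.numel()
        t.copy_(flat[offset:offset + n].view_as(t))           # unpack reduced values
        offset += n
```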
Through the above system optimizations, we have successfully accelerated both training and inference, and also lowered the hardware threshold for training large models.
With its low threshold and high efficiency, our system lets us quickly follow the hottest scenarios in today's AI field. For example, at the beginning of the year we open-sourced a complete ChatGPT-style RLHF solution and launched the multi-turn conversation feature of our Colossal-Chat product.
At the same time, we also have a rich accumulation on the algorithm side, which allows us not only to reproduce the current wealth of open-source large models, but also to make good use of them.
Take enhancing the Chinese capabilities of LLaMA 2, an English-centric base model, as an example: using no more than 8.5B tokens of data and about a thousand yuan of compute, we significantly improved LLaMA 2's Chinese and English capabilities, with results comparable to other, far more expensive Chinese large models pre-trained from scratch.
More importantly, this low-cost solution can migrate open-source large models to any vertical domain with a very low development threshold, yielding customized, high-quality professional large models at low cost.
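At its core, this kind of migration is continued pre-training of an open-source checkpoint on domain data. The sketch below shows a generic version of that workflow with the HuggingFace transformers library; the checkpoint name, data file, and hyperparameters are placeholders, and this is not the specific recipe behind the result described above.

```python
# Generic continued pre-training of an open-source base model on domain text.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "meta-llama/Llama-2-7b-hf"                  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token          # LLaMA has no pad token by default
model = AutoModelForCausalLM.from_pretrained(base)

# "domain_corpus.txt" is a placeholder for your vertical-domain text corpus
dataset = load_dataset("text", data_files="domain_corpus.txt")["train"]
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
                      batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama2-domain", per_device_train_batch_size=1,
                           gradient_accumulation_steps=16, learning_rate=1e-5,
                           num_train_epochs=1, bf16=True),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```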
As a result, our solution has gained recognition in the community: it was selected by NeurIPS as an official base model, and its download numbers on HuggingFace are also considerable.
Finally, this year we also launched products that help more users develop large model applications with a low barrier to entry, such as a cloud platform that integrates training, fine-tuning and deployment, and an all-in-one large model workstation. The all-in-one machine has been optimized to the extreme in both software and hardware and comes packaged with a rich set of models, so it works out of the box and can deploy models with more than a hundred billion parameters on a single machine.
Everyone is very welcome to join our community and build the Colossal-AI and large model ecosystem together. Thank you all.
-over-