
New work from Chen Danqi's team: data volume reduced by 95%, and large model performance is even stronger! Less is More

Latest update time:2024-02-10
Bai Jiao, reporting from Aofei Temple
Qubit | Official account QbitAI

The cost of building large models has been reduced again!

This time, the data volume was reduced by 95%.

Chen Danqi's team recently proposed a cost-cutting method for large models:

The data selection algorithm LESS picks out only the 5% of the data most relevant to the task for instruction fine-tuning, and the result is even better than training on the entire dataset.

Instruction fine-tuning is the key step in turning a base model into a ChatGPT-style assistant model.

This makes it cheaper and more efficient for the large model industry to build specialized models.

More importantly, the selected training data is transferable: as long as the selection targets a specific task, it can be applied to other large models of various types.

Let's take a look at what this freshly released paper says.

LESS algorithm

Instruction fine-tuning unlocks the capabilities of large models and can effectively leverage combined datasets to build chatbots.

The challenge is how to identify, within these datasets, the data most relevant for training specialized skills. This setting is called targeted instruction fine-tuning.

To solve this problem, the researchers designed an optimizer-aware approach to data selection, inspired by past work that uses gradient information to estimate the influence of individual training data points.

LESS (Low-rank gradiEnt Similarity Search), in short, prioritizes training on data that directly helps the target task, rather than relying on surface features.

It is mainly divided into four steps.

First, a small subset is sampled from the training dataset and used to warm up a selection model with LoRA.
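A minimal sketch of this warmup step, assuming the Hugging Face transformers and peft libraries; the model name and LoRA hyperparameters below are illustrative choices, not necessarily the paper's exact configuration.

```python
# Sketch of step 1: attach LoRA adapters to a base model for the warmup run.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"   # base model discussed in the article
tokenizer = AutoTokenizer.from_pretrained(model_name)
base = AutoModelForCausalLM.from_pretrained(model_name)

# Low-rank adapters keep the number of trainable parameters small while the
# model is warmed up on a small random subset of the instruction data.
lora_cfg = LoraConfig(
    r=128, lora_alpha=512, lora_dropout=0.1,                 # illustrative values
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()

# A standard training loop (e.g. transformers.Trainer with an Adam optimizer)
# then produces the checkpoints whose gradients are featurized in step 2.
```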

Next, Adam-preconditioned LoRA gradient features are computed for individual training data points and saved in a gradient datastore.
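As a rough sketch of what such a gradient feature could look like, assuming the warmup run's Adam moment estimates for the LoRA parameters have been cached in flattened form; the function and variable names are hypothetical, and the dense projection matrix stands in for the more efficient random projection used in practice.

```python
import torch

def adam_lora_feature(model, batch, m_state, v_state, proj,
                      beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam-preconditioned LoRA gradient for one example, randomly projected."""
    model.zero_grad()
    model(**batch).loss.backward()                  # gradient for one example

    parts = []
    for name, p in model.named_parameters():
        if p.requires_grad and p.grad is not None:  # LoRA parameters only
            g = p.grad.detach().flatten()
            # Fold this example's gradient into the cached (flattened) Adam
            # moments, then precondition it the way an Adam update would.
            m = beta1 * m_state[name] + (1 - beta1) * g
            v = beta2 * v_state[name] + (1 - beta2) * g ** 2
            parts.append(m / (v.sqrt() + eps))
    flat = torch.cat(parts)

    # Random projection keeps each stored feature low-dimensional.
    return proj @ flat                              # proj: (d_low, d_lora)
```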

The third step is data selection. For a target task with only a few examples (possibly spanning multiple subtasks), the researchers compute gradient features for each validation subtask, then select the 5% of the training data most similar to them from the datastore.
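Below is a minimal sketch of this selection step, assuming the projected gradient features have already been stacked into tensors; the names are illustrative, and the paper's averaging over multiple warmup checkpoints is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def select_top_fraction(train_feats, val_feats_per_subtask, fraction=0.05):
    """Return indices of training points most similar to the target task.

    train_feats:           (N, d) projected gradient features of training data
    val_feats_per_subtask: list of (n_i, d) feature tensors, one per subtask
    """
    scores = torch.full((train_feats.shape[0],), float("-inf"))
    norm_train = F.normalize(train_feats, dim=1)
    for val_feats in val_feats_per_subtask:
        # Average each subtask's validation features into a single direction.
        q = F.normalize(val_feats.mean(dim=0), dim=0)
        sim = norm_train @ q                         # cosine similarity
        scores = torch.maximum(scores, sim)          # best-matching subtask wins

    k = max(1, int(fraction * train_feats.shape[0]))
    return torch.topk(scores, k).indices             # top 5% by default
```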

Finally, the target model is trained on the selected data, using either LoRA or full fine-tuning.

The first two steps can be run offline, and each candidate training set D only needs to be featurized once.

To sum up, LESS has the following characteristics:

  • Compatible with the Adam optimizer. LESS combines gradient information with the optimizer state to study the influence of data on model performance.

  • Efficient. LESS uses LoRA and random projection to build a gradient datastore with low-dimensional, easy-to-manipulate gradient features, enabling efficient and effective dataset selection. The gradient datastore can be reused for new target tasks (see the sketch after this list).
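To illustrate the reuse point, here is a hypothetical snippet that loads a cached datastore and reruns selection for a new task, building on the select_top_fraction sketch above; the file paths are made up.

```python
import torch

# The training-side gradient features were built once, offline (steps 1-2).
train_feats = torch.load("grad_store/train_features.pt")

# Only the new task's validation features need to be computed.
new_task_val = [torch.load("grad_store/new_task_subtask0.pt"),
                torch.load("grad_store/new_task_subtask1.pt")]

top_idx = select_top_fraction(train_feats, new_task_val, fraction=0.05)
torch.save(top_idx, "selected_indices_new_task.pt")
```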

In the final evaluation, on the MMLU, TydiQA, and BBH tasks, training a large model on 5% of the data outperforms training on the entire dataset.

Compared with random selection, LESS is consistently 2 to 5 percentage points better, indicating that the method is very effective.

In addition, they found that LESS is transferable.

Data selected with LLAMA-2-7B also yields better performance when used to train LLAMA-2-13B and MISTRAL-7B (the LESS-T column).

In some cases it even beats running LESS on the model itself (the LESS column).

Beyond that, there is interpretability. LESS selects data whose reasoning and skill types are similar to the target task, while existing methods (such as BM25 and RDS) often select data based only on surface cues (such as language or text format).

Produced by Chen Danqi’s team

The authors of the paper are researchers from Princeton University and the University of Washington.

Princeton computer science PhD students Xia Mengzhou and Sadhika Malladi are co-first authors.

Xia Mengzhou holds a bachelor's degree from Fudan University and a master's degree from CMU, and is currently advised by Chen Danqi.

According to Chen Danqi's personal homepage, "These days, I am mainly drawn to the development of large models." The topics she is researching include:

  • How retrieval can play an important role in next-generation models, improving factuality, adaptability, interpretability, and trustworthiness.

  • Low-cost training and deployment of large models, including improved training methods, data curation, model compression, and optimization for downstream task adaptation.

  • Work that genuinely improves understanding of the capabilities and limitations of current large models, both empirically and theoretically.

Some time ago, the team proposed the popular "alpaca shearing" method:

The LLM-Shearing pruning method achieves SOTA among open-source large models at the 1B-3B scale using only 3% of the compute and 5% of the cost.

If the first half of large model research was about piling on parameters, the second half is about "less is more": fewer parameters, better results, helping large models land faster in more fields.

Paper link:
https://arxiv.org/abs/2402.04333

-over-
