Vicuna: a low-cost open-source chatbot that reaches more than 90% of ChatGPT/Bard quality

Publisher: blq0681 · Latest update: 2023-04-06 · Source: OSC开源社区 · Author: Lemontree

The rapid development of large language models (LLMs) has revolutionized chat systems, enabling unprecedented levels of performance, as seen with OpenAI's ChatGPT. However, the training and architectural details of ChatGPT remain undisclosed, hindering research and open-source innovation in this area. Inspired by the Meta LLaMA and Stanford Alpaca projects, researchers from UC Berkeley, CMU, Stanford University, and UC San Diego have jointly launched Vicuna-13B, an open-source chatbot backed by an enhanced dataset and easy-to-use, scalable infrastructure.

According to the introduction, by fine-tuning the LLaMA base model on user-shared conversations collected from ShareGPT.com (a website where users can share their ChatGPT conversations), Vicuna-13B demonstrates competitive performance against other open-source models such as Stanford Alpaca.

Preliminary evaluation using GPT-4 as the judge shows that Vicuna-13B achieves more than 90% of the quality of OpenAI's ChatGPT and Google's Bard, while outperforming other models such as LLaMA and Stanford Alpaca in more than 90% of cases. Training Vicuna-13B costs about $300. The training and serving code, as well as an online demo, are publicly available for non-commercial use.

To ensure data quality, the Vicuna team converted the collected HTML back to markdown, filtered out inappropriate or low-quality samples, and divided lengthy conversations into smaller segments that fit the model's maximum context length. The training method is based on Stanford Alpaca, with the following improvements:
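The conversation-splitting step can be sketched as follows. This is an illustrative implementation, not code from the Vicuna repository; `split_conversation` and the crude whitespace "tokenizer" are assumptions made for the example.

```python
def split_conversation(turns, token_len, max_len=2048):
    """Split a list of (role, text) turns into chunks whose total token
    count stays under max_len, breaking only at turn boundaries."""
    chunks, current, current_len = [], [], 0
    for role, text in turns:
        n = token_len(text)
        if current and current_len + n > max_len:
            chunks.append(current)          # flush the full chunk
            current, current_len = [], 0
        current.append((role, text))
        current_len += n
    if current:
        chunks.append(current)
    return chunks

# Crude whitespace "tokenizer", purely for demonstration.
toy_len = lambda s: len(s.split())

convo = [("user", "a " * 1500), ("assistant", "b " * 1500), ("user", "c " * 100)]
parts = split_conversation(convo, toy_len, max_len=2048)
# The three turns no longer fit in one 2048-token window, so they are
# split into two chunks at a turn boundary.
```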

Memory optimization: To enable Vicuna to understand long contexts, the development team expanded the maximum context length from 512 in Alpaca to 2048, which significantly increased memory requirements. The memory pressure was addressed by utilizing gradient checkpointing and flash attention.
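Gradient checkpointing, one of the two techniques mentioned, can be illustrated with a minimal PyTorch sketch. The tiny residual network below is made up for demonstration (Vicuna itself fine-tunes LLaMA); the point is that each block's activations are recomputed during the backward pass instead of being stored, trading compute for memory.

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    """A toy residual feed-forward block standing in for a transformer layer."""
    def __init__(self, d):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(d, 4 * d),
            torch.nn.GELU(),
            torch.nn.Linear(4 * d, d),
        )

    def forward(self, x):
        return x + self.ff(x)

class Net(torch.nn.Module):
    def __init__(self, d=64, n_layers=4):
        super().__init__()
        self.blocks = torch.nn.ModuleList(Block(d) for _ in range(n_layers))

    def forward(self, x):
        for blk in self.blocks:
            # Don't keep this block's activations; recompute them on backward.
            x = checkpoint(blk, x, use_reentrant=False)
        return x

x = torch.randn(2, 2048, 64, requires_grad=True)  # batch, seq len 2048, width
loss = Net()(x).sum()
loss.backward()  # gradients flow through the checkpointed blocks
```

With long sequences and many layers, the activation memory saved this way is what makes a 2048-token context trainable on the same hardware.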

Multi-turn conversations: The training loss is adjusted to account for multi-turn conversations, and the fine-tuning loss is calculated only on the chatbot's output.
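The idea of computing loss only on the chatbot's output can be sketched by masking labels: positions belonging to user turns are set to -100, which PyTorch's cross-entropy loss ignores by convention, so only assistant tokens contribute to the fine-tuning loss. The helper name and toy data are illustrative, not from the Vicuna codebase.

```python
IGNORE_INDEX = -100  # cross_entropy(..., ignore_index=-100) skips these positions

def make_labels(token_ids, roles):
    """token_ids: list of ints; roles: parallel list of 'user'/'assistant'.
    Returns labels where only assistant tokens are kept for the loss."""
    return [tok if role == "assistant" else IGNORE_INDEX
            for tok, role in zip(token_ids, roles)]

ids    = [11, 12, 13, 14, 15]
roles  = ["user", "user", "assistant", "assistant", "user"]
labels = make_labels(ids, roles)
# Only the two assistant positions retain their token ids; the rest are masked.
```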

Reduced costs through spot instances: A 40x larger dataset and 4x longer training sequences pose considerable challenges to training costs. The Vicuna team used SkyPilot managed spot instances to reduce costs, leveraging cheaper spot capacity with automatic recovery from preemption and automatic region switching. This solution cut the training cost of the 7B model from $500 to around $140, and that of the 13B model from around $1,000 to $300.

The Vicuna team built a serving system able to serve multiple models using distributed workers; it supports flexibly plugging in GPU workers from both local clusters and clouds. By leveraging the fault-tolerant controller and managed spot features in SkyPilot, the serving system works well with cheaper spot instances from multiple clouds to reduce serving costs. It is currently a lightweight implementation, and the team plans to integrate more research results into it in the future.

Specifically, the development team first collected about 70,000 conversations from ShareGPT.com, then enhanced the training scripts provided by Alpaca to better handle multi-turn conversations and long sequences; training was completed with PyTorch FSDP on 8 A100 GPUs in one day. To serve the demo, they also implemented a lightweight distributed serving system. A preliminary evaluation of model quality was conducted by creating a set of 80 diverse questions and using GPT-4 to judge the model outputs. To compare two models, team members combined the output of each model into a single prompt for each question. The prompt was then sent to GPT-4, which evaluated which model provided the better response.
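The pairwise judging step can be sketched as follows: the two model answers are packed into one prompt and the judge model (GPT-4 in the article) is asked to score them. The template wording and function names below are illustrative assumptions, not the team's actual prompt.

```python
JUDGE_TEMPLATE = """[Question]
{question}

[Assistant 1's answer]
{answer_a}

[Assistant 2's answer]
{answer_b}

Rate the helpfulness, relevance, accuracy and level of detail of each
answer on a scale of 1-10, then explain your scores."""

def build_judge_prompt(question, answer_a, answer_b):
    """Combine one question and two candidate answers into a single
    prompt for the judge model."""
    return JUDGE_TEMPLATE.format(question=question,
                                 answer_a=answer_a,
                                 answer_b=answer_b)

prompt = build_judge_prompt(
    "Write an engaging travel blog post about a recent trip to Hawaii.",
    "Here is a brief outline for a travel blog post...",
    "Aloha! My recent trip to Hawaii was unforgettable...",
)
# `prompt` would then be sent to the judge model via its chat API.
```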

The detailed comparison of LLaMA, Alpaca, ChatGPT and Vicuna is as follows:

The Vicuna team showed examples of Alpaca and Vicuna answering benchmark questions. After fine-tuning Vicuna on the 70K user-shared ChatGPT conversations, they found that Vicuna was able to generate more detailed and better-structured answers than Alpaca, with quality comparable to ChatGPT.

For example, when asked to "write an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions", GPT-4's evaluation scores were: Alpaca-13b 7/10, Vicuna-13b 10/10. The reasoning was that Alpaca provided only a brief overview of a travel blog post rather than actually writing the post as required, resulting in the lower score, while Vicuna-13b wrote a detailed and engaging travel blog post about a recent trip to Hawaii, emphasizing cultural experiences and must-see attractions, which fully met the user's requirements and therefore received the higher score.

Meanwhile, preliminary findings from the Vicuna team suggest that GPT-4 can produce highly consistent scores and detailed evaluations when comparing chatbot answers. The preliminary GPT-4-based evaluation summarized in the figure below shows that Vicuna reaches 90% of the capabilities of Bard/ChatGPT. In general, however, building an evaluation system for chatbots remains an open problem that requires further research.

The Vicuna team proposed a GPT-4-based evaluation framework to automatically evaluate chatbot performance. Eight question categories were designed to test various aspects of chatbot performance. Ten questions were selected for each category, and answers were generated by LLaMA, Alpaca, ChatGPT, Bard, and Vicuna, respectively. GPT-4 was then asked to evaluate the quality of the answers based on helpfulness, relevance, accuracy, and level of detail. The results showed that GPT-4 can not only produce relatively consistent scores, but also explain in detail why such scores are given (detailed example link). GPT-4 is, however, not very good at judging coding/mathematical tasks.

The data shows that in more than 90% of questions, GPT-4 prefers answers generated by Vicuna over LLaMA, Alpaca, etc., and it achieves performance competitive with proprietary models (ChatGPT, Bard). In 45% of questions, GPT-4 rated Vicuna's answers as better than or equal to ChatGPT's answers.
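Win-rate figures like the 90% and 45% above can be computed from per-question judge scores. The sketch below uses made-up scores purely for illustration; none of the numbers are Vicuna data.

```python
def win_rate(scores):
    """scores: list of (vicuna_score, baseline_score) pairs, one per question.
    Returns the fraction of questions where Vicuna scores at least as well
    as the baseline."""
    wins = sum(1 for v, b in scores if v >= b)
    return wins / len(scores)

judged = [(9, 7), (8, 8), (6, 9), (10, 7)]  # hypothetical GPT-4 scores
rate = win_rate(judged)  # Vicuna ties or wins on 3 of 4 questions
```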

Overall, despite recent industry developments, the fact is that chatbots still face limitations, such as difficulty solving basic math problems or limited coding abilities. Developing a comprehensive, standardized evaluation system for chatbots is also an open question that requires further research.

The development team acknowledges that Vicuna is not good at tasks involving reasoning or mathematics, and may have limitations in accurately identifying itself or ensuring the factual accuracy of its output. In addition, it has not been fully optimized for safety or to mitigate potential toxicity and bias. To address safety issues, the team uses the OpenAI moderation API to filter out inappropriate user input in the online demo.

Reviewing Editor: Li Qian
