Vicuna: a low-cost open-source chatbot that reaches more than 90% of ChatGPT/Bard quality

Publisher: blq0681 · Latest update: 2023-04-06 · Source: OSC开源社区 · Author: Lemontree

The rapid development of large language models (LLMs) has revolutionized chat systems, enabling unprecedented levels of performance, as seen with OpenAI's ChatGPT. However, the training and architectural details of ChatGPT remain undisclosed, hindering research and open-source innovation in this area. Inspired by the Meta LLaMA and Stanford Alpaca projects, researchers from UC Berkeley, CMU, Stanford University, and UC San Diego have jointly launched Vicuna-13B, an open-source chatbot backed by an enhanced dataset and easy-to-use, scalable infrastructure.

According to the introduction, by fine-tuning the LLaMA base model on user-shared conversations collected from ShareGPT.com (a website where users can share their ChatGPT conversations), Vicuna-13B demonstrates competitive performance against other open-source models such as Stanford Alpaca.

Preliminary evaluation using GPT-4 as the judge shows that Vicuna-13B achieves more than 90% of the quality of OpenAI's ChatGPT and Google's Bard, while outperforming other models such as LLaMA and Stanford Alpaca in more than 90% of cases. Training Vicuna-13B costs about $300. The training and serving code, as well as an online demo, are publicly available for non-commercial use.

To ensure data quality, the Vicuna team converted the collected HTML back to markdown, filtered out inappropriate or low-quality samples, and divided lengthy conversations into smaller segments that fit the model's maximum context length. The training method is based on Stanford Alpaca, with the following improvements:
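The conversation-splitting step can be sketched as follows. This is an illustrative implementation, not code from the Vicuna repository; `split_conversation` and the crude whitespace "tokenizer" are assumptions made for the example.

```python
def split_conversation(turns, token_len, max_len=2048):
    """Split a list of (role, text) turns into chunks whose total token
    count stays under max_len, breaking only at turn boundaries."""
    chunks, current, current_len = [], [], 0
    for role, text in turns:
        n = token_len(text)
        if current and current_len + n > max_len:
            chunks.append(current)          # flush the full chunk
            current, current_len = [], 0
        current.append((role, text))
        current_len += n
    if current:
        chunks.append(current)
    return chunks

# Crude whitespace "tokenizer", purely for demonstration.
toy_len = lambda s: len(s.split())

convo = [("user", "a " * 1500), ("assistant", "b " * 1500), ("user", "c " * 100)]
parts = split_conversation(convo, toy_len, max_len=2048)
# The three turns no longer fit in one 2048-token window, so they are
# split into two chunks at a turn boundary.
```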

Memory optimization: To enable Vicuna to understand long contexts, the development team expanded the maximum context length from 512 in Alpaca to 2048, which significantly increased memory requirements. The memory pressure was addressed by utilizing gradient checkpointing and flash attention.
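Gradient checkpointing, one of the two techniques mentioned, can be illustrated with a minimal PyTorch sketch. The tiny residual network below is made up for demonstration (Vicuna itself fine-tunes LLaMA); the point is that each block's activations are recomputed during the backward pass instead of being stored, trading compute for memory.

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    """A toy residual feed-forward block standing in for a transformer layer."""
    def __init__(self, d):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(d, 4 * d),
            torch.nn.GELU(),
            torch.nn.Linear(4 * d, d),
        )

    def forward(self, x):
        return x + self.ff(x)

class Net(torch.nn.Module):
    def __init__(self, d=64, n_layers=4):
        super().__init__()
        self.blocks = torch.nn.ModuleList(Block(d) for _ in range(n_layers))

    def forward(self, x):
        for blk in self.blocks:
            # Don't keep this block's activations; recompute them on backward.
            x = checkpoint(blk, x, use_reentrant=False)
        return x

x = torch.randn(2, 2048, 64, requires_grad=True)  # batch, seq len 2048, width
loss = Net()(x).sum()
loss.backward()  # gradients flow through the checkpointed blocks
```

With long sequences and many layers, the activation memory saved this way is what makes a 2048-token context trainable on the same hardware.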

Multi-turn conversations: The training loss is adjusted to account for multi-turn conversations, and the fine-tuning loss is calculated only on the chatbot's output.
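The idea of computing loss only on the chatbot's output can be sketched by masking labels: positions belonging to user turns are set to -100, which PyTorch's cross-entropy loss ignores by convention, so only assistant tokens contribute to the fine-tuning loss. The helper name and toy data are illustrative, not from the Vicuna codebase.

```python
IGNORE_INDEX = -100  # cross_entropy(..., ignore_index=-100) skips these positions

def make_labels(token_ids, roles):
    """token_ids: list of ints; roles: parallel list of 'user'/'assistant'.
    Returns labels where only assistant tokens are kept for the loss."""
    return [tok if role == "assistant" else IGNORE_INDEX
            for tok, role in zip(token_ids, roles)]

ids    = [11, 12, 13, 14, 15]
roles  = ["user", "user", "assistant", "assistant", "user"]
labels = make_labels(ids, roles)
# Only the two assistant positions retain their token ids; the rest are masked.
```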

Reduced costs through spot instances: A 40x larger dataset and 4x longer training sequences pose considerable challenges to training costs. The Vicuna team used SkyPilot managed spot instances to reduce costs, leveraging cheaper spot capacity with automatic recovery from preemption and automatic region switching. This solution cut the training cost of the 7B model from $500 to around $140, and that of the 13B model from around $1,000 to $300.

The Vicuna team built a serving system able to serve multiple models using distributed workers; it supports flexibly plugging in GPU workers from both local clusters and clouds. By leveraging the fault-tolerant controller and managed spot features in SkyPilot, the serving system works well with cheaper spot instances from multiple clouds to reduce serving costs. It is currently a lightweight implementation, and the team plans to integrate more research results into it in the future.

Specifically, the development team first collected about 70,000 conversations from ShareGPT.com, then enhanced the training scripts provided by Alpaca to better handle multi-turn conversations and long sequences; training was completed with PyTorch FSDP on 8 A100 GPUs in one day. To serve the demo, they also implemented a lightweight distributed serving system. A preliminary evaluation of model quality was conducted by creating a set of 80 diverse questions and using GPT-4 to judge the model outputs. To compare two models, team members combined the output of each model into a single prompt for each question. The prompt was then sent to GPT-4, which evaluated which model provided the better response.
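The pairwise judging step can be sketched as follows: the two model answers are packed into one prompt and the judge model (GPT-4 in the article) is asked to score them. The template wording and function names below are illustrative assumptions, not the team's actual prompt.

```python
JUDGE_TEMPLATE = """[Question]
{question}

[Assistant 1's answer]
{answer_a}

[Assistant 2's answer]
{answer_b}

Rate the helpfulness, relevance, accuracy and level of detail of each
answer on a scale of 1-10, then explain your scores."""

def build_judge_prompt(question, answer_a, answer_b):
    """Combine one question and two candidate answers into a single
    prompt for the judge model."""
    return JUDGE_TEMPLATE.format(question=question,
                                 answer_a=answer_a,
                                 answer_b=answer_b)

prompt = build_judge_prompt(
    "Write an engaging travel blog post about a recent trip to Hawaii.",
    "Here is a brief outline for a travel blog post...",
    "Aloha! My recent trip to Hawaii was unforgettable...",
)
# `prompt` would then be sent to the judge model via its chat API.
```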

The detailed comparison of LLaMA, Alpaca, ChatGPT and Vicuna is as follows:

The Vicuna team showed examples of Alpaca and Vicuna answering benchmark questions. After fine-tuning Vicuna on the 70K user-shared ChatGPT conversations, they found that Vicuna was able to generate more detailed and better-structured answers than Alpaca, with quality comparable to ChatGPT.

For example, when asked to "write an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions", GPT-4's evaluation scores were: Alpaca-13b 7/10, Vicuna-13b 10/10. The reasoning was that Alpaca provided only a brief overview of a travel blog post rather than actually writing the post as required, resulting in the lower score, while Vicuna-13b wrote a detailed and engaging travel blog post about a recent trip to Hawaii, emphasizing cultural experiences and must-see attractions, which fully met the user's requirements and therefore received the higher score.

Meanwhile, preliminary findings from the Vicuna team suggest that GPT-4 can produce highly consistent scores and detailed evaluations when comparing chatbot answers. The preliminary GPT-4-based evaluation summarized in the figure below shows that Vicuna reaches 90% of the capabilities of Bard/ChatGPT. In general, however, building an evaluation system for chatbots remains an open problem that requires further research.

The Vicuna team proposed a GPT-4-based evaluation framework to automatically evaluate chatbot performance. Eight question categories were designed to test various aspects of chatbot performance. Ten questions were selected for each category, and answers were generated by LLaMA, Alpaca, ChatGPT, Bard, and Vicuna, respectively. GPT-4 was then asked to evaluate the quality of the answers based on helpfulness, relevance, accuracy, and level of detail. The results showed that GPT-4 can not only produce relatively consistent scores, but also explain in detail why such scores are given (detailed example link). GPT-4 is, however, not very good at judging coding/mathematical tasks.

The data shows that in more than 90% of questions, GPT-4 prefers answers generated by Vicuna over LLaMA, Alpaca, etc., and it achieves performance competitive with proprietary models (ChatGPT, Bard). In 45% of questions, GPT-4 rated Vicuna's answers as better than or equal to ChatGPT's answers.
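Win-rate figures like the 90% and 45% above can be computed from per-question judge scores. The sketch below uses made-up scores purely for illustration; none of the numbers are Vicuna data.

```python
def win_rate(scores):
    """scores: list of (vicuna_score, baseline_score) pairs, one per question.
    Returns the fraction of questions where Vicuna scores at least as well
    as the baseline."""
    wins = sum(1 for v, b in scores if v >= b)
    return wins / len(scores)

judged = [(9, 7), (8, 8), (6, 9), (10, 7)]  # hypothetical GPT-4 scores
rate = win_rate(judged)  # Vicuna ties or wins on 3 of 4 questions
```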

Overall, despite recent industry developments, the fact is that chatbots still face limitations, such as difficulty solving basic math problems or limited coding abilities. Developing a comprehensive, standardized evaluation system for chatbots is also an open question that requires further research.

The development team acknowledges that Vicuna is not good at tasks involving reasoning or mathematics, and may have limitations in accurately identifying itself or ensuring the factual accuracy of its output. In addition, it has not been fully optimized for safety or to mitigate potential toxicity and bias. To address safety issues, the team uses the OpenAI moderation API to filter out inappropriate user input in the online demo.

Reviewing Editor: Li Qian
