The rapid development of large language models (LLMs) has revolutionized chat systems, enabling unprecedented levels of performance, as exemplified by OpenAI's ChatGPT. However, the training and architectural details of ChatGPT remain undisclosed, hindering research and open source innovation in this area. Inspired by Meta's LLaMA and Stanford's Alpaca projects, researchers from UC Berkeley, CMU, Stanford University, and UC San Diego have jointly launched Vicuna-13B, an open source chatbot backed by an enhanced dataset and easy-to-use, scalable infrastructure.
According to the announcement, Vicuna-13B was built by fine-tuning the LLaMA base model on user-shared conversations collected from ShareGPT.com (a website where users can share their ChatGPT conversations), and it demonstrates competitive performance compared with other open source models such as Stanford Alpaca.
Preliminary evaluation using GPT-4 as the judge shows that Vicuna-13B achieves more than 90% of the quality of OpenAI's ChatGPT and Google's Bard, while outperforming other models such as LLaMA and Stanford Alpaca in more than 90% of cases. Training Vicuna-13B cost about $300. The training and serving code, as well as an online demo, are publicly available for non-commercial use.
To ensure data quality, the Vicuna team converted the HTML back to Markdown, filtered out inappropriate or low-quality samples, and divided lengthy conversations into smaller segments that fit the model's maximum context length.
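As a rough illustration of this kind of cleaning step, the sketch below converts HTML messages back to Markdown and packs turns into segments that fit a 2048-token budget. It is not the team's actual pipeline; the `markdownify` dependency and the word-count estimate of token length are assumptions made purely for illustration.

```python
# Rough sketch of the cleaning/splitting step described above -- not the
# team's actual pipeline. markdownify and the word-count proxy for token
# length are assumptions for illustration only.
from markdownify import markdownify as to_markdown

MAX_CONTEXT = 2048        # model's maximum context length (tokens)
TOKENS_PER_WORD = 1.3     # crude heuristic, assumption only


def clean_turn(html_text: str) -> str:
    """Convert one HTML-formatted ShareGPT message back to Markdown."""
    return to_markdown(html_text).strip()


def split_conversation(turns: list[dict]) -> list[list[dict]]:
    """Split a long conversation into segments that fit the context window.

    `turns` is assumed to look like [{"role": "user", "content": "<p>hi</p>"}, ...].
    """
    segments, current, used = [], [], 0
    for turn in turns:
        text = clean_turn(turn["content"])
        cost = int(len(text.split()) * TOKENS_PER_WORD)
        if current and used + cost > MAX_CONTEXT:
            segments.append(current)   # start a new segment when the budget is exceeded
            current, used = [], 0
        current.append({"role": turn["role"], "content": text})
        used += cost
    if current:
        segments.append(current)
    return segments
```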
The training method is based on Stanford Alpaca, with the following improvements:

Memory optimization: To enable Vicuna to understand long contexts, the development team expanded the maximum context length from 512 in Alpaca to 2048, which significantly increases memory requirements. The resulting memory pressure was addressed with gradient checkpointing and FlashAttention (a sketch appears after this list of improvements).
Multi-turn conversations: The training loss is adjusted to account for multi-turn conversations, and the fine-tuning loss is computed only on the chatbot's output (see the masking sketch after this list).
Reduce costs through spot instances: A 40x larger dataset and 4x longer training sequences pose considerable challenges for training cost. The Vicuna team used SkyPilot managed spot to cut costs by leveraging cheaper spot instances, with automatic recovery from preemptions and automatic region switching. This reduced the training cost of the 7B model from about $500 to around $140, and that of the 13B model from around $1,000 to about $300.
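For the memory optimization above, the following is a minimal sketch of how gradient checkpointing and FlashAttention can be enabled with the Hugging Face Transformers API. It is illustrative only, not Vicuna's actual training code; the checkpoint path is a placeholder, and the `flash_attention_2` flag assumes a recent Transformers version with the flash-attn package installed.

```python
# Minimal sketch: enabling gradient checkpointing and FlashAttention with
# Hugging Face Transformers. Illustrative only -- not Vicuna's training code;
# the checkpoint path is a placeholder.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "path/to/llama-13b",                      # placeholder base checkpoint
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
)

# Trade compute for memory: recompute activations in the backward pass instead
# of storing them, which helps the longer 2048-token context fit in GPU memory.
model.gradient_checkpointing_enable()
```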
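For the multi-turn loss adjustment, the sketch below shows the general idea of masking labels so that only the chatbot's tokens contribute to the loss. It illustrates the technique rather than the team's implementation; the simple turn format is an assumption.

```python
# Sketch of masking the fine-tuning loss so only assistant tokens are
# supervised -- an illustration of the idea, not Vicuna's implementation.
import torch

IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss by default


def build_labels(turn_token_ids: list[list[int]],
                 turn_roles: list[str]) -> tuple[torch.Tensor, torch.Tensor]:
    """Concatenate tokenized turns and keep labels only for assistant turns."""
    input_ids, labels = [], []
    for ids, role in zip(turn_token_ids, turn_roles):
        input_ids.extend(ids)
        if role == "assistant":
            labels.extend(ids)                        # supervise chatbot output
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # mask user/system turns
    return torch.tensor(input_ids), torch.tensor(labels)
```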
The Vicuna team built a serving system that can serve multiple models with distributed workers; it supports flexible plug-in of GPU workers from both local clusters and the cloud. By leveraging SkyPilot's fault-tolerant controller and managed spot features, the serving system works well with cheaper spot instances from multiple clouds to reduce serving costs. It is currently a lightweight implementation, and the team plans to integrate more of the latest research into it.
Specifically, the development team first collected about 70,000 conversations from ShareGPT.com and then enhanced the training scripts provided by Alpaca to better handle multi-turn conversations and long sequences; training was completed with PyTorch FSDP on 8 A100 GPUs in one day. To provide the demo service, they also implemented a lightweight distributed serving system. A preliminary evaluation of model quality was conducted by creating a set of 80 diverse questions and using GPT-4 to judge the model outputs. To compare two models, team members combined the outputs of both models into a single prompt for each question. The prompt was then sent to GPT-4, which evaluated which model provided the better response.
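As a minimal sketch of this pairwise "GPT-4 as judge" setup: the prompt wording and helper names below are assumptions for illustration, not the team's evaluation script, and it assumes the `openai` Python package with an API key in the environment.

```python
# Sketch of pairwise judging with GPT-4, as described above. The prompt text
# is illustrative, not the team's actual evaluation prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_TEMPLATE = (
    "Question: {question}\n\n"
    "Assistant A's answer:\n{answer_a}\n\n"
    "Assistant B's answer:\n{answer_b}\n\n"
    "Compare the two answers for helpfulness, relevance, accuracy, and level "
    "of detail. Give each assistant a score from 1 to 10 and explain briefly."
)


def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    """Send both answers to GPT-4 in a single prompt and return its verdict."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user",
                   "content": JUDGE_TEMPLATE.format(question=question,
                                                    answer_a=answer_a,
                                                    answer_b=answer_b)}],
    )
    return response.choices[0].message.content
```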
The detailed comparison of LLaMA, Alpaca, ChatGPT and Vicuna is as follows:
The Vicuna team showed examples of Alpaca and Vicuna answering the benchmark questions. After fine-tuning Vicuna on the 70K user-shared ChatGPT conversations, they found that Vicuna generates more detailed and well-structured answers than Alpaca, with quality comparable to ChatGPT.
For example, when asked to "write an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions", GPT-4 scored Alpaca-13B 7/10 and Vicuna-13B 10/10. The reasoning was that Alpaca only provided a brief outline of the travel blog post rather than actually writing it as required, resulting in the lower score, while Vicuna-13B wrote a detailed and engaging travel blog post about a recent trip to Hawaii that emphasized cultural experiences and must-see attractions, fully meeting the user's requirements and therefore earning the higher score.
Meanwhile, preliminary findings from the Vicuna team suggest that GPT-4 can produce highly consistent grades and detailed evaluations when comparing chatbot answers. The preliminary GPT-4-based evaluation summarized in the figure below shows that Vicuna reaches 90% of the capability of Bard/ChatGPT. In general, however, building an evaluation system for chatbots remains an open problem that requires further research.
The Vicuna team proposed a GPT-4-based evaluation framework to automatically assess chatbot performance. Eight question categories were designed to test different aspects of a chatbot's capabilities, with ten questions per category, and answers were generated by LLaMA, Alpaca, ChatGPT, Bard, and Vicuna. GPT-4 was then asked to rate the quality of the answers in terms of helpfulness, relevance, accuracy, and level of detail. The results showed that GPT-4 can not only produce relatively consistent scores but also explain in detail why such scores were given (see the linked detailed examples). However, GPT-4 is not very good at judging coding and mathematical tasks.
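A simple way to turn such per-question judgments into the kind of summary statistics quoted below (e.g. the share of questions where one model scores at least as well as another) might look like the following sketch; the record format is an assumption for illustration.

```python
# Sketch: aggregate per-question judge scores into per-category "win or tie"
# rates. The record layout is assumed for illustration only.
from collections import defaultdict


def win_or_tie_rate(records: list[dict]) -> dict[str, float]:
    """`records` is assumed to look like
    {"category": "writing", "score_vicuna": 9, "score_baseline": 7}."""
    wins, totals = defaultdict(int), defaultdict(int)
    for rec in records:
        totals[rec["category"]] += 1
        if rec["score_vicuna"] >= rec["score_baseline"]:
            wins[rec["category"]] += 1
    return {cat: wins[cat] / totals[cat] for cat in totals}
```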
The data shows that in more than 90% of questions, GPT-4 prefers Vicuna's answers over those of LLaMA, Alpaca, and other open source models, and that Vicuna achieves performance competitive with proprietary models (ChatGPT, Bard). In 45% of questions, GPT-4 rated Vicuna's answers as better than or equal to ChatGPT's.
Overall, despite recent industry progress, chatbots still face limitations, such as difficulty with basic math problems and limited coding ability. Developing a comprehensive, standardized evaluation system for chatbots also remains an open question that requires further research.
The development team acknowledges that Vicuna is not good at tasks involving reasoning or mathematics and may have limitations in accurately identifying itself or ensuring the factual accuracy of its outputs. In addition, it has not been fully optimized for safety or to mitigate potential toxicity or bias. To address safety concerns, the team uses the OpenAI moderation API to filter out inappropriate user input in the online demo.
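A minimal sketch of that kind of input filtering with the OpenAI moderation endpoint is shown below; it is illustrative only and may differ from the demo's actual code.

```python
# Sketch: screen user input with the OpenAI moderation endpoint before it
# reaches the chatbot. Illustrative only; the demo's code may differ.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def is_flagged(user_message: str) -> bool:
    """Return True if the moderation endpoint flags the message."""
    result = client.moderations.create(input=user_message)
    return result.results[0].flagged
```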
Reviewing Editor: Li Qian