Alibaba releases and open-sources two large audio models at once: fast understanding of 50+ languages, plus emotion-aware speech generation in 5 languages
Contributed by the FunAudioLLM team
Quantum Bit | WeChat official account QbitAI
Although OpenAI has been slow to ship the GPT-4o voice assistant, other large audio generation models keep being released one after another, and crucially, they are open source.
Just now, Alibaba's Tongyi Lab also made its move:
Its latest open-source voice model project, FunAudioLLM, has been released, and it contains two models at once: SenseVoice and CosyVoice.
SenseVoice focuses on high-precision multilingual speech recognition, emotion recognition, and audio event detection, supporting recognition in more than 50 languages. Its recognition performance beats the Whisper model, with an improvement of more than 50% on Chinese and Cantonese.
It also has strong emotion recognition capabilities and can detect common human-computer interaction events such as music, applause, laughter, crying, coughing, and sneezing, achieving SOTA results on multiple benchmarks.
CosyVoice focuses on natural speech generation with control over language, timbre, and emotion. It can generate speech in five languages: Chinese, English, Japanese, Cantonese, and Korean, and its output is significantly better than that of traditional speech generation models.
With only 3 to 10 seconds of reference audio, CosyVoice can clone a voice, reproducing details such as prosody and emotion, and it can even generate speech across languages.
Moreover, CosyVoice supports fine-grained control over the emotion and prosody of generated speech via rich text or natural language instructions, significantly improving the emotional expressiveness of the generated audio.
Without further ado, let’s take a look at the uses and effects of FunAudioLLM.
What can FunAudioLLM be used for?
Based on the SenseVoice and CosyVoice models, FunAudioLLM can support a range of human-computer interaction scenarios, such as multilingual speech translation that preserves timbre and emotion, emotional voice conversations, interactive podcasts, and audiobooks.
Simultaneous Interpretation: Multilingual Translation Simulating Timbre and Emotion
By combining SenseVoice, an LLM, and CosyVoice, speech-to-speech translation (S2ST) can be performed seamlessly.
(Note: in the demos, the original recording is shown in bold in the text.) This integrated approach not only improves the efficiency and fluency of translation; by sensing the emotion and intonation in the speech, it can also reproduce the emotional color of the original speech in the translated output, making the conversation more natural and engaging.
Whether for multilingual conference interpretation, cross-cultural communication, or instant voice translation for non-native speakers, this technology can greatly reduce the language gap and information loss in communication.
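To make the cascade concrete, here is a minimal sketch of the three stages. The recognize, translate, and synthesize functions are hypothetical stand-ins for SenseVoice, an LLM, and CosyVoice respectively; they are not the actual project APIs.

```python
# Hypothetical S2ST cascade. recognize/translate/synthesize are illustrative stubs
# standing in for SenseVoice, an LLM, and CosyVoice; they are not the project APIs.

def recognize(audio_path: str) -> tuple[str, str]:
    """SenseVoice stage: return (transcript, detected emotion label)."""
    raise NotImplementedError

def translate(text: str, target_lang: str) -> str:
    """LLM stage: translate the transcript while preserving intent and tone."""
    raise NotImplementedError

def synthesize(text: str, prompt_audio: str, emotion: str, out_path: str) -> None:
    """CosyVoice stage: clone the timbre from prompt_audio and render with the given emotion."""
    raise NotImplementedError

def speech_to_speech_translate(audio_path: str, target_lang: str = "en") -> str:
    text, emotion = recognize(audio_path)             # 1. recognize text + emotion
    translated = translate(text, target_lang)         # 2. translate with an LLM
    out_path = "translated.wav"
    synthesize(translated, audio_path, emotion, out_path)  # 3. re-voice in the speaker's timbre
    return out_path
```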
For example:
Voice dialogue with strong emotional interaction
By integrating SenseVoice, a large language model (LLM), and CosyVoice, an emotional voice chat application can be built.
SenseVoice first analyzes paralinguistic information such as emotion and coughing; the large model then decides what emotion its reply should carry, and CosyVoice renders the reply with the matching vocal emotion, completing a comfortable and natural conversational loop.
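As a rough illustration of one turn in such a loop, the sketch below uses hypothetical callables standing in for the three models; it is not the real interface of any of them.

```python
# Hypothetical single turn of an emotional voice chat. The three callables are
# illustrative stand-ins for SenseVoice, an LLM chat endpoint, and CosyVoice.
import json
from typing import Callable

def chat_turn(
    user_audio: str,
    recognize: Callable[[str], tuple[str, str]],   # SenseVoice: audio -> (text, emotion)
    llm_chat: Callable[[str], str],                # LLM: prompt -> JSON reply string
    synthesize: Callable[[str, str, str], None],   # CosyVoice: (text, emotion, out_path)
) -> str:
    text, user_emotion = recognize(user_audio)
    prompt = (
        f'The user said: "{text}" (detected emotion: {user_emotion}). '
        'Reply briefly and return JSON like {"reply": "...", "emotion": "happy"}.'
    )
    answer = json.loads(llm_chat(prompt))
    out_path = "assistant_reply.wav"
    synthesize(answer["reply"], answer["emotion"], out_path)  # speak the reply with the chosen emotion
    return out_path
```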
In the following examples, all conversations between the user and the assistant are generated by CosyVoice.
It sounds like this:
Exclusive AI podcast radio
By integrating SenseVoice, an LLM-based multi-agent system with real-time world knowledge, and CosyVoice, an interactive podcast radio station can be created.
In such podcasts, SenseVoice uses its high-precision multilingual speech recognition to capture the conversation between the AI podcaster and the user in real time, and it can even identify environmental sounds and emotions.
The LLM multi-agent system processes the speech data provided by SenseVoice, updates its world knowledge base in real time, and keeps topics and information timely and accurate. During the interaction, users can interrupt the AI podcaster at any time to steer the topic. CosyVoice then generates the AI podcaster's voice, with control over language, timbre, and emotion, delivering a rich and varied listening experience.
It sounds like this:
Audiobooks
With an LLM's analytical capabilities, the content of a book can be structured and the emotions within it identified. Combined with CosyVoice's speech generation, this enables far more expressive audiobooks.
The LLM reads the text closely, capturing emotional shifts and story arcs, while CosyVoice renders those emotions as speech with the appropriate emotional color and emphasis, giving listeners an experience that is both vivid and emotionally rich.
Such audiobooks are no longer a flat, monotone reading, but an auditory experience full of emotion and vivid expression that brings each story and character to life.
It sounds like this:
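A rough sketch of the pipeline described above is shown below, again using hypothetical helpers rather than the real APIs: label_emotion stands in for the LLM step, and synthesize_instruct for CosyVoice's instruction-controlled synthesis.

```python
# Hypothetical audiobook pipeline: split a chapter into passages, ask an LLM for an
# emotion label per passage, then synthesize each passage with an instruction-style
# emotion prompt. label_emotion and synthesize_instruct are illustrative stubs.

def label_emotion(passage: str) -> str:
    """LLM stage: return an emotion label such as 'happy' or 'sad' for the passage."""
    raise NotImplementedError

def synthesize_instruct(text: str, instruction: str, out_path: str) -> None:
    """CosyVoice stage: render text following a natural-language style instruction."""
    raise NotImplementedError

def narrate_chapter(chapter_text: str) -> list[str]:
    passages = [p.strip() for p in chapter_text.split("\n\n") if p.strip()]
    wav_paths = []
    for i, passage in enumerate(passages):
        emotion = label_emotion(passage)
        out_path = f"passage_{i:04d}.wav"
        synthesize_instruct(passage, f"Read this in a {emotion} tone.", out_path)
        wav_paths.append(out_path)
    return wav_paths
```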
Analysis of FunAudioLLM's technical principles
CosyVoice
CosyVoice is a large speech generation model based on speech quantization coding.
It discretizes speech into tokens and relies on large-model techniques to achieve natural, fluent speech generation. Compared with traditional speech generation technology, CosyVoice produces more natural prosody and more realistic timbre.
CosyVoice supports five languages and allows fine-grained control over dimensions of the generated speech such as emotion, via natural language or rich text.
The research team released three variants to cover different scenarios: the base model CosyVoice-300M, the SFT fine-tuned model CosyVoice-300M-SFT, and CosyVoice-300M-Instruct, which supports fine-grained control.
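For orientation, a minimal zero-shot voice-cloning sketch in the style of the CosyVoice repository's published examples is shown below. The module paths, method names, and sample rates are assumptions based on those examples, so check the open-source repo for the current API.

```python
# Minimal sketch of zero-shot voice cloning with CosyVoice-300M.
# Module paths and method names follow the repo's published examples at release time;
# treat them as assumptions and see https://github.com/FunAudioLLM/CosyVoice for the current API.
import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice
from cosyvoice.utils.file_utils import load_wav

cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M')

# 3-10 seconds of reference audio, loaded at 16 kHz, plus its transcript.
prompt_speech_16k = load_wav('reference.wav', 16000)
output = cosyvoice.inference_zero_shot(
    'Text to synthesize in the cloned voice.',   # text to speak
    'Transcript of the reference audio.',        # what the reference audio says
    prompt_speech_16k)
torchaudio.save('zero_shot.wav', output['tts_speech'], 22050)
```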
Objective metrics for generated speech
The research team tested the content consistency of synthesized audio via speech recognition on the open-source Chinese dataset AISHELL-3 and the English dataset LibriTTS.
Compared with the original audio and with the recently popular ChatTTS, CosyVoice's synthesized audio is more consistent with the target text and rarely hallucinates extra words.
CosyVoice models the semantic information of the text to be synthesized very well, reaching a level comparable to human speakers. In addition, rescoring the synthesized audio further reduces the recognition error rate, even surpassing humans in content consistency and speaker similarity.
Emotional control ability
The research team also used a pre-trained emotion classification model to evaluate CosyVoice's emotion control, covering five highly expressive vocal emotions: happy, sad, angry, fearful, and disgusted.
The test results show that CosyVoice-300M itself has a certain ability to infer emotions from text content. The model CosyVoice-300M-Instruct, which has undergone fine-grained control training, scores higher in emotion classification and has stronger emotion control capabilities.
SenseVoice
SenseVoice is a foundational speech understanding model with multiple capabilities, covering automatic speech recognition (ASR), language identification (LID), speech emotion recognition (SER), and audio event detection (AED).
The model is designed to provide comprehensive speech processing capabilities, thereby supporting the construction of more complex speech interaction systems.
SenseVoice-Small is a lightweight, encoder-only foundational speech model designed for fast speech understanding. It processes speech quickly and responds with low latency, making it suitable for latency-sensitive applications such as real-time voice interaction systems.
SenseVoice-Large is a larger foundational speech model with both an encoder and a decoder. This version focuses on more accurate speech understanding and supports more languages; it suits scenarios with higher accuracy requirements, handling more complex speech inputs and producing more accurate results.
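As a rough orientation, SenseVoice-Small can be run through the FunASR toolkit roughly as follows; the model identifier and generate() arguments are assumptions based on the project's examples, so consult the SenseVoice repository for the current interface.

```python
# Minimal sketch of running SenseVoice-Small through FunASR's AutoModel.
# The model id and arguments follow the project's examples and should be treated
# as assumptions; see https://github.com/FunAudioLLM/SenseVoice for the current API.
from funasr import AutoModel

model = AutoModel(model="iic/SenseVoiceSmall", trust_remote_code=True)

# language="auto" lets the model detect the language; use_itn enables punctuation
# and inverse text normalization in the transcript.
result = model.generate(input="example.wav", language="auto", use_itn=True)

# The transcript carries rich tags (language, emotion, audio events) alongside the text.
print(result[0]["text"])
```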
Multilingual speech recognition performance
The research team compared the multilingual recognition performance and inference efficiency of SenseVoice and Whisper on open-source datasets including AISHELL-1, AISHELL-2, WenetSpeech, LibriSpeech, and Common Voice.
Inference efficiency was evaluated on an A800 GPU. SenseVoice-Small uses a non-autoregressive end-to-end architecture, giving it extremely low inference latency: it is 7 times faster than Whisper-Small and 17 times faster than Whisper-Large.
Speech emotion recognition performance
SenseVoice can also be used for discrete emotion recognition, and currently supported emotion types include happy, sad, angry, and neutral.
The team evaluated it on seven popular emotion recognition datasets. Even without fine-tuning on the target corpora, SenseVoice-Large matched or surpassed the latest state-of-the-art (SOTA) results on most of them.
Audio event detection performance
Both the SenseVoice-Small and SenseVoice-Large models can detect audio events in speech, including music, applause, and laughter.
In addition to predicting the type of audio event, the SenseVoice-Large model can also accurately identify the start and end times of the events.
In comparison, the SenseVoice-Small model can only predict the type of event in the audio (limited to a single event), but it detects a wider variety of events, such as the coughing, sneezing, breathing, and crying that may occur during human-computer interaction.
Currently, the SenseVoice and CosyVoice models have been open-sourced on ModelScope and Hugging Face, and the corresponding training, inference, and fine-tuning code has been released on GitHub.
If you are interested, the links are below:
FunAudioLLM: https://github.com/FunAudioLLM
CosyVoice open source repository: https://github.com/FunAudioLLM/CosyVoice
CosyVoice online experience: https://www.modelscope.cn/studios/iic/CosyVoice-300M
SenseVoice open source repository: https://github.com/FunAudioLLM/SenseVoice
SenseVoice online experience: https://www.modelscope.cn/studios/iic/SenseVoice