Alibaba releases two large audio generation models at once, both open source! Fast understanding of 50 languages + emotion-aware speech generation in 5 languages

Latest update time: 2024-07-05
Contributed by the FunAudioLLM team
Quantum Bit | Public Account QbitAI

Although OpenAI has been slow to launch its GPT-4o voice assistant, other large audio generation models keep arriving one after another, and crucially, they are open source.

Just now, Alibaba's Tongyi Lab also took action:

It has released its latest open-source speech model project, FunAudioLLM, which includes two models at once: SenseVoice and CosyVoice.

SenseVoice focuses on high-precision multilingual speech recognition, emotion recognition, and audio event detection, supporting more than 50 languages. Its recognition performance surpasses the Whisper model, with improvements of more than 50% on Chinese and Cantonese.

It also has strong emotion recognition capabilities and can detect common human-computer interaction events such as music, applause, laughter, crying, coughing, and sneezing, achieving SOTA results on multiple benchmarks.
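
To give a concrete sense of how this is used, here is a minimal sketch of running SenseVoice through the FunASR toolkit, based on the project's public README; the model ID, argument names, and the tag format in the output are assumptions that may differ in your installed version.

```python
# Minimal sketch: multilingual recognition with SenseVoice via FunASR.
# Model ID, arguments, and output tag format follow the public README
# and are assumptions that may vary across versions.
from funasr import AutoModel

model = AutoModel(model="iic/SenseVoiceSmall", trust_remote_code=True)

# language="auto" lets the model detect the spoken language itself;
# use_itn=True enables inverse text normalization (punctuation, numbers).
result = model.generate(
    input="meeting_clip.wav",  # hypothetical local audio file
    language="auto",
    use_itn=True,
)

# The transcript carries rich tags for emotion and audio events,
# e.g. <|HAPPY|> or <|Laughter|>, alongside the recognized text.
print(result[0]["text"])
```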

CosyVoice focuses on natural speech generation, with control over language, timbre, and emotion. It supports generation in five languages: Chinese, English, Japanese, Cantonese, and Korean, and its quality is significantly better than that of traditional speech generation models.

With only 3 to 10 seconds of reference audio, CosyVoice can reproduce the speaker's timbre, down to details such as prosody and emotion, and it even supports cross-lingual speech generation.

Moreover, CosyVoice supports fine-grained control over the emotion and prosody of generated speech via rich text or natural language instructions, significantly improving the emotional expressiveness of the generated audio.
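
As an illustration of the voice-cloning workflow just described, here is a minimal sketch based on the CosyVoice repository's README; the model path, file names, and the exact return type of the inference call are assumptions and may differ between versions.

```python
# Minimal sketch: zero-shot voice cloning with CosyVoice from a short
# reference clip. Paths, model names, and return format are assumptions
# taken from the public README and may differ in your version.
import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice
from cosyvoice.utils.file_utils import load_wav

cosyvoice = CosyVoice("pretrained_models/CosyVoice-300M")

# A 3-10 second clip whose timbre, prosody, and emotion should be imitated.
prompt_speech_16k = load_wav("reference_clip.wav", 16000)  # hypothetical file

output = cosyvoice.inference_zero_shot(
    "The sentence to be spoken in the cloned voice.",  # target text
    "Transcript of the reference clip.",               # prompt text
    prompt_speech_16k,
)
torchaudio.save("cloned.wav", output["tts_speech"], 22050)
```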

Without further ado, let’s take a look at the uses and effects of FunAudioLLM.

What can FunAudioLLM be used for?

Built on the SenseVoice and CosyVoice models, FunAudioLLM supports a wide range of human-computer interaction scenarios, such as multilingual speech translation that preserves timbre and emotion, emotional voice conversations, interactive podcasts, and audiobooks.

Simultaneous Interpretation: Multilingual Translation Simulating Timbre and Emotion

By combining SenseVoice, an LLM, and CosyVoice, speech-to-speech translation (S2ST) can be performed seamlessly.
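
As a rough sketch of how the three pieces chain together, the code below transcribes the source speech with SenseVoice, translates the text with an LLM (the translate_with_llm helper is a hypothetical placeholder), and re-synthesizes the translation in the original speaker's timbre with CosyVoice's cross-lingual mode; model names and call signatures follow the two projects' READMEs and may differ in your versions.

```python
# Sketch of a SenseVoice -> LLM -> CosyVoice speech-to-speech pipeline.
# translate_with_llm() is a hypothetical placeholder; model IDs and call
# signatures follow the public READMEs and are assumptions.
import torchaudio
from funasr import AutoModel
from cosyvoice.cli.cosyvoice import CosyVoice
from cosyvoice.utils.file_utils import load_wav


def translate_with_llm(text: str, target_lang: str) -> str:
    """Placeholder for any LLM translation call (e.g. a Qwen or GPT API)."""
    raise NotImplementedError


# 1. Recognize the source speech; the language is auto-detected.
asr = AutoModel(model="iic/SenseVoiceSmall", trust_remote_code=True)
source_text = asr.generate(
    input="source_speech.wav", language="auto", use_itn=True
)[0]["text"]

# 2. Translate the transcript into the target language.
target_text = translate_with_llm(source_text, target_lang="en")

# 3. Speak the translation with the original speaker's timbre (cross-lingual mode).
tts = CosyVoice("pretrained_models/CosyVoice-300M")
prompt_speech_16k = load_wav("source_speech.wav", 16000)
out = tts.inference_cross_lingual(target_text, prompt_speech_16k)
torchaudio.save("translated_speech.wav", out["tts_speech"], 22050)
```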

Note that in the examples that follow, the original recording is shown in bold. This integrated approach not only improves the efficiency and fluency of translation, but also, by sensing the emotion and intonation in the speech, reproduces the emotional color of the original speech in the translation, making the conversation more natural and vivid.

Whether for multilingual conference interpretation, cross-cultural communication, or instant voice translation for non-native speakers, this technology can greatly reduce the language gap and information loss in communication.

For example: