

Geely, the biggest AI "dark horse" in the automotive industry: self-developed voice model tops the list, with performance exceeding SOTA by 10%

Last updated: 2024-09-23
Jia Haonan, from Aofei Temple
QbitAI | Official account QbitAI

In the field of large-scale speech synthesis models, the champion changed hands overnight.

The new HAM-TTS large model delivers significant improvements in pronunciation accuracy, naturalness, and speaker similarity over the previous SOTA, VALL-E.

The main research team behind it is the most unexpected "dark horse" in the LLM track this year:

Geely Automobile.

That's right: not an AI-native company, nor a traditional technology giant, but Geely, a company best known for its cars that keeps demonstrating serious hard-tech capabilities.

What is Geely's Xingrui AI large model good for?

The full name of Geely's self-developed voice model HAM-TTS is:

Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech, i.e., hierarchical acoustic modeling for token-based zero-shot text-to-speech. It is an important member of the Xingrui AI large model family.
As the name suggests, within the smart-cockpit experience this technology works on the most critical link in the interaction chain: "pronunciation", i.e., the speech the car produces.
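
For readers unfamiliar with the term, the rough data flow of a generic token-based zero-shot TTS pipeline is sketched below. This is only a conceptual outline with stand-in functions (random tokens, placeholder audio); it is not Geely's actual HAM-TTS implementation, and every name in it is illustrative.

```python
import random

def text_to_phonemes(text: str) -> list[str]:
    # Stand-in front end; a real system uses a grapheme-to-phoneme model.
    return list(text)

def encode_speaker_prompt(reference_audio: bytes) -> list[int]:
    # Stand-in neural-codec encoder: turns a few seconds of reference audio
    # into discrete acoustic tokens that capture the speaker's timbre.
    return [random.randrange(1024) for _ in range(240)]

def acoustic_token_lm(phonemes: list[str], prompt_tokens: list[int]) -> list[int]:
    # Stand-in token language model: conditioned on the text and the speaker
    # prompt, it predicts acoustic tokens for the new utterance.
    return [random.randrange(1024) for _ in range(75 * len(phonemes))]

def decode_to_waveform(tokens: list[int]) -> bytes:
    # Stand-in codec decoder / vocoder: maps acoustic tokens back to audio.
    return bytes(2 * len(tokens))

reference = bytes(2 * 3 * 16000)  # pretend: 3 seconds of 16 kHz, 16-bit audio
phonemes = text_to_phonemes("hello from the cockpit")
tokens = acoustic_token_lm(phonemes, encode_speaker_prompt(reference))
waveform = decode_to_waveform(tokens)
print(len(waveform), "bytes of (placeholder) audio")
```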

Several metrics are typically used to judge how good a voice assistant is:

Pronunciation accuracy is measured by the Character Error Rate (CER), scored with the well-known end-to-end speech toolkit ESPnet.

Speaking-style consistency (NMOS), pitch consistency (SMOS), and the overall score (MOS) were judged and scored subjectively by a 60-person panel recruited by the research team.
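
As a quick illustration of the objective metric, here is a minimal Python sketch of how CER is computed: the edit distance between a reference text and a hypothesis transcript, divided by the reference length. In TTS evaluation the hypothesis is typically obtained by running a speech recognizer (here, ESPnet) over the synthesized audio; the snippet below only shows the underlying calculation.

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = list(reference), list(hypothesis)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(round(character_error_rate("今天天气不错", "今天天汽不错"), 3))  # 0.167: 1 substitution over 6 characters
```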

Overall, at the same scale of roughly 400 million parameters, the character error rate of the HAM-TTS model is about 1.5% lower than that of the SOTA VALL-E model.

The full 800-million-parameter HAM-TTS model reduces the character error rate by 2.3% compared with VALL-E.

HAM-TTS also improves style consistency, pitch consistency, and the overall score by about 10%.

In smart-cockpit interaction scenarios such as virtual-avatar linkage, customized personas, voice navigation, news broadcasts, picture-book reading, storytelling, and live streaming, the Xingrui voice large model provides the indispensable underlying technical support.

The Xingrui voice model has stronger recognition capabilities and better maintains the stability and continuity of the speaker's timbre, avoiding abrupt timbre shifts.

Most importantly, the Xingrui voice model can clone a voice from as little as 3 seconds of sample audio, a marked improvement over the roughly 10-second samples typically required in the industry.
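
To make the "3 seconds of sample input" concrete, here is a minimal sketch of trimming a longer recording down to a 3-second reference clip using only Python's standard library. The file names and the commented synthesis call at the end are hypothetical placeholders; HAM-TTS / the Xingrui model exposes no public API that this code represents.

```python
import wave

def first_seconds(src_path: str, dst_path: str, seconds: float = 3.0) -> None:
    """Copy the first `seconds` of a WAV file into a new file."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        n_frames = min(int(seconds * src.getframerate()), src.getnframes())
        frames = src.readframes(n_frames)
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)      # header frame count is patched on close
        dst.writeframes(frames)

# Example usage (assuming speaker_recording.wav exists):
first_seconds("speaker_recording.wav", "prompt_3s.wav", seconds=3.0)
# prompt_3s.wav would then serve as the speaker prompt for zero-shot cloning,
# e.g. tts.synthesize(text="...", speaker_prompt="prompt_3s.wav")  # hypothetical call
```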