Geely, the auto industry's biggest AI "dark horse": its self-developed voice model tops the charts, beating the previous SOTA by 10%
Jia Haonan from Aofei Temple
Quantum Bit | Public Account QbitAI
In the field of large-scale speech synthesis models, the champion changed hands overnight.
The new HAM-TTS large model significantly improves on the previous SOTA, VALL-E, in pronunciation accuracy, naturalness, and speaker similarity.
The main research team behind it is the most unexpected "dark horse" in the LLM track this year:
Geely Automobile.
That's right: not an AI-native company, not a traditional technology giant, but Geely, a company best known for its cars that keeps demonstrating hard-tech capability.
What can Geely's Xingrui AI large model do?
The full name of Geely's self-developed voice model HAM-TTS is Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech. It is a key member of the Xingrui AI large model system.
As the name suggests, within the smart-cockpit experience this technology acts on the most critical link in the interaction chain: pronunciation.
A voice assistant is usually judged on several metrics:
Pronunciation accuracy is measured by the Character Error Rate (CER), scored with the well-known end-to-end speech toolkit ESPnet.
Naturalness (NMOS), speaker similarity (SMOS), and overall quality (MOS) were judged subjectively by a 60-person panel recruited by the research team.
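For context, CER is simply the character-level edit distance between the recognizer's transcript of the synthesized audio and the reference text, normalized by reference length. A minimal self-contained sketch in Python (not ESPnet's own implementation):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein distance between hypothesis and
    reference, divided by the reference length."""
    r, h = list(reference), list(hypothesis)
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(r)][len(h)] / max(len(r), 1)

print(cer("hello world", "hallo world"))  # one substitution in 11 chars ~ 0.091
```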
Overall, at the same scale of roughly 400 million parameters, HAM-TTS's character error rate is about 1.5% lower than that of the SOTA VALL-E model.
The full 800-million-parameter HAM-TTS cuts the character error rate by 2.3% relative to VALL-E.
HAM-TTS also improves by roughly 10% on naturalness, speaker similarity, and overall quality.
In smart-cockpit interaction scenarios such as avatar linkage, customized personas, voice navigation, news reading, picture-book narration, storytelling, and live broadcasts, the Xingrui voice model provides the indispensable technical backbone.
First, the Xingrui voice model recognizes speech more accurately and keeps the speaker's timbre stable and continuous, with no sudden timbre shifts.
Whether it is a formal news broadcast, the relaxed mood of a joke, or the warmth of a picture-book reading, it can also intelligently adjust tone, intonation, pauses, emotion, and other parameters to the needs of the scene, giving users a more immersive, natural, and vivid personalized voice interaction.
Second, it switches seamlessly between languages. Whatever language or dialect the user speaks, it can synthesize fluent Chinese or English while keeping the same timbre.
You can speak in your own dialect and the system converts it directly into Mandarin or even another dialect.
It currently supports synthesis in dialects such as Sichuanese, Cantonese, and Northeastern Mandarin, and even cross-lingual synthesis in Japanese, Korean, and Southeast Asian languages.
Most importantly, the Xingrui voice model can clone a voice from as little as 3 seconds of sample audio, a marked improvement over the roughly 10-second samples the industry typically requires.
This, at the user-experience level, is the Xingrui voice model's greatest academic contribution: innovative speech synthesis techniques and data augmentation strategies that raise TTS performance while cutting training cost.
How did Geely do it?
TTS models are already widely used in text-to-speech interactive applications. The typical pipeline has three steps: text processing, acoustic feature extraction, and speech synthesis.
The first two steps rely on mature rule-based algorithms; neural networks usually handle only the final synthesis step, and the models are not especially large. Even VALL-E, the pioneering token-based synthesis model, is modest in scale: judging from its training configuration of 16 V100 GPUs, about 400 million parameters.
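Schematically, a token-based model of this kind treats text and speech as one flat token sequence. A hedged sketch of that input construction (shapes and vocabulary sizes are illustrative assumptions, not VALL-E's actual code):

```python
import torch

# Schematic of a token-based zero-shot TTS pipeline in the VALL-E style.
text_tokens   = torch.randint(0, 512,  (1, 40))   # phonemized input text
prompt_tokens = torch.randint(0, 1024, (1, 150))  # speech tokens from a short voice prompt

# Stage 1: an autoregressive LM predicts coarse speech tokens from the flat
# concatenation [text ; speech prompt] -- the only conditioning signal.
ar_input = torch.cat([text_tokens, prompt_tokens], dim=1)
print(ar_input.shape)  # torch.Size([1, 190])

# Stage 2 (omitted): a non-autoregressive model fills in the remaining codec
# quantizer levels, and a neural codec decoder renders the waveform.
```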
However, when the input text is simply concatenated with speech tokens as the large model's input, there is too little semantic information to constrain the model; the text and speech are not "aligned". This is why traditional token-based TTS models suffer from low pronunciation accuracy and inconsistent speaking style and timbre.
The problem can be eased with large amounts of diverse training data, but that stretches the development cycle and drives up cost.
Geely's solution is to introduce hierarchical acoustic modeling into the traditional TTS model structure:
Specifically, a Text-to-LVS predictor is introduced to predict, from text, a latent variable sequence (LVS) carrying key acoustic and semantic information as supplementary input. At inference time, these latent variables are fed into the large model together with the text prompt.
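A minimal sketch of what such a predictor could look like; the small Transformer encoder below is an assumption for illustration, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class TextToLVS(nn.Module):
    """Hypothetical Text-to-LVS predictor: maps a phoneme sequence to a latent
    variable sequence (LVS) of the same length carrying acoustic/semantic cues."""
    def __init__(self, n_phonemes=512, dim=256):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )

    def forward(self, phonemes):                   # (B, T) phoneme ids
        return self.encoder(self.embed(phonemes))  # (B, T, dim) predicted LVS

predictor = TextToLVS()
phonemes = torch.randint(0, 512, (1, 40))
lvs = predictor(phonemes)
# At inference, this LVS goes into the large model alongside the text prompt,
# constraining pronunciation and style.
print(lvs.shape)  # torch.Size([1, 40, 256])
```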
This markedly reduces pronunciation errors and style drift in the synthesized speech. During training, data segments are also replaced and duplicated to improve timbre uniformity.
In the training phase, an aligner (the Text-HuBERT Aligner) is introduced to generate the supervised LVS that trains the Text-to-LVS predictor: it aligns the text (phoneme) sequence with the speech's HuBERT features, producing a supervised LVS sequence of the same length as the phoneme sequence.
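One way to realize such an alignment is cross-attention, with phonemes as queries and HuBERT frames as keys and values, so that the output has one supervised LVS vector per phoneme. The paper's actual aligner may differ; dimensions here are assumptions:

```python
import torch
import torch.nn as nn

# Illustrative alignment idea: phoneme-as-query cross-attention over HuBERT frames.
dim = 256
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)

phoneme_emb  = torch.randn(1, 40, dim)   # (B, T_phonemes, dim)
hubert_feats = torch.randn(1, 300, dim)  # (B, T_frames, dim), projected HuBERT features

supervised_lvs, _ = attn(query=phoneme_emb, key=hubert_feats, value=hubert_feats)
print(supervised_lvs.shape)  # torch.Size([1, 40, 256]) -- same length as the phonemes
```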
After the audio features are extracted, K-means clustering is applied to strip the speaker's personal characteristics from the raw features, so the model focuses on the shared properties of speech, improving both its generalization and the timbre consistency of the synthesized output.
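The idea can be sketched in a few lines: quantizing HuBERT frame features with k-means replaces each frame with a cluster id, discarding speaker-specific detail while keeping the shared phonetic content. The cluster count below is an assumption, not the paper's setting:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for real HuBERT features: (frames x feature dim)
hubert_feats = np.random.randn(3000, 768).astype(np.float32)

kmeans = KMeans(n_clusters=500, n_init=10, random_state=0).fit(hubert_feats)
units = kmeans.predict(hubert_feats)  # one discrete, speaker-agnostic unit per frame
print(units[:10])
```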
Alongside these accuracy improvements, the team adopted a UNet-based voice-conversion pretraining model to generate large volumes of synthetic speech with different timbres but identical content, increasing the diversity and quantity of training data and with it the TTS model's performance and generalization.
First, HuBERT features and the fundamental frequency (F0) are extracted from the speech and passed through a ResNet. The data is then downsampled by an encoder and upsampled by a decoder before being restored to an audio signal. At each decoder upsampling step, the target speaker's embedding is injected, changing the timbre of the speech without changing its content.
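A sketch of one such decoder upsampling stage: the content representation is upsampled in time, and the target speaker embedding is injected at every stage so the timbre changes while the content stays fixed. Layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class VCDecoderStage(nn.Module):
    """One decoder stage of a UNet-style voice converter (illustrative)."""
    def __init__(self, dim, spk_dim):
        super().__init__()
        self.up = nn.ConvTranspose1d(dim, dim, kernel_size=4, stride=2, padding=1)
        self.spk_proj = nn.Linear(spk_dim, dim)

    def forward(self, x, spk):                       # x: (B, dim, T), spk: (B, spk_dim)
        x = self.up(x)                               # 2x upsampling in time
        return x + self.spk_proj(spk).unsqueeze(-1)  # inject target speaker embedding

content = torch.randn(1, 256, 75)  # downsampled HuBERT+F0 content representation
speaker = torch.randn(1, 192)      # target speaker embedding

out = VCDecoderStage(dim=256, spk_dim=192)(content, speaker)
print(out.shape)  # torch.Size([1, 256, 150])
```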
This kills three birds with one stone: it makes up for the shortage of real data, it sidesteps copyright and privacy risks, and it relieves data sparsity (rare pronunciations, particular accents or intonations).
HAM-TTS was trained on combinations of real and synthetic data in different proportions and sizes; performance improved most when the two were mixed.
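For illustration only, assembling such mixtures amounts to sampling from both pools at several ratios; file names and ratios below are made up, not the paper's actual proportions:

```python
import random

real      = [f"real_{i}.wav" for i in range(8000)]
synthetic = [f"synth_{i}.wav" for i in range(8000)]

def make_mixture(n_total, synth_ratio, seed=0):
    """Build one training set with the given fraction of synthetic utterances."""
    rng = random.Random(seed)
    n_synth = int(n_total * synth_ratio)
    return rng.sample(real, n_total - n_synth) + rng.sample(synthetic, n_synth)

for ratio in (0.0, 0.25, 0.5):
    mix = make_mixture(8000, ratio)
    print(f"synthetic ratio {ratio:.2f}: {len(mix)} files")
```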
Geely's voice model has reached SOTA. How should we interpret it?
Geely is using its algorithmic muscle to attack the smart cockpit's corner cases, long ignored by other manufacturers, with the goal of improving the "last mile" of the smart-car experience.
This part of R&D is the most time- and labor-intensive and demands the strongest technical capability:
you must understand not only the strengths of the most advanced models but also their shortcomings, and propose targeted improvements.
If the AI large model were a book, most carmakers would feel lost after the preface; Geely not only reads it through but writes annotations in the margins.
Moreover, Geely researchers are the paper's actual first authors, and most of the team are Geely scientists; there is no dispute over who "owns" the Xingrui voice model.
In an automotive industry where "in-house development" has been redefined again and again, Geely is a breath of fresh air.
Follow this line of thought and many more examples turn up.
For example, the Geely Xingrui AI large model system comprises three base models: a language model, a multimodal model, and a digital-twin model. From these are derived the NLP language model, the NPDS R&D model, the multimodal perception model, the multimodal generation model, the AI DRIVE model, the digital-life model, and more, forming the AI foundation of the entire smart car.
In computing power, for example, the total cloud compute of Geely's Smart Computing Center has grown from 8.1 quadrillion operations per second last year to 10.2 quadrillion.
Behind the Xingrui voice model is Geely's "technological explosion": algorithmic capability, large-model systematization, and data capability that lead the industry and offer it a new option.
This is Geely's standout achievement in intelligence, following its successful start in electrification.
But Geely's overall development goes further. In recent years it has invested not only in core technologies tied to the automotive business but also, continually, in the broader technology base; in the most fundamental breakthroughs, such as satellites, chips, and operating systems, Geely's strength is increasingly visible.
It’s time to re-understand Geely.
Paper address: https://arxiv.org/abs/2403.05989
- End -