Jia Lei, the heavyweight who returned to Baidu, serves up a dish of AI voice technology
How Baidu reduced its error rate by 30%
Text | Camel & Liu Fangping
On November 28, the Baidu Brain Speech Capability Engine Forum was held in Beijing. At the forum, Baidu CTO Wang Haifeng announced that Baidu's speech technology has exceeded 10 billion daily calls, ranking first in China.
Subsequently, Jia Lei, chief architect of Baidu Voice, who announced his return to Baidu in August this year, released a new technology for intelligent voice interaction: integrated end-to-end modeling of speech enhancement and acoustic modeling based on complex-valued convolutional neural networks (CNNs).
Jia Lei said that this technology abandons the various prior assumptions of the digital signal processing and speech recognition disciplines, removes the barriers between them, and performs integrated end-to-end modeling directly. Compared with traditional microphone-array algorithms based on digital signal processing, it reduces the error rate by more than 30%; Google, which uses a similar approach, has reported only a 16% relative error rate reduction. The method has already been integrated into Baidu's latest Honghu chip.
Three hardware products based on Honghu voice chips equipped with this voice technology were also released at the forum:
Chip module: DSP chip + Flash
Android development board: DSP chip + RK3399
RTOS development board: DSP chip + ESP32
In addition, Jia Lei described the end-to-end, hardware-software integrated far-field voice interaction solution based on the Honghu voice chip, along with three newly released scenario solutions for smart homes, smart cars, and smart IoT devices.
"Today I am conservatively reporting a performance improvement of more than 30%, which is very conservative. In the future, this technology will once again significantly refresh people's understanding of far-field speech. My own judgment is that within three years, the recognition rate of far-field speech technology will reach the recognition rate of near-field speech technology, because with this technology, the problem of far-field recognition can be basically solved. This is a major interdisciplinary innovation."
Talking about Baidu's recent breakthrough in voice technology, Baidu Voice chief architect Jia Lei grew visibly excited.
Let’s take a look at how Baidu achieved a 30% reduction in error rate.
Let’s start with the traditional method.
At present, speech recognition technology performs well in high signal-to-noise ratio scenarios, but often performs unstably in low signal-to-noise ratio scenarios. Far-field speech recognition is a typical low signal-to-noise ratio scenario. In a far-field environment, the target sound source is far away from the microphone, which will cause the target signal to be severely attenuated. In addition, the environment is noisy and there are many interfering signals, which ultimately leads to a low signal-to-noise ratio and poor speech recognition performance. A user standing 3 meters or even 5 meters away and interacting with a smart speaker is a typical far-field speech recognition application scenario.
Traditionally, microphone arrays are used as the pickup, and multi-channel speech signal processing is used to enhance the target signal and thereby improve far-field speech recognition accuracy.
At present, the multi-channel speech recognition systems used in most smart speaker products on sale consist of a front-end enhancement module and a back-end speech recognition acoustic modeling module connected in series:
The front-end enhancement module usually includes direction of arrival estimation (DOA) and beamforming (BF). DOA technology is mainly used to estimate the direction of the target sound source, while BF technology uses the azimuth information of the target sound source to enhance the target signal and suppress the interference signal.
The back-end speech recognition acoustic modeling module then performs deep learning modeling on this enhanced speech signal. This modeling process is essentially the same as near-field speech recognition modeling on mobile phones, except that the input is not a near-field signal collected by the phone's microphone but a signal enhanced by microphone-array digital signal processing.
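To make this serial structure concrete, here is a minimal sketch in Python/NumPy of how such a pipeline is typically wired together. The crude DOA search, delay-and-sum beamformer, microphone spacing, and acoustic-model stub are all illustrative assumptions, not any vendor's production code.

```python
# Minimal sketch of a conventional two-stage far-field pipeline:
# front-end enhancement (DOA + delay-and-sum beamforming) feeding a
# separately trained acoustic model. All names and dimensions are
# illustrative placeholders.
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s
MIC_SPACING = 0.035      # m, uniform linear array (assumed)
SAMPLE_RATE = 16000
N_FFT = 512

def steering_vector(angle_rad, n_mics, freqs):
    """Phase delays for a plane wave hitting a uniform linear array."""
    delays = np.arange(n_mics) * MIC_SPACING * np.cos(angle_rad) / SPEED_OF_SOUND
    return np.exp(-2j * np.pi * freqs[None, :] * delays[:, None])  # (mics, freq)

def estimate_doa(stft, freqs, n_angles=36):
    """Crude DOA: pick the steering angle with maximum beamformed power."""
    angles = np.linspace(0, np.pi, n_angles)
    powers = []
    for a in angles:
        w = steering_vector(a, stft.shape[0], freqs)
        beamformed = np.sum(np.conj(w)[:, :, None] * stft, axis=0)  # (freq, frames)
        powers.append(np.sum(np.abs(beamformed) ** 2))
    return angles[int(np.argmax(powers))]

def delay_and_sum(stft, angle_rad, freqs):
    """Enhance the target direction by aligning and averaging the channels."""
    w = steering_vector(angle_rad, stft.shape[0], freqs)
    return np.mean(np.conj(w)[:, :, None] * stft, axis=0)  # (freq, frames)

def acoustic_model(log_mags):
    """Placeholder for the separately trained recognition acoustic model."""
    return log_mags  # a real system would emit label posteriors here

# Multi-channel STFT: (mics, freq_bins, frames); random data stands in for audio.
rng = np.random.default_rng(0)
stft = rng.standard_normal((4, N_FFT // 2 + 1, 100)) + 1j * rng.standard_normal((4, N_FFT // 2 + 1, 100))
freqs = np.fft.rfftfreq(N_FFT, d=1.0 / SAMPLE_RATE)

doa = estimate_doa(stft, freqs)                          # stage 1: localization
enhanced = delay_and_sum(stft, doa, freqs)               # stage 2: beamforming
posteriors = acoustic_model(np.log1p(np.abs(enhanced)))  # stage 3: recognition
```

The key point is that the three stages are designed and tuned separately, which is exactly the coupling the integrated approach described later removes.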
In recent years, front-end speech enhancement has gradually begun to use deep learning for direction of arrival estimation (DOA) and beamforming (BF). Many papers and products have described replacing traditional digital signal processing in microphone array systems with deep learning, and some improvements have been achieved.
However, this approach has several problems:
1) Beam-based pickup has inherent limitations. The speech enhancement technologies above mostly use MSE-based optimization criteria, which make speech inside the beam sound clearer and background noise outside the beam sound quieter. But auditory perception and recognition accuracy are not fully aligned, and when the interfering noise is itself speech (for example, when a TV and the speaker are in the same direction), performance drops sharply.
2) The optimization goals of the enhancement and recognition modules are inconsistent. The front-end speech enhancement module is optimized independently of the back-end recognition module, so its objective is not aligned with the final objective of the recognition system, and its output is therefore likely to be suboptimal with respect to that final goal (a toy contrast of the two criteria is sketched after this list).
3) The real product environment is complex, and traditional methods hurt the user experience. Because sound source environments in real products are complex, most products first determine the direction of the sound source with DOA and then form a beam in that direction, improving the signal-to-noise ratio inside the beam and suppressing noise outside it. This makes the whole system heavily dependent on the accuracy of sound source localization. At the same time, when the user says the wake-up word or a voice command for the first time, that first utterance can rarely benefit from accurate beam information (for example, the user may have moved to a different direction after the previous sentence), which hurts the first wake-up rate and the first-sentence recognition rate.
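As a toy illustration of point 2 (hypothetical code, not from the forum material): the front end and the back end minimize two different quantities, and lowering one does not guarantee lowering the other.

```python
# Toy contrast of the two training criteria in the serial design:
# the front end minimizes signal-level MSE, the recognizer minimizes a
# recognition loss (cross-entropy here). They are optimized by different
# modules, trained separately.
import numpy as np

def enhancement_loss(enhanced_spec, clean_spec):
    """Front-end criterion: mean squared error against the clean reference."""
    return np.mean((enhanced_spec - clean_spec) ** 2)

def recognition_loss(label_posteriors, target_labels):
    """Back-end criterion: cross-entropy of the true labels under the model."""
    eps = 1e-9
    picked = label_posteriors[np.arange(len(target_labels)), target_labels]
    return -np.mean(np.log(picked + eps))

# Tiny demo with random stand-in data.
rng = np.random.default_rng(0)
post = rng.random((5, 10))
post /= post.sum(axis=1, keepdims=True)
print(enhancement_loss(rng.random((5, 257)), rng.random((5, 257))),
      recognition_loss(post, rng.integers(0, 10, size=5)))

# End-to-end modeling instead backpropagates only the recognition loss
# all the way into the multi-channel front end.
```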
In 2017, the Google team first proposed using neural networks to solve the integrated modeling problem of front-end speech enhancement and speech acoustic modeling.
Starting from the filter-and-sum method of signal processing, their paper first derives a model structure in the time domain, and then a frequency-domain structure, FCLP (Factored Complex Linear Projection), which greatly reduces computation compared with the time-domain model.
This structure extracts features in multiple directions from multi-channel speech through spatial filtering and frequency domain filtering, and then sends the features to the back-end recognition model to ultimately achieve joint optimization of the network.
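As a rough, hypothetical sketch of that factored idea (made-up shapes and filter counts, not the published FCLP implementation): a per-direction complex spatial filtering stage across the microphone channels, followed by a complex spectral filtering stage that collapses each beam into real-valued features for the recognizer.

```python
# Hypothetical sketch of an FCLP-style factored front end: a spatial
# filtering stage per look direction followed by a spectral (CLP) stage.
import numpy as np

rng = np.random.default_rng(1)
n_mics, n_freq, n_frames = 4, 257, 100
n_directions, n_spectral_filters = 8, 16   # fewer than 10 look directions, as in that regime

# Multi-channel complex STFT of the input audio (random stand-in here).
stft = rng.standard_normal((n_mics, n_freq, n_frames)) + 1j * rng.standard_normal((n_mics, n_freq, n_frames))

# Stage 1: spatial filtering -- one learned complex weight per (direction, mic, freq).
spatial_w = rng.standard_normal((n_directions, n_mics, n_freq)) + 1j * rng.standard_normal((n_directions, n_mics, n_freq))
beams = np.einsum('dmf,mft->dft', np.conj(spatial_w), stft)        # (directions, freq, frames)

# Stage 2: spectral filtering (complex linear projection) -- project each
# beam's spectrum onto learned complex filters, then take log-magnitude.
spectral_w = rng.standard_normal((n_spectral_filters, n_freq)) + 1j * rng.standard_normal((n_spectral_filters, n_freq))
projected = np.einsum('kf,dft->dkt', np.conj(spectral_w), beams)   # (directions, filters, frames)
features = np.log1p(np.abs(projected))                             # real features for the recognizer

# `features` (directions x filters x frames) would be flattened per frame
# and fed into the back-end acoustic model for joint training.
```

Note how the spatial stage combines channels only within each frequency band, which is the cross-band limitation discussed next.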
The FCLP structure proposed by Google is still based on signal processing methods. It originates from the delay and sum filter and uses a deep learning network to simulate and approximate signal beams. Therefore, it is also limited by some prior assumptions of signal processing methods.
For example, the lowest layer of FCLP does not mine correlations between frequency bands, so the multi-channel microphone information is not fully exploited, which limits the accuracy of the resulting deep learning model.
For another example, the number of beam look directions is fixed at fewer than 10, mirroring the beam-space division used in digital signal processing. Forcing the deep learning model structure to align with the digital signal processing pipeline in this way restricts how the model structure can evolve and constrains further innovation in this direction.
Finally, Google reported that this method achieved a 16% relative error rate reduction compared with the traditional microphone array algorithm based on digital signal processing.
Baidu adopted a similar idea, namely end-to-end modeling that integrates speech enhancement and speech acoustic modeling, but built it on a complex-valued convolutional neural network.
Compared with Google's method, Baidu's approach abandons the prior knowledge of digital signal processing entirely: the model structure is fully decoupled from that discipline, giving full play to the multi-layer structure and multi-channel feature extraction capability of CNNs.
Specifically, the model has a complex-valued CNN at its core: the bottom layers use complex CNNs to mine the essential characteristics of the signal, and complex CNN layers, complex fully connected layers, and further CNN layers directly extract multi-scale, multi-level information from the original multi-channel speech signal, fully mining the correlation and coupling information between frequency bands.
While retaining the phase information of the original features, the model simultaneously performs front-end sound source localization, beamforming, and enhanced feature extraction. The features abstracted by the CNN layers at the bottom of the model are fed directly into the end-to-end streaming multi-layer truncated attention model (SMLTA), thus realizing integrated end-to-end modeling from the original multi-channel microphone signals to the recognized text.
The entire network is optimized solely against the speech recognition criterion, and the model parameters are tuned with the goal of improving recognition accuracy.
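As a minimal sketch of the kind of building block such a front end could use (a hypothetical PyTorch example, not Baidu's released model), a complex-valued convolution can be realized with two real convolutions over the real and imaginary parts of the multi-channel spectrogram, so phase information is carried through the network and only the recognition loss drives training:

```python
# Hypothetical complex-valued convolution block; not Baidu's released code.
import torch
import torch.nn as nn

class ComplexConv2d(nn.Module):
    """Complex convolution via two real convolutions:
    (W_r + i W_i) * (x_r + i x_i) = (W_r x_r - W_i x_i) + i (W_r x_i + W_i x_r)."""
    def __init__(self, in_ch, out_ch, kernel_size, stride=1, padding=0):
        super().__init__()
        self.conv_r = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)
        self.conv_i = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)

    def forward(self, x_r, x_i):
        out_r = self.conv_r(x_r) - self.conv_i(x_i)
        out_i = self.conv_r(x_i) + self.conv_i(x_r)
        return out_r, out_i

class ComplexFrontEnd(nn.Module):
    """Stack of complex convolutions over the raw multi-channel STFT,
    collapsed to magnitude features for a downstream attention recognizer."""
    def __init__(self, n_mics=4):
        super().__init__()
        self.conv1 = ComplexConv2d(n_mics, 32, kernel_size=3, padding=1)
        self.conv2 = ComplexConv2d(32, 64, kernel_size=3, padding=1)

    def forward(self, x_r, x_i):  # (batch, mics, freq, frames)
        h_r, h_i = self.conv1(x_r, x_i)
        # Split ReLU on real/imaginary parts -- a common simplification.
        h_r, h_i = self.conv2(torch.relu(h_r), torch.relu(h_i))
        return torch.sqrt(h_r ** 2 + h_i ** 2 + 1e-8)  # magnitude features

# Joint training: the only loss is the recognition loss, backpropagated
# through the recognizer into this front end.
front_end = ComplexFrontEnd()
stft_r = torch.randn(2, 4, 257, 100)   # real part of the multi-channel STFT
stft_i = torch.randn(2, 4, 257, 100)   # imaginary part
features = front_end(stft_r, stft_i)   # (2, 64, 257, 100), fed to the recognizer
```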
Jia Lei said: "Our model can extract the essential characteristics of biological signals. In contrast, Google's system only relates the information in corresponding frequency bands of the two microphone signals and does not mine the information between frequency bands. This is also the reason why Google's recognition accuracy is comparatively lower."
As mentioned earlier, compared with the approach used in Baidu's smart speaker products currently online, which connects a front-end enhancement module based on traditional digital signal processing in series with a back-end speech recognition acoustic modeling process, this integrated end-to-end modeling technology of speech enhancement and acoustic modeling based on complex-valued convolutional neural networks reduces the error rate by more than 30%.
In addition, Jia Lei listed five features of this end-to-end speech recognition technology in his speech.
It is worth mentioning here that Baidu's integrated modeling solution has already been built into Baidu's latest Honghu chip, and the network occupies less than 200 KB of memory.
This 30% reduction is also the largest product-level performance improvement in recent deep-learning-based far-field recognition technology.
Jia Lei believes that this reveals that "end-to-end modeling" will be an important development direction for far-field speech recognition industry applications.
Jia Lei then added:
“Human voice interaction is essentially far-field. Near-field voice interaction, with a mobile phone microphone held next to the mouth, was just a compromise people made when they first did speech recognition because they could not solve the far-field problem. If far-field voice technology matures in the next three years, all devices will be woken by far-field voice: any home appliance or car device will accept continuous voice input after wake-up and carry voice interaction functions for queries in its own domain. So the maturity of this technology means that far-field voice recognition will enter thousands of households, and all the devices we see will be based on far-field voice interaction. Combined with the development of chips, speech recognition and speech synthesis will be integrated to solve human-terminal interaction. I think that is worth looking forward to.”
When the reporter asked Dr. Jia Lei whether the relevant technology had been written into a paper for publication, Jia Lei said hurriedly, "I am too busy and have no time to write a paper." This may be what a national model worker looks like, too busy to write a paper on his achievements.