What are the speech recognition algorithms?
This article lists several different speech recognition algorithms.
The first method: dynamic time warping (DTW)
Although it involves a large amount of computation, this method is technically simple and achieves high recognition accuracy, and it remains a mainstream method in isolated word recognition.
Many improved DTW algorithms have been proposed for small-vocabulary, isolated word recognition systems, for example a frequency-scale DTW algorithm for isolated word recognition.
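For concreteness, here is a minimal DTW sketch in Python with NumPy. The template-matching setup, the 13-dimensional stand-in feature frames, and the Euclidean frame distance are assumptions made for the example, not details from any particular system:

```python
import numpy as np

def dtw_distance(a, b):
    """Cumulative DTW alignment cost between two feature sequences.

    a, b: arrays of shape (n_frames, n_features), e.g. MFCC frames.
    A lower cost means the sequences are more similar.
    """
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # local frame distance
            # extend the cheapest of the three allowed alignment moves
            cost[i, j] = d + min(cost[i - 1, j],
                                 cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return cost[n, m]

# Toy isolated word recognition: compare an utterance against two templates.
rng = np.random.default_rng(0)
template_yes = rng.normal(size=(40, 13))
template_no = rng.normal(size=(50, 13))
utterance = template_yes + 0.1 * rng.normal(size=(40, 13))
print(dtw_distance(utterance, template_yes) <
      dtw_distance(utterance, template_no))   # True: closer to "yes"
```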
The second method: the hidden Markov model (HMM), a parametric model
This algorithm is mainly used in large-vocabulary speech recognition systems. It requires more training data, longer training and recognition times, and more memory than the other methods.
Generally speaking, the continuous hidden Markov model requires more computation than the discrete hidden Markov model, but achieves a higher recognition rate.
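As an illustration of how an HMM scores an observation sequence, here is a minimal sketch of the scaled forward algorithm for a discrete HMM in Python with NumPy; the two-state, three-symbol parameters are invented for the example, not trained on speech:

```python
import numpy as np

def forward_loglik(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM.

    obs: sequence of symbol indices (e.g. VQ codebook indices)
    pi:  initial state probabilities, shape (n_states,)
    A:   state transition matrix, shape (n_states, n_states)
    B:   emission probabilities, shape (n_states, n_symbols)
    Scaling at each step avoids numerical underflow on long sequences.
    """
    alpha = pi * B[:, obs[0]]
    loglik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = (alpha @ A) * B[:, obs[t]]  # forward recursion
        scale = alpha.sum()
        loglik += np.log(scale)
        alpha /= scale
    return loglik

# Toy model; in a recognizer one HMM is trained per word, and the word
# whose model assigns the highest likelihood to the utterance wins.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(forward_loglik([0, 1, 2, 2], pi, A, B))
```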
The third method: vector quantization (VQ), a non-parametric model
This method requires very little training data, short training and recognition times, and little working memory.
However, the recognition performance of the VQ algorithm for large vocabulary speech recognition is not as good as that of HMM.
It has been applied with good results in isolated character (word) speech recognition systems.
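A minimal sketch of VQ-based isolated word recognition using SciPy's k-means; the codebook size, the two-word vocabulary, and the random stand-ins for feature frames are assumptions made for the example:

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

# Training: learn one small codebook per word from its feature frames.
rng = np.random.default_rng(1)
frames_yes = rng.normal(loc=0.0, size=(200, 13))  # stand-ins for MFCC frames
frames_no = rng.normal(loc=2.0, size=(200, 13))
codebook_yes, _ = kmeans(frames_yes, 8)           # 8-entry codebook
codebook_no, _ = kmeans(frames_no, 8)

def vq_distortion(frames, codebook):
    """Average quantization error of the frames against a codebook."""
    _, dists = vq(frames, codebook)
    return dists.mean()

# Recognition: pick the word whose codebook quantizes the utterance best.
utterance = rng.normal(loc=0.0, size=(50, 13))
print("yes" if vq_distortion(utterance, codebook_yes)
             < vq_distortion(utterance, codebook_no) else "no")
```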
In addition, there are algorithms based on artificial neural networks (ANNs) and hybrid algorithms, such as the ANN/HMM and FSVQ/HMM methods.
Further speech recognition algorithms include the following:
Convolutional Neural Networks
Deep Learning Neural Networks
BP Neural Network
RBF Neural Network
Fuzzy Clustering Neural Network
Improved TS fuzzy neural network
Recurrent Neural Networks
Wavelet Neural Network
Chaotic Neural Network
Wavelet Chaotic Neural Network
Neural Networks and Genetic Algorithms
Dynamically Optimizing Neural Networks
K-means and neural network ensembles
Combination of HMM and Self-organizing Neural Network
Orthogonal Basis Function Counter-propagation Process Neural Network
HMM and a new feed-forward neural network
Random mapping of feature space
SVM multi-class classification algorithm
Normalization of feature parameters
Multiband spectral subtraction
Independent Perception Theory
Segmented Fuzzy Clustering Algorithm VQ-HMM
Optimized competition algorithm
Double Gaussian GMM feature parameters
MFCC and GMM
MFCCs and PNN
SBC and SMM
Mel cepstral coefficients and vector quantization
DTW
LPCC and MFCC
Hidden Markov Model (HMM)
Speech Recognition Feature Extraction Methods
Speech recognition has the following requirements for feature parameters:
1. They must convert speech signals into feature vectors that a computer can process
2. They should conform to, or approximate, the auditory perception characteristics of the human ear
3. They should enhance the speech signal and suppress non-speech signals to a certain extent
The commonly used feature extraction methods are as follows:
(1) Linear Prediction Coefficients (LPC)
LPC models human speech production as a cascade of short vocal tract tubes, and assumes that the system's transfer function is that of an all-pole digital filter; 12 to 16 poles are usually enough to describe the characteristics of a speech signal. The speech sample at time n can therefore be approximated by a linear combination of the samples at previous times. The LPC coefficients are obtained by minimizing the mean squared error (MSE) between the actual speech samples and the linearly predicted ones.
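A minimal sketch of this procedure, the autocorrelation method with the Levinson-Durbin recursion, in Python with NumPy; the model order and the toy second-order test signal are assumptions made for the example:

```python
import numpy as np

def lpc_coefficients(frame, order=12):
    """LPC via the autocorrelation method and Levinson-Durbin recursion.

    Returns predictor coefficients c[1..order] such that
    x[n] ~ c[1]*x[n-1] + ... + c[order]*x[n-order] minimizes the MSE.
    """
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err  # reflection coeff.
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return -a[1:]

# Toy usage: a signal synthesized from a known 2-pole (all-pole) filter;
# the recovered coefficients should be close to the true [1.3, -0.6].
rng = np.random.default_rng(0)
x = np.zeros(400)
e = rng.normal(size=400)
for n in range(2, 400):
    x[n] = 1.3 * x[n - 1] - 0.6 * x[n - 2] + e[n]
print(lpc_coefficients(x, order=2))
```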
(2) Perceptual Linear Predictive (PLP)
PLP is a characteristic parameter based on an auditory model. It is equivalent in form to LPC: both are sets of coefficients of an all-pole prediction polynomial. The difference is that PLP applies knowledge of human hearing to the spectrum analysis: the input speech signal is processed by a model of human hearing rather than used as a raw time domain signal as in LPC. The advantage is that the extracted features are more robust to noise.
(3) Tandem and Bottleneck features
These are two types of features extracted using neural networks. Tandem features are obtained by reducing the dimensionality of the posterior probability vector produced at the output layer of the network and concatenating it with spectral features such as MFCC or PLP. Bottleneck features are extracted with a neural network of a special structure in which one hidden layer has far fewer nodes than the others; this narrow layer is called the Bottleneck layer, and its output is the Bottleneck feature.
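For illustration, a hypothetical Bottleneck network sketched in Python with PyTorch; the layer sizes (440-dimensional stacked-frame input, a 42-node Bottleneck layer, 3000 classification targets) are invented for the example:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 440-dim stacked-frame input, 42-dim Bottleneck layer,
# 3000 phone-state targets. Only the shape of the network matters here.
net = nn.Sequential(
    nn.Linear(440, 1024), nn.Sigmoid(),
    nn.Linear(1024, 42), nn.Sigmoid(),   # the narrow Bottleneck layer
    nn.Linear(42, 1024), nn.Sigmoid(),
    nn.Linear(1024, 3000),               # classification output layer
)

# After training the whole net as a classifier, the Bottleneck feature is
# the 42-dim activation, i.e. the output of the first four modules.
extractor = net[:4]
frames = torch.randn(100, 440)       # 100 stacked-frame input vectors
bn_features = extractor(frames)
print(bn_features.shape)             # torch.Size([100, 42])
```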
(4) Fbank features based on filter banks (Filterbank)
Also known as MFSC, Fbank feature extraction is equivalent to MFCC without the final discrete cosine transform (DCT). Compared with MFCC features, Fbank features retain more of the original speech information.
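A minimal sketch of both pipelines using librosa, with a synthetic tone standing in for recorded speech; the 25 ms frame, 10 ms hop, 40 mel bands, and 13 cepstral coefficients are common but assumed values:

```python
import numpy as np
import librosa

sr = 16000
y = np.sin(2 * np.pi * 300 * np.arange(sr) / sr)  # 1 s test tone

# Fbank / MFSC: mel filterbank energies followed by a log, with no DCT.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                     hop_length=160, n_mels=40)
fbank = np.log(mel + 1e-10)                       # shape (40, n_frames)

# MFCC: the same pipeline plus a final DCT, truncated to 13 coefficients.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_fft=400,
                            hop_length=160, n_mfcc=13)
print(fbank.shape, mfcc.shape)
```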
(5) Linear Predictive Cepstral Coefficient (LPCC)
LPCC is a characteristic parameter based on the vocal tract model. It discards the excitation information of the signal generation process, so a dozen or so cepstral coefficients suffice to represent the formant characteristics, which is why it achieves good performance in speech recognition.
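A minimal sketch of the standard LPC-to-cepstrum recursion in Python with NumPy, reusing the toy second-order predictor from the LPC example above; note that sign conventions for the coefficients vary between texts:

```python
import numpy as np

def lpc_to_lpcc(a, n_ceps):
    """Convert LPC predictor coefficients a[1..p] to cepstral coefficients.

    Standard recursion: c[n] = a[n] + sum_{k=1}^{n-1} (k/n) * c[k] * a[n-k],
    with a[n] taken as 0 for n > p.
    """
    p = len(a)
    a = np.concatenate(([0.0], a))      # shift to 1-based indexing
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        c[n] = a[n] if n <= p else 0.0
        for k in range(1, n):
            if 0 < n - k <= p:
                c[n] += (k / n) * c[k] * a[n - k]
    return c[1:]

# A dozen cepstral coefficients from the toy 2nd-order predictor above.
print(lpc_to_lpcc(np.array([1.3, -0.6]), n_ceps=12))
```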
(6) Mel Frequency Cepstrum Coefficient (MFCC)
Based on the hearing characteristics of the human ear, MFCC divides the frequency bands at equal spacing on the mel scale. The logarithmic relationship between the mel scale and physical frequency matches the hearing characteristics of the human ear, so it gives the speech signal a better representation. MFCC was introduced by Davis and Mermelstein in 1980 and has been a standout in the field of speech recognition ever since.
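The mel scale mentioned above maps a physical frequency f in Hz to m = 2595 log10(1 + f / 700) mel. A minimal sketch of that mapping and of the final DCT step of MFCC extraction; the random stand-in filterbank energies and the choice of 13 coefficients are assumptions for the example:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    """Mel scale: roughly linear below 1 kHz, logarithmic above it."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

print(hz_to_mel(1000.0))   # ~1000 mel, by construction of the scale

# Final MFCC step: a DCT of the log mel filterbank (Fbank) energies,
# keeping the first 13 coefficients. Random stand-in data for illustration:
log_mel = np.random.default_rng(0).normal(size=(40, 200))  # (n_mels, frames)
mfcc = dct(log_mel, type=2, axis=0, norm="ortho")[:13]
print(mfcc.shape)          # (13, 200)
```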
Q: Why is MFCC so popular?
People produce sound through the vocal tract, and the shape of the vocal tract, formed by the tongue, teeth, and so on, determines what sound comes out. If we can determine this shape accurately, we can describe the resulting phoneme accurately. The shape of the vocal tract shows up in the envelope of the short-time power spectrum of speech, and MFCC is a feature that accurately describes this envelope.
Spectrogram
When processing speech signals, the choice of representation matters, because different representations reveal different information, and the spectrogram is the representation most conducive to observation and understanding.
As the figure above shows, the speech is divided into many frames, and each frame corresponds to a spectrum (computed by a short-time FFT) that represents the relationship between frequency and energy. In practice three kinds of spectrum diagram are used: the linear amplitude spectrum, the logarithmic amplitude spectrum, and the auto-power spectrum. In the logarithmic amplitude spectrum the amplitude of each spectral line is taken logarithmically, so the vertical axis is in dB (decibels); this transformation raises the low-amplitude components relative to the high-amplitude ones, making it possible to observe periodic signals hidden in low-amplitude noise.
First, the spectrum of one frame of speech is drawn in coordinates, as in figure (a). Rotating it 90 degrees gives figure (b), and mapping the amplitudes to grayscale gives figure (c). The point of this operation is to add a time dimension, yielding a spectrum that changes over time: the spectrogram that describes the speech signal. In this way the spectrum of a whole segment of speech, rather than a single frame, can be displayed, and both static and dynamic information can be seen intuitively.
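A minimal sketch of this frame-FFT-log pipeline with SciPy, using a pure tone as a stand-in for speech; the 25 ms window and 10 ms hop are assumed values:

```python
import numpy as np
from scipy.signal import spectrogram

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 300 * t)     # 1 s, 300 Hz tone as a stand-in

# Frame the signal, FFT each frame, and take the log magnitude in dB.
f, frames, Sxx = spectrogram(x, fs=fs, nperseg=400, noverlap=240)
log_spec = 10 * np.log10(Sxx + 1e-12)   # dB scale, as described above
print(log_spec.shape)   # (n_freq_bins, n_frames): frequency versus time
```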
Cepstrum Analysis
Below is a spectrum of speech. The peaks represent the main frequency components of the speech; we call them formants. Formants carry the identifying properties of a sound and can be used to distinguish different sounds, so we need to extract them: not only the positions of the formants but also the way they change. In other words, we extract the spectral envelope, a smooth curve connecting the formant peaks.
As the figure above shows, the original spectrum consists of two parts: the envelope and the spectral details. We therefore need to separate the two to obtain the envelope, decomposing as in the figure below: given log X[k], we want log H[k] and log E[k] satisfying log X[k] = log H[k] + log E[k].
The envelope consists mainly of the slowly varying (low-frequency) components of the log spectrum, while the rapidly varying (high-frequency) components are mainly the spectral details; their superposition is the original spectrum. That is, treating the log spectrum x[k] as a signal, h[k] is its low-frequency part, so passing x[k] through a low-pass filter yields h[k], the envelope of the spectrum.
The professional term for this deconvolution process is homomorphic signal processing (another approach is based on linear transformation). Speech itself can be regarded as the glottal excitation convolved with the impulse response of the vocal tract, which carries the speaker's personal characteristics and the semantic information and appears as the low-frequency components of the log spectrum; in the time domain this relationship takes the form of a convolution. To separate the two and obtain the vocal tract resonance characteristics and the fundamental period, this nonlinear problem must be converted into a linear one. The first step is an FFT, which turns the convolution into a product (convolution in the time domain equals a product in the frequency domain); the second step takes the logarithm, turning the multiplicative signal into an additive one; the third step is an inverse transform, mapping the additive log spectrum into a new time-like domain. Although both the start and the end of this chain are time domain sequences, the discrete domains they live in are clearly different, so the latter is called the cepstral (quefrency) domain. The calculation process is shown in the figure below.
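A minimal sketch of this cepstral separation in Python with NumPy; the random frame standing in for speech and the liftering cutoff of 30 quefrency bins are assumptions made for the example:

```python
import numpy as np

def spectral_envelope(frame, n_lifter=30):
    """Estimate the spectral envelope log|H[k]| via the real cepstrum.

    FFT -> log magnitude -> inverse FFT gives the cepstrum; keeping only
    the low-quefrency coefficients (liftering) and transforming back
    yields the smooth envelope of the log spectrum.
    """
    log_spec = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)  # log|X[k]|
    cepstrum = np.fft.irfft(log_spec)        # envelope + details, additive
    lifter = np.zeros_like(cepstrum)
    lifter[:n_lifter] = 1.0
    lifter[-(n_lifter - 1):] = 1.0           # keep symmetric low quefrencies
    envelope = np.fft.rfft(cepstrum * lifter).real   # ~ log|H[k]|
    return log_spec, envelope

rng = np.random.default_rng(0)
frame = rng.normal(size=512) * np.hanning(512)   # stand-in speech frame
log_spec, env = spectral_envelope(frame)
print(log_spec.shape, env.shape)                 # both (257,)
```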