What are the speech recognition algorithms?
This article lists several different speech recognition algorithms.
The first method: dynamic time warping (DTW)
Although it involves a large amount of computation, this method is technically simple and achieves high recognition accuracy, and it remains a mainstream method in isolated word recognition.
Many improved DTW algorithms have been proposed for small-vocabulary, isolated word recognition systems, for example a frequency-scale DTW algorithm for isolated word recognition.
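For concreteness, here is a minimal DTW sketch in Python with NumPy. The template-matching setup, the 13-dimensional stand-in feature frames, and the Euclidean frame distance are assumptions made for the example, not details from any particular system:

```python
import numpy as np

def dtw_distance(a, b):
    """Cumulative DTW alignment cost between two feature sequences.

    a, b: arrays of shape (n_frames, n_features), e.g. MFCC frames.
    A lower cost means the sequences are more similar.
    """
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # local frame distance
            # extend the cheapest of the three allowed alignment moves
            cost[i, j] = d + min(cost[i - 1, j],
                                 cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return cost[n, m]

# Toy isolated word recognition: compare an utterance against two templates.
rng = np.random.default_rng(0)
template_yes = rng.normal(size=(40, 13))
template_no = rng.normal(size=(50, 13))
utterance = template_yes + 0.1 * rng.normal(size=(40, 13))
print(dtw_distance(utterance, template_yes) <
      dtw_distance(utterance, template_no))   # True: closer to "yes"
```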
The second method: the hidden Markov model (HMM), a parametric model
This algorithm is mainly used in large-vocabulary speech recognition systems. It requires more training data, longer training and recognition times, and more memory than the other methods.
Generally speaking, the continuous hidden Markov model requires more computation than the discrete hidden Markov model, but achieves a higher recognition rate.
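As an illustration of how an HMM scores an observation sequence, here is a minimal sketch of the scaled forward algorithm for a discrete HMM in Python with NumPy; the two-state, three-symbol parameters are invented for the example, not trained on speech:

```python
import numpy as np

def forward_loglik(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM.

    obs: sequence of symbol indices (e.g. VQ codebook indices)
    pi:  initial state probabilities, shape (n_states,)
    A:   state transition matrix, shape (n_states, n_states)
    B:   emission probabilities, shape (n_states, n_symbols)
    Scaling at each step avoids numerical underflow on long sequences.
    """
    alpha = pi * B[:, obs[0]]
    loglik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = (alpha @ A) * B[:, obs[t]]  # forward recursion
        scale = alpha.sum()
        loglik += np.log(scale)
        alpha /= scale
    return loglik

# Toy model; in a recognizer one HMM is trained per word, and the word
# whose model assigns the highest likelihood to the utterance wins.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(forward_loglik([0, 1, 2, 2], pi, A, B))
```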
The third method: vector quantization (VQ), a non-parametric model
This method requires very little training data, short training and recognition times, and little working memory.
However, the recognition performance of the VQ algorithm for large vocabulary speech recognition is not as good as that of HMM.
It has been applied with good results in isolated character (word) speech recognition systems.
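A minimal sketch of VQ-based isolated word recognition using SciPy's k-means; the codebook size, the two-word vocabulary, and the random stand-ins for feature frames are assumptions made for the example:

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

# Training: learn one small codebook per word from its feature frames.
rng = np.random.default_rng(1)
frames_yes = rng.normal(loc=0.0, size=(200, 13))  # stand-ins for MFCC frames
frames_no = rng.normal(loc=2.0, size=(200, 13))
codebook_yes, _ = kmeans(frames_yes, 8)           # 8-entry codebook
codebook_no, _ = kmeans(frames_no, 8)

def vq_distortion(frames, codebook):
    """Average quantization error of the frames against a codebook."""
    _, dists = vq(frames, codebook)
    return dists.mean()

# Recognition: pick the word whose codebook quantizes the utterance best.
utterance = rng.normal(loc=0.0, size=(50, 13))
print("yes" if vq_distortion(utterance, codebook_yes)
             < vq_distortion(utterance, codebook_no) else "no")
```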
In addition, there are algorithms based on artificial neural networks (ANNs) and hybrid algorithms, such as the ANN/HMM and FSVQ/HMM methods.
Further speech recognition algorithms include the following:
Convolutional Neural Networks
Deep Learning Neural Networks
BP Neural Network
RBF Neural Network
Fuzzy Clustering Neural Network
Improved TS fuzzy neural network
Recurrent Neural Networks
Wavelet Neural Network
Chaotic Neural Network
Wavelet Chaotic Neural Network
Neural Networks and Genetic Algorithms
Dynamically Optimizing Neural Networks
K-means and neural network ensembles
Combination of HMM and Self-organizing Neural Network
Orthogonal Basis Function Counter-propagation Process Neural Network
HMM and a new feed-forward neural network
Random mapping of feature space
SVM multi-class classification algorithm
Normalization of feature parameters
Multiband spectral subtraction
Independent Perception Theory
Segmented Fuzzy Clustering Algorithm VQ-HMM
Optimized competition algorithm
Double Gaussian GMM feature parameters
MFCC and GMM
MFCCs and PNN
SBC and SMM
Mel cepstral coefficients and vector quantization
DTW
LPCC and MFCC
Hidden Markov Model (HMM)
Speech Recognition Feature Extraction Methods
Speech recognition has the following requirements for feature parameters:
1. They must convert speech signals into feature vectors that a computer can process
2. They should conform to, or approximate, the auditory perception characteristics of the human ear
3. They should enhance the speech signal and suppress non-speech signals to a certain extent
The commonly used feature extraction methods are as follows:
(1) Linear Prediction Coefficients (LPC)
LPC models human speech production as a cascade of short vocal tract tubes, and assumes that the system's transfer function is that of an all-pole digital filter; 12 to 16 poles are usually enough to describe the characteristics of a speech signal. The speech sample at time n can therefore be approximated by a linear combination of the samples at previous times. The LPC coefficients are obtained by minimizing the mean squared error (MSE) between the actual speech samples and the linearly predicted ones.
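A minimal sketch of this procedure, the autocorrelation method with the Levinson-Durbin recursion, in Python with NumPy; the model order and the toy second-order test signal are assumptions made for the example:

```python
import numpy as np

def lpc_coefficients(frame, order=12):
    """LPC via the autocorrelation method and Levinson-Durbin recursion.

    Returns predictor coefficients c[1..order] such that
    x[n] ~ c[1]*x[n-1] + ... + c[order]*x[n-order] minimizes the MSE.
    """
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err  # reflection coeff.
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return -a[1:]

# Toy usage: a signal synthesized from a known 2-pole (all-pole) filter;
# the recovered coefficients should be close to the true [1.3, -0.6].
rng = np.random.default_rng(0)
x = np.zeros(400)
e = rng.normal(size=400)
for n in range(2, 400):
    x[n] = 1.3 * x[n - 1] - 0.6 * x[n - 2] + e[n]
print(lpc_coefficients(x, order=2))
```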
(2) Perceptual Linear Predictive (PLP)
PLP is a characteristic parameter based on an auditory model. It is equivalent in form to LPC: both are sets of coefficients of an all-pole prediction polynomial. The difference is that PLP applies knowledge of human hearing to the spectrum analysis: the input speech signal is processed by a model of human hearing rather than used as a raw time domain signal as in LPC. The advantage is that the extracted features are more robust to noise.
(3) Tandem and Bottleneck features
These are two types of features extracted using neural networks. Tandem features are obtained by reducing the dimensionality of the posterior probability vector produced at the output layer of the network and concatenating it with spectral features such as MFCC or PLP. Bottleneck features are extracted with a neural network of a special structure in which one hidden layer has far fewer nodes than the others; this narrow layer is called the Bottleneck layer, and its output is the Bottleneck feature.
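For illustration, a hypothetical Bottleneck network sketched in Python with PyTorch; the layer sizes (440-dimensional stacked-frame input, a 42-node Bottleneck layer, 3000 classification targets) are invented for the example:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 440-dim stacked-frame input, 42-dim Bottleneck layer,
# 3000 phone-state targets. Only the shape of the network matters here.
net = nn.Sequential(
    nn.Linear(440, 1024), nn.Sigmoid(),
    nn.Linear(1024, 42), nn.Sigmoid(),   # the narrow Bottleneck layer
    nn.Linear(42, 1024), nn.Sigmoid(),
    nn.Linear(1024, 3000),               # classification output layer
)

# After training the whole net as a classifier, the Bottleneck feature is
# the 42-dim activation, i.e. the output of the first four modules.
extractor = net[:4]
frames = torch.randn(100, 440)       # 100 stacked-frame input vectors
bn_features = extractor(frames)
print(bn_features.shape)             # torch.Size([100, 42])
```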
(4) Fbank features based on filter banks (Filterbank)
Also known as MFSC, Fbank feature extraction is equivalent to MFCC without the final discrete cosine transform (DCT). Compared with MFCC features, Fbank features retain more of the original speech information.
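A minimal sketch of both pipelines using librosa, with a synthetic tone standing in for recorded speech; the 25 ms frame, 10 ms hop, 40 mel bands, and 13 cepstral coefficients are common but assumed values:

```python
import numpy as np
import librosa

sr = 16000
y = np.sin(2 * np.pi * 300 * np.arange(sr) / sr)  # 1 s test tone

# Fbank / MFSC: mel filterbank energies followed by a log, with no DCT.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                     hop_length=160, n_mels=40)
fbank = np.log(mel + 1e-10)                       # shape (40, n_frames)

# MFCC: the same pipeline plus a final DCT, truncated to 13 coefficients.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_fft=400,
                            hop_length=160, n_mfcc=13)
print(fbank.shape, mfcc.shape)
```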
(5) Linear Predictive Cepstral Coefficient (LPCC)
LPCC is a characteristic parameter based on the vocal tract model. It discards the excitation information of the signal generation process, so a dozen or so cepstral coefficients suffice to represent the formant characteristics, which is why it achieves good performance in speech recognition.
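A minimal sketch of the standard LPC-to-cepstrum recursion in Python with NumPy, reusing the toy second-order predictor from the LPC example above; note that sign conventions for the coefficients vary between texts:

```python
import numpy as np

def lpc_to_lpcc(a, n_ceps):
    """Convert LPC predictor coefficients a[1..p] to cepstral coefficients.

    Standard recursion: c[n] = a[n] + sum_{k=1}^{n-1} (k/n) * c[k] * a[n-k],
    with a[n] taken as 0 for n > p.
    """
    p = len(a)
    a = np.concatenate(([0.0], a))      # shift to 1-based indexing
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        c[n] = a[n] if n <= p else 0.0
        for k in range(1, n):
            if 0 < n - k <= p:
                c[n] += (k / n) * c[k] * a[n - k]
    return c[1:]

# A dozen cepstral coefficients from the toy 2nd-order predictor above.
print(lpc_to_lpcc(np.array([1.3, -0.6]), n_ceps=12))
```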
(6) Mel Frequency Cepstrum Coefficient (MFCC)
Based on the hearing characteristics of the human ear, MFCC divides the frequency bands at equal spacing on the mel scale. The logarithmic relationship between the mel scale and physical frequency matches the hearing characteristics of the human ear, so it gives the speech signal a better representation. MFCC was introduced by Davis and Mermelstein in 1980 and has been a standout in the field of speech recognition ever since.
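The mel scale mentioned above maps a physical frequency f in Hz to m = 2595 log10(1 + f / 700) mel. A minimal sketch of that mapping and of the final DCT step of MFCC extraction; the random stand-in filterbank energies and the choice of 13 coefficients are assumptions for the example:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    """Mel scale: roughly linear below 1 kHz, logarithmic above it."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

print(hz_to_mel(1000.0))   # ~1000 mel, by construction of the scale

# Final MFCC step: a DCT of the log mel filterbank (Fbank) energies,
# keeping the first 13 coefficients. Random stand-in data for illustration:
log_mel = np.random.default_rng(0).normal(size=(40, 200))  # (n_mels, frames)
mfcc = dct(log_mel, type=2, axis=0, norm="ortho")[:13]
print(mfcc.shape)          # (13, 200)
```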
Q: Why is MFCC so popular?
People produce sound through the vocal tract, and the shape of the vocal tract, formed by the tongue, teeth, and so on, determines what sound comes out. If we can determine this shape accurately, we can describe the resulting phoneme accurately. The shape of the vocal tract shows up in the envelope of the short-time power spectrum of speech, and MFCC is a feature that accurately describes this envelope.
Spectrogram
When processing speech signals, the choice of representation matters, because different representations reveal different information, and the spectrogram is the representation most conducive to observation and understanding.
As the figure above shows, the speech is divided into many frames, and each frame corresponds to a spectrum (computed by a short-time FFT) that represents the relationship between frequency and energy. In practice three kinds of spectrum diagram are used: the linear amplitude spectrum, the logarithmic amplitude spectrum, and the auto-power spectrum. In the logarithmic amplitude spectrum the amplitude of each spectral line is taken logarithmically, so the vertical axis is in dB (decibels); this transformation raises the low-amplitude components relative to the high-amplitude ones, making it possible to observe periodic signals hidden in low-amplitude noise.
First, the spectrum of one frame of speech is drawn in coordinates, as in figure (a). Rotating it 90 degrees gives figure (b), and mapping the amplitudes to grayscale gives figure (c). The point of this operation is to add a time dimension, yielding a spectrum that changes over time: the spectrogram that describes the speech signal. In this way the spectrum of a whole segment of speech, rather than a single frame, can be displayed, and both static and dynamic information can be seen intuitively.
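A minimal sketch of this frame-FFT-log pipeline with SciPy, using a pure tone as a stand-in for speech; the 25 ms window and 10 ms hop are assumed values:

```python
import numpy as np
from scipy.signal import spectrogram

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 300 * t)     # 1 s, 300 Hz tone as a stand-in

# Frame the signal, FFT each frame, and take the log magnitude in dB.
f, frames, Sxx = spectrogram(x, fs=fs, nperseg=400, noverlap=240)
log_spec = 10 * np.log10(Sxx + 1e-12)   # dB scale, as described above
print(log_spec.shape)   # (n_freq_bins, n_frames): frequency versus time
```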
Cepstrum Analysis
Below is a spectrum of speech. The peaks represent the main frequency components of the speech; we call them formants. Formants carry the identifying properties of a sound and can be used to distinguish different sounds, so we need to extract them: not only the positions of the formants but also the way they change. In other words, we extract the spectral envelope, a smooth curve connecting the formant peaks.
As the figure above shows, the original spectrum consists of two parts: the envelope and the spectral details. We therefore need to separate the two to obtain the envelope, decomposing as in the figure below: given log X[k], we want log H[k] and log E[k] satisfying log X[k] = log H[k] + log E[k].
The envelope consists mainly of the slowly varying (low-frequency) components of the log spectrum, while the rapidly varying (high-frequency) components are mainly the spectral details; their superposition is the original spectrum. That is, treating the log spectrum x[k] as a signal, h[k] is its low-frequency part, so passing x[k] through a low-pass filter yields h[k], the envelope of the spectrum.
The professional term for this deconvolution process is homomorphic signal processing (another approach is based on linear transformation). Speech itself can be regarded as the glottal excitation convolved with the impulse response of the vocal tract, which carries the speaker's personal characteristics and the semantic information and appears as the low-frequency components of the log spectrum; in the time domain this relationship takes the form of a convolution. To separate the two and obtain the vocal tract resonance characteristics and the fundamental period, this nonlinear problem must be converted into a linear one. The first step is an FFT, which turns the convolution into a product (convolution in the time domain equals a product in the frequency domain); the second step takes the logarithm, turning the multiplicative signal into an additive one; the third step is an inverse transform, mapping the additive log spectrum into a new time-like domain. Although both the start and the end of this chain are time domain sequences, the discrete domains they live in are clearly different, so the latter is called the cepstral (quefrency) domain. The calculation process is shown in the figure below.
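A minimal sketch of this cepstral separation in Python with NumPy; the random frame standing in for speech and the liftering cutoff of 30 quefrency bins are assumptions made for the example:

```python
import numpy as np

def spectral_envelope(frame, n_lifter=30):
    """Estimate the spectral envelope log|H[k]| via the real cepstrum.

    FFT -> log magnitude -> inverse FFT gives the cepstrum; keeping only
    the low-quefrency coefficients (liftering) and transforming back
    yields the smooth envelope of the log spectrum.
    """
    log_spec = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)  # log|X[k]|
    cepstrum = np.fft.irfft(log_spec)        # envelope + details, additive
    lifter = np.zeros_like(cepstrum)
    lifter[:n_lifter] = 1.0
    lifter[-(n_lifter - 1):] = 1.0           # keep symmetric low quefrencies
    envelope = np.fft.rfft(cepstrum * lifter).real   # ~ log|H[k]|
    return log_spec, envelope

rng = np.random.default_rng(0)
frame = rng.normal(size=512) * np.hanning(512)   # stand-in speech frame
log_spec, env = spectral_envelope(frame)
print(log_spec.shape, env.shape)                 # both (257,)
```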